Execo Practical Session

Overview

The goal of this practical session is to show how to use execo ([1]) to quickly and easily prototype / develop reproducible experiments. It shows the issues of experiment development faced by typical experimenters when using grid5000, and how execo can help them be more productive and get more reproducible results.

This practical session will start by showing how to use execo to interactively develop the different steps of an experiment; it will then show how to use execo to transform this prototype into a fully automatic, configurable, robust experiment engine, producing reproducible results and able to explore a much larger parameter space.

Tool: execo

execo offers a Python API for local or remote, standalone or parallel, unix process execution. It is especially well suited for quickly and easily scripting workflows of parallel/distributed operations on local or remote hosts: automating a scientific workflow, conducting computer science experiments, performing automated tests, etc. The core python package is execo. The execo_g5k package provides a set of tools and extensions for the grid5000 testbed. The execo_engine package provides tools to ease the development of computer science experiments.
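
To give a first feel for the API (a minimal sketch; the host names below are placeholders and SSH access to them is assumed), the same command can be run as a local standalone process with Process, or as parallel remote processes with Remote:

>>> from execo import Process, Remote
>>> Process('uname -a').run().stdout
>>> Remote('uname -a', ['<node1>.lyon.grid5000.fr', '<node2>.lyon.grid5000.fr']).run()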

Tutorial requirements

This tutorial requires users to know basic python and be reasonably familiar with grid5000 usage (this is not an absolute beginner session). During the tutorial, users will need to reserve a few nodes on at least two different clusters.

Detailed session program

The use case (the experiment used throughout this session to illustrate execo functionality) is to benchmark an MPI application on different grid5000 clusters.

execo introduction

todo: brief introduction to execo: link to wg slides or include directly part of wg slides here

execo installation

On a grid5000 frontend, run:

$ export http_proxy="http://proxy:3128"
$ export https_proxy="http://proxy:3128"
$ easy_install --user execo

To check that everything is setup correctly, run a simple hello world:

$ ipython
In [1]: import execo
In [2]: execo.Process("echo 'hello, world'").run().stdout
Out[2]: 'hello, world\n'

Prototype the experiment interactively

Let's start by creating a directory for the tutorial, on the frontend:

$ mkdir ~/execo_tutorial && cd ~/execo_tutorial

From now on, all commands prefixed by >>> are to be run in a python shell, preferably ipython, which is more user-friendly. All python sessions should import the execo modules:

$ ipython
>>> from execo import *
>>> from execo_g5k import *

Reserve some grid5000 compute nodes

Let's reserve some nodes on a site, for example lyon:

>>> jobs = oarsub([(OarSubmission("cluster=1/nodes=2",
...                               walltime=7200,
...                               job_type="allow_classic_ssh"), "lyon")])
>>> jobs
[(<jobid>, 'lyon')]

We can get information about the job:

>>> get_oar_job_info(*jobs[0])
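
If the job is not yet running, we can wait for it to start before asking for the nodes, using the wait_oar_job_start helper from execo_g5k:

>>> wait_oar_job_start(*jobs[0])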

And get the list of nodes:

>>> nodes = get_oar_job_nodes(*jobs[0])
>>> nodes
[Host('<node1>.lyon.grid5000.fr'),
 Host('<node2>.lyon.grid5000.fr')]

Copy some files / data to the compute nodes

todo: we do not do this step (NFS)
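
For reference, if the benchmark archive were not already accessible from the nodes (home directories are shared through NFS within a site), a minimal sketch of this step with the execo Put action could be (the archive name and remote directory are just illustrative):

>>> Put(nodes, ['NPB3.3-MPI.tar.bz2'], remote_location='execo_tutorial').run()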

Configure, compile and install the benchmark program on one node

We will use one of the NPB benchmarks, namely LU, which performs an LU decomposition to solve a linear system. Download the benchmark and extract it:

$ wget http://public.lyon.grid5000.fr/~lpouilloux/NPB3.3-MPI.tar.bz2
$ tar -xjf NPB3.3-MPI.tar.bz2

Compile it on a node (not on a frontend, because it's forbidden ;-)... and because we need mpif77):

>>> compilation = SshProcess('cd execo_tutorial/NPB3.3-MPI && make clean && make suite', nodes[0])
>>> compilation.run()

We can see a summary of the compilation process:

>>> print compilation

It should be ok:

>>> compilation.ok

We can also have a detailed look at compilation outputs if needed:

>>> print compilation.stdout
>>> print compilation.stderr

The program is ready to be run.

Run the benchmark program

We first need to retrieve the number of cores per node, using the grid5000 API, to compute the total number of cores:

>>> n_core = get_host_attributes(nodes[0])['architecture']['smt_size'] * len(nodes)
>>> cmd = 'mpirun -H %s -n %i -mca btl tcp,sm,self ~/execo_tutorial/NPB3.3-MPI/bin/lu.A.%i' % (','.join([n.address for n in nodes]), n_core, n_core)
>>> run = SshProcess(cmd, nodes[0])
>>> run.run()
  • Retrieve the results (see the sketch below)
  • Draw a simple figure showing the results (graph drawing code will be supplied, as this code is out of scope)
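
As a minimal sketch of the result retrieval step (assuming the usual NPB output format, where the elapsed time is printed on a 'Time in seconds' line), the run can be checked and its duration parsed from its standard output:

>>> import re
>>> run.ok
>>> match = re.search(r'Time in seconds\s*=\s*([\d.]+)', run.stdout)
>>> elapsed = float(match.group(1)) if match else None  # None if the line is not found
>>> elapsed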

Transform this prototype in an automated experiment engine

  • Inherit from g5k_cluster_engine, a supplied, generic and reusable execo experiment engine automating the workflow of submitting jobs in parallel to grid5000 clusters / sites. It is well suited for bag-of-tasks kinds of jobs where the cluster is one of the experiment parameters, e.g. benchmarking flops, storage, network, etc.
  • Automate the workflow of section #Prototype the experiment interactively, for one cluster
  • Choose a parameter to explore, and hard-code the variation of this parameter.
  • Show how to use the ParamSweeper facility to easily explore a much larger parameter space, with the benefit of checkpointing the progress, allowing the experiment to be stopped and restarted (see the sketch after this list).
  • Draw the same figure as in section #Prototype the experiment interactively, with much more data.
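
As a hedged illustration of the ParamSweeper step (the parameter values and the run_bench helper below are hypothetical; only ParamSweeper and sweep come from execo_engine), the exploration loop of the engine could look like:

from execo_engine import ParamSweeper, sweep

# hypothetical parameter space: clusters and number of cores to benchmark
parameters = {'cluster': ['taurus', 'hercule'],
              'n_core': [4, 8, 16]}

# persist the exploration state in ./sweeps, so the experiment can be
# stopped and restarted without redoing already finished combinations
sweeper = ParamSweeper('sweeps', sweeps=sweep(parameters))

comb = sweeper.get_next()
while comb:
    try:
        run_bench(comb)       # hypothetical helper running one combination
        sweeper.done(comb)    # mark the combination as successfully processed
    except Exception:
        sweeper.skip(comb)    # mark it as skipped, so it can be retried later
    comb = sweeper.get_next()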
