Run MPI On Grid'5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Running MPI on Grid'5000
When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical MPI setup problems to problems specific to Grid'5000. This practical session aims to guide you through the most common use cases, which are:
- setting up and starting OpenMPI on a default environment using oarsh
- setting up and starting OpenMPI to use a high performance interconnect
- setting up and starting OpenMPI to run on several sites using oargridsub
- setting up and starting OpenMPI on a default environment using allow_classic_ssh
- setting up and starting OpenMPI on a kadeploy image
Several implementations of MPI exist: OpenMPI, MPICH2, MPICH, LAM, etc.
In this practical session, we will focus on OpenMPI.
Pre-requisite
- Basic knowledge of MPI; if you don't know MPI, you can read Grid_computation. For a more comprehensive tutorial on MPI, see the IDRIS courses on MPI
- Knowledge of OAR (Getting Started tutorial), and for the second part of this tutorial, basic knowledge of OARGRID (Advanced OAR tutorial) and Kadeploy (Getting Started tutorial)
Overview
Since June 2010, the same default environment is available on every site; therefore, you can use the MPI version provided by this environment (OpenMPI 1.4.5).
Using OpenMPI on a default environment
Create a sample MPI program
- We will use a very basic MPI program to test OAR/MPI; create a file $HOME/src/mpi/tp.c and copy the following source code:
#include <stdio.h>
#include <unistd.h> /* for gethostname */
#include <mpi.h>
#include <time.h> /* for the work function only */

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int bcast_value = 1;

       gethostname(hostname, sizeof hostname);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize();
       return 0;
}
This program uses MPI to communicate between processes; the MPI process of rank 0 broadcasts an integer (value 42) to all the other processes. Then, each process prints its rank, the total number of processes, and the value it received from process zero.
Setting up and starting OpenMPI on a default environment using oarsh
- Compile your code
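A minimal sketch (the node count is just an example; mpicc is provided by the default environment on the nodes, and the paths follow the layout used above):

frontend: oarsub -I -l nodes=3
node: mpicc $HOME/src/mpi/tp.c -o $HOME/src/mpi/tp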
oarsh is the default connector used when you reserve a node. To be able to use this connector, you need to add the option --mca plm_rsh_agent "oarsh" to mpirun.
You can also set an environment variable (usually in your .bashrc):
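For instance (OpenMPI reads any MCA parameter from an environment variable prefixed with OMPI_MCA_):

export OMPI_MCA_plm_rsh_agent=oarsh
export OMPI_MCA_filem_rsh_agent=oarcp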
OpenMPI also provides a config file solution: in your home directory, create a file ~/.openmpi/mca-params.conf containing:
plm_rsh_agent=oarsh
filem_rsh_agent=oarcp
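From within the job, you can then launch the program on all reserved cores; a sketch (the binary path follows the compilation example above):

node: mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE $HOME/src/mpi/tp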
You should have something like:
helios-52	- 4 - 12 - 42
helios-51	- 0 - 12 - 42
helios-52	- 5 - 12 - 42
helios-51	- 2 - 12 - 42
helios-52	- 6 - 12 - 42
helios-51	- 1 - 12 - 42
helios-51	- 3 - 12 - 42
helios-52	- 7 - 12 - 42
helios-53	- 8 - 12 - 42
helios-53	- 9 - 12 - 42
helios-53	- 10 - 12 - 42
helios-53	- 11 - 12 - 42
You may get (lots of) warning messages if OpenMPI doesn't find high performance hardware: don't be afraid, this is normal, but you can follow FAQ#How_to_use_MPI_in_Grid5000.3F to avoid them. The warnings can look like this:
[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: helios-8.sophia.grid5000.fr
Another transport will be used instead, although this may result in lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
or like this:
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...
exit the job.
Setting up and starting OpenMPI to use high performance interconnect
By default, OpenMPI tries to use any high performance interconnect it can find. This works only if the corresponding libraries were found when OpenMPI itself was compiled (not when your application is compiled). This should be the case if you have built OpenMPI on a wheezy-x64 environment, and it is the case on the default environment.
We will use the NetPIPE tool to check whether the high performance interconnect is really used; download it from this URL: http://www.scl.ameslab.gov/netpipe/code/NetPIPE-3.7.1.tar.gz
A copy is already available inside grid5000:
Unarchive NetPIPE and compile its MPI benchmark:
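A sketch, assuming the tarball is in your home directory; the NetPIPE Makefile's mpi target builds the NPmpi binary used below:

tar -xzf NetPIPE-3.7.1.tar.gz
cd NetPIPE-3.7.1
make mpi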
Myrinet hardware:
Myrinet hardware is available on several sites (see the Hardware page):
- sophia (2G and 10G)
- lille (10G)
- bordeaux (2G and 10G)
To reserve one core on two nodes with a Myrinet 2G or Myrinet 10G interconnect, and run NetPIPE over it, see the sketch below.
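A sketch; the OAR property names (myri2g, myri10g) are assumptions, check the properties actually exposed by each site:

frontend: oarsub -I -l /nodes=2/core=1 -p "myri2g='YES'"      # Myrinet 2G
frontend: oarsub -I -l /nodes=2/core=1 -p "myri10g='YES'"     # or Myrinet 10G
node: mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/NetPIPE-3.7.1/NPmpi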
You should get something like this:
0: 1 bytes 4080 times --> 0.31 Mbps in 24.40 usec
1: 2 bytes 4097 times --> 0.63 Mbps in 24.36 usec
...
122: 8388608 bytes 3 times --> 896.14 Mbps in 71417.13 usec
123: 8388611 bytes 3 times --> 896.17 Mbps in 71414.83 usec
The minimum latency is given by the last column of the first line (1 byte message); the maximum throughput is given by the last line, 896.17 Mbps in this case. Here a latency of 24 usec is very high, so Myrinet was not used as expected. This can happen if OpenMPI did not find the MX libraries during its compilation. You can check this with:
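For instance, with the ompi_info tool shipped with OpenMPI:

ompi_info | grep mx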
If the output is empty, there is no MX support built in.
With a Myrinet 2G network and MX support built in, a typical result looks like this:
0: 1 bytes 23865 times --> 2.03 Mbps in 3.77 usec
1: 2 bytes 26549 times --> 4.05 Mbps in 3.77 usec
...
122: 8388608 bytes 3 times --> 1773.88 Mbps in 36079.17 usec
123: 8388611 bytes 3 times --> 1773.56 Mbps in 36085.69 usec
In this example, we get 3.77 usec of latency and almost 1.8 Gbps of bandwidth.
Infiniband hardware:
Infiniband hardware is available on several sites (see the Hardware page):
- rennes (20G)
- nancy (20G)
- bordeaux (10G & 20G)
- grenoble (20G & 40G)
To reserve one core on two nodes with a 10G or 20G Infiniband interconnect, see the sketch below.
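A sketch; the OAR property names (ib10g, ib20g) are assumptions, check the properties actually exposed by each site:

frontend: oarsub -I -l /nodes=2/core=1 -p "ib10g='YES'"      # Infiniband 10G
frontend: oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"      # or Infiniband 20G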
Then do exactly the same thing as for the Myrinet interconnect. To check whether Infiniband support is available in OpenMPI, run:
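For instance:

ompi_info | grep openib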
you should see something like this:
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)
With Infiniband 40G (QDR), you should get much better performance than with Ethernet or Myrinet 2G:
0: 1 bytes 30716 times --> 4.53 Mbps in 1.68 usec
1: 2 bytes 59389 times --> 9.10 Mbps in 1.68 usec
...
121: 8388605 bytes 17 times --> 25829.13 Mbps in 2477.82 usec
122: 8388608 bytes 20 times --> 25841.35 Mbps in 2476.65 usec
123: 8388611 bytes 20 times --> 25823.40 Mbps in 2478.37 usec
Less than 2 usec of latency and almost 26 Gbit/s of bandwidth!
More advanced use cases
Running MPI on several sites at once
In this tutorial, we use the following sites: rennes, sophia and grenoble, but you can use any sites. To do this, we will use oargrid; see the Grid_jobs_management tutorial for more information.
Note: For multiple sites, you may want to use only TCP, and not native MX or native Infiniband; to do this, add this option to mpirun: --mca btl self,sm,tcp
Synchronize the src/mpi directory from the frontend to the two other sites (the tp binary must be available on all sites). Here we suppose we are connected to sophia and want to synchronize to grenoble and rennes:
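A sketch, assuming each frontend is reachable by its site name and that the source directory layout is the same on every site:

frontend: rsync -avz ~/src/mpi/ grenoble:src/mpi/
frontend: rsync -avz ~/src/mpi/ rennes:src/mpi/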
Reserve nodes on the 3 sites with oargridsub (you can reserve nodes from specific clusters if you want to).
frontend:
oargridsub -w 02:00:00 rennes:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out
Get the oargrid Id and Job key from the output of oargridsub:
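A sketch that parses oargrid.out; the grep patterns depend on the exact oargridsub output format and are assumptions:

frontend: export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)
frontend: export OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " ")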
Get the node list using oargridstat:
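A sketch (the grep filters out non-hostname lines and is an assumption):

frontend: oargridstat -w -l $OARGRID_JOB_ID | grep grid5000 > ~/gridnodes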
connect to the first node:
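For instance, using the job key exported above so that oarsh is allowed into the grid job:

frontend: oarsh $(head -1 ~/gridnodes)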
And run mpirun:
node:
mpirun -machinefile ~/gridnodes --mca plm_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp
Compilation [optional]
If you want to use a custom version of OpenMPI, you can compile it in your home directory.
- Make an interactive reservation and compile OpenMPI on a node:
- Get OpenMPI (or here: http://www.open-mpi.org/software/ompi/v1.4/)
- Unarchive OpenMPI
- Configure it (wait ~1mn30s); the whole sequence is sketched below:
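A sketch under stated assumptions: the tarball name/version and download URL are examples, and the --prefix matches the installation directory used below. Note that the upstream tarball does not include the Grid'5000 patch discussed next; the tarball used in this tutorial does.

frontend: oarsub -I
node: cd ~/src
node: wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.5.tar.bz2
node: tar -xjf openmpi-1.4.5.tar.bz2
node: cd openmpi-1.4.5
node: ./configure --prefix=$HOME/openmpi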
The source tarball used here includes a small patch to make OpenMPI work on Grid'5000 across several sites simultaneously. For the curious, the patch is:
--- ompi/mca/btl/tcp/btl_tcp_proc.c.orig	2010-03-23 14:01:28.000000000 +0100
+++ ompi/mca/btl/tcp/btl_tcp_proc.c	2010-03-23 14:01:50.000000000 +0100
@@ -496,7 +496,7 @@
                         local_interfaces[i]->ipv4_netmask)) {
                     weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
                 } else {
-                    weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
+                    weights[i][j] = CQ_NO_CONNECTION;
                 }
                 best_addr[i][j] = peer_interfaces[j]->ipv4_endpoint_addr;
             }
Then compile (wait ~2mn30s) and install it in your home directory (in $HOME/openmpi/), as sketched below:
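A sketch (make install honors the --prefix given at configure time; the -j value is just an example):

node: make -j4
node: make install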
Then you can repeat the same steps as before, but with $HOME/openmpi/bin/mpicc and $HOME/openmpi/bin/mpirun.
Setting up and starting OpenMPI on a kadeploy image
Building a kadeploy image
The default OpenMPI version available in Debian-based distributions is not compiled with high performance libraries like Myrinet/MX, therefore we must recompile OpenMPI from source. Fortunately, the default images (wheezy-x64-XXX), in all variants but the min one, include the libraries for the high performance interconnects, and OpenMPI will find them at compile time.
We will create a kadeploy image based on an existing one.
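The steps below assume a deploy job and the wheezy-x64-base environment (whose description file is reused later); a minimal sketch, where the node count and walltime are just examples:

frontend: oarsub -I -t deploy -l nodes=1,walltime=2
frontend: kadeploy3 -e wheezy-x64-base -f $OAR_NODEFILE -k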
Download the OpenMPI tarball if you don't already have it, and copy it to the first node:
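A sketch (the tarball name/URL is the example used in the Compilation section above):

frontend: wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.5.tar.bz2
frontend: scp openmpi-1.4.5.tar.bz2 root@$(head -1 $OAR_NODEFILE):/tmp/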
Then connect to the deployed node as root and install OpenMPI: unarchive it, add gfortran, f2c and the BLAS library, then configure and compile. The whole sequence is sketched below.
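A sketch under stated assumptions: the tarball is in /tmp as copied above, and the Debian package names (gfortran, f2c, libblas-dev) are the usual ones for gfortran, f2c and BLAS:

frontend: ssh root@$(head -1 $OAR_NODEFILE)
node: apt-get update && apt-get -y install gfortran f2c libblas-dev
node: cd /tmp && tar -xjf openmpi-1.4.5.tar.bz2 && cd openmpi-1.4.5
node: ./configure && make -j4 && make install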
Create a dedicated user named mpi, in the group rdma (for Infiniband):
useradd -m -g rdma mpi -d /var/mpi
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf
mkdir ~mpi/.ssh
cp ~/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
mkdir src
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo "    StrictHostKeyChecking no" >> ~/.ssh/config
exit
exit
rsync -avz ~/src/mpi/ mpi@`head -1 $OAR_NODEFILE`:src/mpi/
ssh root@`head -1 $OAR_NODEFILE`
Create the image using tgz-g5k
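A minimal sketch (run as root on the node; the target path is just an example):

node: tgz-g5k /dev/shm/image.tgz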
Disconnect from the node, and copy the image from the frontend; first, create a public directory:
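A sketch (the source path matches the tgz-g5k example above):

frontend: mkdir -p $HOME/public
frontend: scp root@$(head -1 $OAR_NODEFILE):/dev/shm/image.tgz $HOME/public/wheezy-openmpi.tgz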
Copy the description file of wheezy-x64-base:
frontend:
grep -v visibility /grid5000/descriptions/wheezy-x64-base-1.1.dsc > $HOME/public/wheezy-openmpi.dsc
Change the image name in the description file; we will use an HTTP URL for multi-site deployment. The site name is derived automatically from the frontend's hostname:
perl -i -pe "s@/grid5000/images/wheezy-x64-base-1.1.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/wheezy-openmpi.tgz@" $HOME/public/wheezy-openmpi.dsc
Now you can finish the job (exit)
Using a kadeploy image
Single site
Reserve nodes with a deploy job, deploy your image, and connect to the first node as the mpi user:
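A sketch under stated assumptions: node count and walltime are examples, and the mpi user is the one created when building the image:

frontend: oarsub -I -t deploy -l nodes=3,walltime=2
frontend: kadeploy3 -f $OAR_NODEFILE -a $HOME/public/wheezy-openmpi.dsc -k -o ~/nodes.deployed
frontend: ssh mpi@$(head -1 ~/nodes.deployed)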
Single site with Myrinet hardware
Create a nodefile with a single entry per node, copy it to the first node, and connect to that node; then run NetPIPE (the whole sequence is sketched below).
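A sketch, assuming a deploy job on Myrinet nodes as above, the mpi user from the image, and NPmpi compiled with the image's OpenMPI:

frontend: uniq $OAR_NODEFILE > ~/nodefile
frontend: scp ~/nodefile mpi@$(head -1 ~/nodes.deployed):
frontend: ssh mpi@$(head -1 ~/nodes.deployed)
node: mpirun -machinefile ~/nodefile ~/NetPIPE-3.7.1/NPmpi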
This time we have:
0: 1 bytes 23865 times --> 2.03 Mbps in 3.77 usec
1: 2 bytes 26549 times --> 4.05 Mbps in 3.77 usec
...
122: 8388608 bytes 3 times --> 1773.88 Mbps in 36079.17 usec
123: 8388611 bytes 3 times --> 1773.56 Mbps in 36085.69 usec
This time we get 3.77 usec, which is good, and almost 1.8 Gbps. We are using the Myrinet interconnect!
Multiple sites
Choose three clusters from 3 different sites.
frontend:
oargridsub -t deploy -w 02:00:00 cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2" > oargrid.out
Get the node list using oargridstat:
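As for the non-deploy case, a sketch (the grep patterns depend on the exact oargridsub/oargridstat output and are assumptions):

frontend: export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)
frontend: oargridstat -w -l $OARGRID_JOB_ID | grep grid5000 > ~/gridnodes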
Deploy on all sites using the --multi-server option:
frontend:
kadeploy3 -f gridnodes -a $HOME/public/wheezy-openmpi.dsc -k --multi-server -o ~/nodes.deployed
Connect to the first node and run the program across all sites:
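A sketch; the machinefile and MCA options mirror the earlier multi-site run, and the tp binary is assumed to be present in the mpi user's home on every site:

frontend: scp ~/gridnodes mpi@$(head -1 ~/nodes.deployed):
frontend: ssh mpi@$(head -1 ~/nodes.deployed)
node: mpirun -machinefile ~/gridnodes --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp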
Setting up and starting OpenMPI on a default environment using allow_classic_ssh
Submit a job with the allow_classic_ssh type:
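For example:

frontend: oarsub -I -t allow_classic_ssh -l nodes=3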
- Then launch mpirun directly; with allow_classic_ssh, plain ssh is used between nodes, so no oarsh option is needed:
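A minimal sketch (the binary path follows the earlier example):

node: mpirun -machinefile $OAR_NODEFILE $HOME/src/mpi/tp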
Mpich2
If you want or need to use MPICH2 on Grid'5000, you should do the following.
First, you have to do this once (on each site):
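MPICH2's mpd daemon requires a ~/.mpd.conf file containing a secret word, readable only by you; a sketch (the secret word is just an example):

echo "MPD_SECRETWORD=change-me" > ~/.mpd.conf
chmod 600 ~/.mpd.conf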
Then you can use a script like this to launch mpd and mpirun (replace mpich2binary with your MPICH2-compiled program):
NODES=`uniq < $OAR_NODEFILE | wc -l | tr -d ' '`
NPROCS=`wc -l < $OAR_NODEFILE | tr -d ' '`
mpdboot --rsh=oarsh --totalnum=$NODES --file=$OAR_NODEFILE
sleep 1
mpirun -n $NPROCS mpich2binary