Run MPI On Grid'5000

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Running MPI on Grid'5000

When attempting to run MPI on Grid'5000, you will face a number of challenges, ranging from classical MPI setup problems to problems specific to Grid'5000. This practical session aims to guide you through the most common use cases, which are:

  • Setting up and starting Open MPI on a default environment using oarsh.
  • Setting up and starting Open MPI to use high performance interconnect.
  • Setting up and starting Open MPI to run on several sites using oargridsub.
  • Setting up and starting Open MPI on a default environment using allow_classic_ssh.
  • Setting up and starting Open MPI on a kadeploy image.

Several implementations of MPI exist: Open MPI, MPICH2, MPICH, LAM, etc.

In this practical session, we will focus on Open MPI.

Prerequisites

Overview

Since June 2010, the same default environment is available on every site; therefore, you can use the default MPI library provided by this environment (Open MPI 1.4.5).

Using Open MPI on a default environment

Create a sample MPI program

  • We will use a very basic MPI program to test OAR/MPI. Create a file $HOME/src/mpi/tp.c and copy the following source code:
Terminal.png frontend:
mkdir -p $HOME/src/mpi
Terminal.png frontend:
vi $HOME/src/mpi/tp.c

The source code:

#include <stdio.h>
#include <unistd.h> /* for gethostname */
#include <mpi.h>

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int bcast_value = 1;

       gethostname (hostname, sizeof hostname);
       MPI_Init (&argc, &argv);
       MPI_Comm_rank (MPI_COMM_WORLD, &rank);
       MPI_Comm_size (MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast (&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier (MPI_COMM_WORLD);
       MPI_Finalize ();
       return 0;
}

This program uses MPI to communicate between processes; the MPI process of rank 0 broadcasts an integer (value 42) to all the other processes. Then, each process prints its rank, the total number of processes, and the value it received from process 0.

Setting up and starting Open MPI on a default environment using oarsh

Submit a job:

Terminal.png frontend:
oarsub -I -l nodes=3

Compile your code:

Terminal.png node:
mpicc src/mpi/tp.c -o src/mpi/tp

oarsh is the default connector used when you reserve a node. To be able to use this connector, you need to add the option --mca plm_rsh_agent "oarsh" to mpirun.

Terminal.png node:
mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE $HOME/src/mpi/tp

You can also set an environment variable (usually in your .bashrc):

Terminal.png bashrc:
export OMPI_MCA_plm_rsh_agent=oarsh

Open MPI also provides a configuration file solution. In your home directory, create the file ~/.openmpi/mca-params.conf with the following content:

plm_rsh_agent=oarsh
filem_rsh_agent=oarcp
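
For example, the file can be created from the frontend as follows (a minimal sketch; the parameter values are exactly those listed above):

Terminal.png frontend:
mkdir -p ~/.openmpi
Terminal.png frontend:
printf 'plm_rsh_agent=oarsh\nfilem_rsh_agent=oarcp\n' > ~/.openmpi/mca-params.conf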

You should get output like this:

helios-52       - 4 - 12 - 42
helios-51       - 0 - 12 - 42
helios-52       - 5 - 12 - 42
helios-51       - 2 - 12 - 42
helios-52       - 6 - 12 - 42
helios-51       - 1 - 12 - 42
helios-51       - 3 - 12 - 42
helios-52       - 7 - 12 - 42
helios-53       - 8 - 12 - 42
helios-53       - 9 - 12 - 42
helios-53       - 10 - 12 - 42
helios-53       - 11 - 12 - 42

You may get (a lot of) warning messages if Open MPI does not find high performance hardware. Don't worry, this is normal, but you can follow FAQ#How_to_use_MPI_in_Grid5000.3F to avoid them. The warnings can look like this:

[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: helios-8.sophia.grid5000.fr

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc

or like this:

[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...

Exit the job.

Setting up and starting Open MPI to use high performance interconnect

By default, Open MPI tries to use any high performance interconnect it can find. This only works if the related libraries were found during the compilation of Open MPI (not during the compilation of your application). It should work if you built Open MPI on a wheezy-x64 environment, and it also works correctly on the default environment.
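
You can check which transport (BTL) components were built into the default Open MPI with ompi_info (a quick sketch; the openib component, for instance, indicates InfiniBand support):

Terminal.png node:
ompi_info | grep btl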

Note.png Note

If you want to disable support for high performance networks, use --mca btl self,sm,tcp. Beware: Open MPI will then use all available TCP networks, including IP over InfiniBand if present; add --mca btl_tcp_if_exclude ib0,lo,myri0 to also disable the IP emulation of the high performance interconnects.
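
Put together, a run forced onto plain TCP would look like this (a sketch combining the options of this note with the oarsh connector and the tp example above):

Terminal.png node:
mpirun --mca plm_rsh_agent "oarsh" --mca btl self,sm,tcp --mca btl_tcp_if_exclude ib0,lo,myri0 -machinefile $OAR_NODEFILE $HOME/src/mpi/tp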

We will use the NetPIPE tool to check whether the high performance interconnect is really used. Download it from this URL: http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz

Warning.png Warning

NetPIPE runs only between two MPI processes. Please make sure you have only two lines in your node file. Otherwise, MPI will launch more than two processes, which is incompatible with NetPIPE.
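
If your node file contains more than two lines (for example, one line per core), you can build a reduced machine file once your reservation is running and pass it to mpirun instead of $OAR_NODEFILE (a sketch; ~/netpipe_nodes is an arbitrary file name):

Terminal.png node:
uniq $OAR_NODEFILE | head -2 > ~/netpipe_nodes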

Note.png Note

Remember to configure the proxy if wget freezes on "connecting"; see the relevant section of the Getting Started tutorial.
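
For example (the proxy address below is the one used elsewhere on this page):

Terminal.png frontend:
export http_proxy=http://proxy:3128/
Terminal.png frontend:
wget http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz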

Terminal.png frontend:
oarsub -I
Terminal.png node:
cd $HOME/src/mpi

Unarchive NetPIPE

Terminal.png node:
tar zvxf ~/NetPIPE-3.7.1.tar.gz
Terminal.png node:
cd NetPIPE-3.7.1

Compile

Terminal.png node:
make mpi

InfiniBand hardware:

InfiniBand hardware is available on several sites (see the Hardware page):

  • Rennes (20G)
  • Nancy (20G)
  • Grenoble (20G & 40G)

To reserve one core on each of two nodes with a 10G InfiniBand interconnect:

Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "ib10g='YES'"

or for 20G:

Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"

To test the network:

Terminal.png node:
cd $HOME/src/mpi/NetPIPE-3.7.1
Terminal.png node:
mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi

To check whether InfiniBand support is available in Open MPI, run:

Terminal.png node:
ompi_info | grep openib

You should see something like this:

                MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)

With InfiniBand 40G (QDR), you should get much better performance than with Ethernet or Myrinet 2G:

 0:       1 bytes  30716 times -->      4.53 Mbps in       1.68 usec
 1:       2 bytes  59389 times -->      9.10 Mbps in       1.68 usec
...
121: 8388605 bytes     17 times -->  25829.13 Mbps in    2477.82 usec
122: 8388608 bytes     20 times -->  25841.35 Mbps in    2476.65 usec
123: 8388611 bytes     20 times -->  25823.40 Mbps in    2478.37 usec

Less than 2 microseconds of latency and almost 26 Gbit/s of bandwidth!

Myrinet hardware:

Myrinet hardware is available on several sites (see the Hardware page):

  • Lille (10G)
  • Sophia (10G)

To reserve one core on two nodes with a 10G Myrinet interconnect:

Myrinet 10G:

Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "myri10g='YES'"

To test the network:

Terminal.png node:
cd $HOME/src/mpi/NetPIPE-3.7.1
Terminal.png node:
mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi

You should get something like this:

 0:         1 bytes   4080 times -->      0.31 Mbps in      24.40 usec     
 1:         2 bytes   4097 times -->      0.63 Mbps in      24.36 usec     
 ...
 122: 8388608 bytes      3 times -->    896.14 Mbps in   71417.13 usec
 123: 8388611 bytes      3 times -->    896.17 Mbps in   71414.83 usec

The minimum latency is given by the last column of the 1-byte line; the maximum throughput is given by the last line, 896.17 Mbps in this case. Here a latency of 24 usec is very high, so Myrinet was not used as expected. This can happen if Open MPI did not find the MX libraries during compilation. You can check this with:

Terminal.png node:
ompi_info | grep mx

If the output is empty, MX support is not built in.
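
If MX support is built in but you are not sure it is being selected, you can force the MX BTL so that a broken MX setup fails loudly instead of silently falling back to TCP (a sketch; the component name is the one reported by ompi_info):

Terminal.png node:
mpirun --mca plm_rsh_agent "oarsh" --mca btl self,sm,mx -machinefile $OAR_NODEFILE NPmpi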

With a Myrinet 2G network, a typical result looks like this:

  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec     
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec     
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec

In this example, we get a latency of 3.77 usec and a bandwidth of almost 1.8 Gbit/s.

More advanced use cases

Running MPI on several sites at once

In this tutorial, we use the following sites: rennes, sophia and grenoble, but you can use any sites. To run MPI applications on several sites, we will use oargrid. See the Grid_jobs_management tutorial for more information.

Warning.png Warning

There is still a problem when using lille and luxembourg nodes simultaneously.

Warning.png Warning

Open MPI tries to figure out the best network interface at run time, and it also assumes that some networks are not routed between sites. To avoid this kind of problem, we must add the option --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 to mpirun.

Note.png Note

For multiple sites, you may want to use only TCP, and not native MX or native InfiniBand; to do this, add this option to mpirun: --mca btl self,sm,tcp


Synchronize the src/mpi directory from the frontend to the two other sites (the tp binary must be available on all sites). Here we assume we are connected to sophia and want to synchronize to grenoble and rennes.

Terminal.png frontend.sophia:
ssh rennes mkdir -p src/mpi/
Terminal.png frontend.sophia:
rsync --delete -avz ~/src/mpi/ rennes.grid5000.fr:src/mpi/
Terminal.png frontend.sophia:
ssh grenoble mkdir -p src/mpi/
Terminal.png frontend.sophia:
rsync --delete -avz ~/src/mpi/ grenoble.grid5000.fr:src/mpi/

Reserve nodes on the 3 sites with oargridsub (you can reserve nodes from specific clusters if you want to).

Terminal.png frontend:
oargridsub -w 02:00:00 rennes:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out

Get the oargrid ID and job key from the output of oargridsub:

Terminal.png frontend:
export OAR_JOB_KEY_FILE=`grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " "`
Terminal.png frontend:
export OARGRID_JOB_ID=`grep "Grid reservation id" oargrid.out | cut -f2 -d=`

Get the node list using oargridstat and copy the list to the first node:

Terminal.png frontend:
oargridstat -w -l $OARGRID_JOB_ID | grep -v ^$ > ~/gridnodes
Terminal.png frontend:
oarcp ~/gridnodes `head -1 ~/gridnodes`:

Connect to the first node:

Terminal.png frontend:
oarsh `head -1 ~/gridnodes`

And run your MPI application:

Terminal.png node:
cd $HOME/src/mpi/
Terminal.png node:
mpirun -machinefile ~/gridnodes --mca plm_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp

Compilation [optional]

If you want to use a custom version of Open MPI, you can compile it in your home directory.

  • Make an interactive reservation and compile Open MPI on a node:
Terminal.png frontend:
oarsub -I
Terminal.png node:
cd /tmp/
Terminal.png frontend:
export http_proxy=http://proxy:3128/

Unarchive Open MPI

Terminal.png node:
tar -xf openmpi-1.4.3-g5k.tar.gz
Terminal.png node:
cd openmpi-1.4.3

Configure (wait ~1min30s)

Terminal.png node:
./configure --prefix=$HOME/openmpi/ --with-memory-manager=none

The source tarball includes a small patch to make Open MPI work on Grid'5000 on several sites simultaneously. For the curious, the patch is:

--- ompi/mca/btl/tcp/btl_tcp_proc.c.orig        2010-03-23 14:01:28.000000000 +0100
+++ ompi/mca/btl/tcp/btl_tcp_proc.c     2010-03-23 14:01:50.000000000 +0100
@@ -496,7 +496,7 @@
                                 local_interfaces[i]->ipv4_netmask)) {
                         weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
                     } else {
-                        weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
+                        weights[i][j] = CQ_NO_CONNECTION;
                     }
                     best_addr[i][j] = peer_interfaces[j]->ipv4_endpoint_addr;
                 }

Compile (wait ~2min30s)

Terminal.png node:
make -j4
  • Install it in your home directory (in $HOME/openmpi/):
Terminal.png node:
make install

Then you can follow the same steps as before, but with $HOME/openmpi/bin/mpicc and $HOME/openmpi/bin/mpirun.
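
For example, compiling and running the test program with the locally installed version could look like this (a sketch; you typically also need to extend PATH and LD_LIBRARY_PATH, e.g. in your .bashrc, so that the runtime libraries in $HOME/openmpi/lib are found on every node):

export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
$HOME/openmpi/bin/mpicc $HOME/src/mpi/tp.c -o $HOME/src/mpi/tp
$HOME/openmpi/bin/mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE $HOME/src/mpi/tp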

Setting up and starting Open MPI on a kadeploy image

Building a kadeploy image

The default Open MPI version available in Debian-based distributions is not compiled with high performance libraries like Myrinet/MX, so we must recompile Open MPI from source. Fortunately, every default image (wheezy-x64-XXX) except the min variant includes the libraries for high performance interconnects, and Open MPI will find them at compile time.

We will create a kadeploy image based on an existing one.

Terminal.png frontend:
oarsub -I -t deploy -l nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODEFILE -e wheezy-x64-base -k

Download the Open MPI tarball if you don't already have it (configure the proxy first):

Terminal.png frontend:
export http_proxy=http://proxy:3128/

Copy Open MPI tarball on the first node:

Terminal.png frontend:
scp openmpi-1.4.3-g5k.tar.gz root@`head -1 $OAR_NODEFILE`:/tmp

Then connect to the deployed node as root, and install Open MPI:

Terminal.png frontend:
ssh root@`head -1 $OAR_NODEFILE`

Unarchive Open MPI:

Terminal.png node:
cd /tmp/
Terminal.png node:
tar -xf openmpi-1.4.3-g5k.tar.gz
Terminal.png node:
cd openmpi-1.4.3

Install gfortran, f2c and the BLAS library:

Terminal.png node:
apt-get -y install gfortran f2c libblas-dev

Configure and compile:

Terminal.png node:
./configure --libdir=/usr/local/lib64 --with-memory-manager=none
Terminal.png node:
make -j4
Terminal.png node:
make install

Create a dedicated user named mpi in the group rdma (needed for InfiniBand):

useradd -m -g rdma mpi -d /var/mpi
# allow locking memory without limit (required for InfiniBand)
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf
mkdir ~mpi/.ssh
cp ~/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
mkdir src
# passwordless SSH between the deployed nodes for the mpi user
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo "        StrictHostKeyChecking no" >> ~/.ssh/config
exit
exit
# back on the frontend: copy your MPI sources to the mpi user on the first node
rsync -avz ~/src/mpi/ mpi@`head -1 $OAR_NODEFILE`:src/mpi/
ssh root@`head -1 $OAR_NODEFILE`

Create the image using tgz-g5k:

Terminal.png node:
tgz-g5k /dev/shm/image.tgz

Disconnect from the node (exit). From the frontend, copy the image to the public directory:

Terminal.png frontend:
mkdir -p $HOME/public
Terminal.png frontend:
scp root@`head -1 $OAR_NODEFILE`:/dev/shm/image.tgz $HOME/public/wheezy-openmpi.tgz

Copy the description file of wheezy-x64-base:

Terminal.png frontend:
grep -v visibility /grid5000/descriptions/wheezy-x64-base-1.4.dsc > $HOME/public/wheezy-openmpi.dsc

Change the image name in the description file; we will use an HTTP URL for multi-site deployment:

perl -i -pe "s@server:///grid5000/images/wheezy-x64-base-1.4.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/wheezy-openmpi.tgz@" $HOME/public/wheezy-openmpi.dsc
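
As an optional sanity check, you can verify that the description file now points to your image:

Terminal.png frontend:
grep tgz $HOME/public/wheezy-openmpi.dsc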

Now you can terminate the job:

Terminal.png frontend:
oardel $OAR_JOB_ID

Using a kadeploy image

Single site

Terminal.png frontend:
oarsub -I -t deploy -l /nodes=3
Terminal.png frontend:
kadeploy3 -a $HOME/public/wheezy-openmpi.dsc -f $OAR_NODEFILE -k
Terminal.png frontend:
scp $OAR_NODEFILE mpi@`head -1 $OAR_NODEFILE`:nodes

Connect to the first node:

Terminal.png frontend:
ssh mpi@`head -1 $OAR_NODEFILE`
Terminal.png node:
cd $HOME/src/mpi/
Terminal.png node:
/usr/local/bin/mpicc tp.c -o tp
Terminal.png node:
/usr/local/bin/mpirun -machinefile ~/nodes ./tp

Single site with Myrinet hardware

Terminal.png frontend:
oarsub -I -t deploy -l /nodes=2 -p "myri10g='YES'"
Terminal.png frontend:
kadeploy3 -k -a ~/public/wheezy-openmpi.dsc -f $OAR_NODEFILE

Create a nodefile with a single entry per node:

Terminal.png frontend:
uniq $OAR_NODEFILE > nodes

Copy it to the first node:

Terminal.png frontend:
scp nodes mpi@`head -1 nodes`:

Connect to the first node:

Terminal.png frontend:
ssh mpi@`head -1 nodes`
Terminal.png node:
cd $HOME/src/mpi/NetPIPE-3.7.1
Terminal.png node:
/usr/local/bin/mpirun -machinefile ~/nodes NPmpi

You should get:

  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec     
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec     
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec

This time we get 3.77 usec, which is good, and almost 1.8 Gbit/s: the Myrinet interconnect is being used!

Multiple sites

Choose three clusters from three different sites.

Terminal.png frontend:
oargridsub -t deploy -w 02:00:00 cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2" > oargrid.out
Terminal.png frontend:
export OARGRID_JOB_ID=`grep "Grid reservation id" oargrid.out | cut -f2 -d=`

Get the node list using oargridstat:

Terminal.png frontend:
oargridstat -w -l $OARGRID_JOB_ID |grep grid > ~/gridnodes


Deploy on all sites using the --multi-server option:

Terminal.png frontend:
kadeploy3 -f gridnodes -a $HOME/public/wheezy-openmpi.dsc -k --multi-server -o ~/nodes.deployed
Terminal.png frontend:
scp ~/nodes.deployed mpi@`head -1 ~/nodes.deployed`:

Connect to the first node:

Terminal.png frontend:
ssh mpi@`head -1 ~/nodes.deployed`
Terminal.png node:
cd $HOME/src/mpi/
Terminal.png node:
/usr/local/bin/mpirun -machinefile ~/nodes.deployed --mca btl self,sm,tcp --mca opal_net_private_ipv4 "192.168.7.0/24\;192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18\;192.168.200.0/23" tp

Setting up and starting Open MPI on a default environment using allow_classic_ssh

Submit a job with the allow_classic_ssh type:

Terminal.png frontend:
oarsub -I -t allow_classic_ssh -l nodes=3

Launch your parallel job:

Terminal.png node:
mpirun -machinefile $OAR_NODEFILE $HOME/src/mpi/tp

MPICH2

Warning.png Warning

This documentation covers using MPICH2 with the MPD process manager, but the default process manager for MPICH2 is now Hydra. See also the MPICH documentation.

If you want or need to use MPICH2 on Grid'5000, proceed as follows.

First, do this once on each site:

Terminal.png frontend:
echo "MPD_SECRETWORD=secret" > $HOME/.mpd.conf
Terminal.png frontend:
chmod 600 $HOME/.mpd.conf

Then you can use a script like this to launch mpd/mpirun:

# number of distinct nodes and total number of processes in the reservation
NODES=`uniq < $OAR_NODEFILE | wc -l | tr -d ' '`
NPROCS=`wc -l < $OAR_NODEFILE | tr -d ' '`
# start one MPD daemon per node, using oarsh as the remote shell
mpdboot --rsh=oarsh --totalnum=$NODES --file=$OAR_NODEFILE
sleep 1
mpirun -n $NPROCS mpich2binary
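
Alternatively, with the Hydra process manager mentioned in the warning above, something like the following should work (a sketch based on the generic MPICH/Hydra options, not tested here; oarsh is passed as the remote launch command):

NPROCS=`wc -l < $OAR_NODEFILE | tr -d ' '`
mpiexec -launcher ssh -launcher-exec oarsh -f $OAR_NODEFILE -n $NPROCS mpich2binary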