Run MPI On Grid'5000
{{Portal|User}}
{{Portal|Tutorial}}
{{Portal|MPI}}
{{Pages|HPC}}
{{Portal|HPC}}
{{TutorialHeader}}
= Introduction =
 
[https://en.wikipedia.org/wiki/Message_Passing_Interface MPI] is a programming interface that enables the communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the [http://www.idris.fr/data/cours/parallel/mpi/choix_doc.html IDRIS course on MPI]. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the [http://www.open-mpi.org Open MPI] implementation.
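As a minimal illustration of this message-passing model (a sketch for reference only; it is not used in the rest of this tutorial), the process of rank 0 can send an integer to the process of rank 1 as follows:
<syntaxhighlight lang="c">
/* Minimal MPI point-to-point sketch: rank 0 sends an integer to rank 1.
   Run with at least two processes, e.g.: mpirun -n 2 ./send_recv */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);            /* destination rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* source rank 0, tag 0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>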
 
Before following this tutorial, you should already have some basic knowledge of OAR (see the [[Getting Started]] tutorial). For the second part of this tutorial, you should also know the basics about OARGRID (see the [[Advanced OAR]] tutorial) and Kadeploy (see the [[Getting Started]] tutorial).
 
== Running MPI on Grid'5000 ==
When attempting to run MPI on Grid'5000 you'll be faced with a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:
* Setting up and starting Open MPI on a default environment using <code class='command'>oarsh</code>.
* Setting up and starting Open MPI on a default environment using <code class='command'>allow_classic_ssh</code>.
* Setting up and starting Open MPI to use high performance interconnect.
* Setting up and starting Open MPI to run on several sites using <code class='command'>oargridsub</code>.
* Setting up and starting Open MPI on a kadeploy image.

= Using Open MPI on a default environment =

The default Grid'5000 environment provides Open MPI 1.6.5 (see ompi_info).

== Creating a sample MPI program ==
For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes and the value it received from process 0.

In your home directory, create a file <code class="file">~/mpi/tp.c</code> and copy the source code:
{{Term|location=frontend|cmd=<code class="command">mkdir</code> ~/mpi}}
{{Term|location=frontend|cmd=<code class="command">vi</code> ~/mpi/tp.c}}
<syntaxhighlight lang="c">#include <stdio.h>
#include <mpi.h>
#include <time.h>   /* for the work function only */
#include <unistd.h> /* for gethostname */

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int i, pid;
       int bcast_value = 1;

       gethostname(hostname, sizeof hostname);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize();
       return 0;
}
</syntaxhighlight>
You can then compile your code:
{{Term|location=frontend|cmd=<code class="command">mpicc</code> ~/mpi/tp.c -o ~/mpi/tp}}


== Setting up and starting Open MPI on a default environment using <code class=command>oarsh</code> ==
Submit a job:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l nodes=3}}


You can connect to the reserved nodes using <code class=command>oarsh</code>, which is a wrapper around the <code class=command>ssh</code> command that handles the configuration of the SSH environment. As Open MPI defaults to using <code class=command>ssh</code> for the remote startup of processes, you need to add the option <code class=command>--mca orte_rsh_agent "oarsh"</code> to your <code class=command>mpirun</code> command line.

{{Note|text=For Debian Wheezy, use '''plm_rsh_agent''' instead of '''orte_rsh_agent'''}}

{{Term|location=node|cmd=<code class="command">mpirun</code> --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp}}


You can also set an environment variable (usually in your .bashrc):
{{Term|location=bashrc|cmd=<code class="command">export</code> OMPI_MCA_orte_rsh_agent=oarsh}}
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile $OAR_NODEFILE ~/mpi/tp}}


Open MPI also provides a configuration file for <code class=command>--mca</code> parameters. In your home directory, create the file <code class="file">~/.openmpi/mca-params.conf</code>:
<pre class="brush: bash">
orte_rsh_agent=oarsh
filem_rsh_agent=oarcp
</pre>
You should have something like:
  helios-52       - 4 - 12 - 42
  helios-51       - 0 - 12 - 42
  helios-52       - 5 - 12 - 42
  helios-51       - 2 - 12 - 42
  helios-52       - 6 - 12 - 42
  helios-51       - 1 - 12 - 42
  helios-51       - 3 - 12 - 42
  helios-52       - 7 - 12 - 42
  helios-53       - 8 - 12 - 42
  helios-53       - 9 - 12 - 42
  helios-53       - 10 - 12 - 42
  helios-53       - 11 - 12 - 42


You may have (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will learn how to select clusters with a high performance interconnect in greater detail below. Error messages might look like this:
  [[2616,1],2]: A high-performance Open MPI point-to-point messaging module
  was unable to find any relevant network interfaces:
  
  Module: OpenFabrics (openib)
    Host: helios-8.sophia.grid5000.fr
  
  Another transport will be used instead, although this may result in
  lower performance.
  --------------------------------------------------------------------------
  warning:regcache incompatible with malloc
  warning:regcache incompatible with malloc
  warning:regcache incompatible with malloc
or like this:
  [griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  [griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  [griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  [griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
  ...


You could use [[FAQ#How_to_use_MPI_in_Grid5000.3F]] to avoid these warnings.
 
==Setting up and starting Open MPI on a default environment using allow_classic_ssh==
Submit a job with the <code>allow_classic_ssh</code> type:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t allow_classic_ssh -l nodes=3}}
 
Launch your parallel job:
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile $OAR_NODEFILE ~/mpi/tp}}


== Setting up and starting Open MPI to use high performance interconnect ==
By default, Open MPI tries to use any high performance interconnect it can find. This only works if the related libraries were found during the compilation of Open MPI (not during the compilation of your application). It should work if you built Open MPI on a jessie-x64 environment, and it also works correctly on the default environment.
 
Options can be used to either select or disable an interconnect.


MCA parameters ('''--mca''') can be used to select the drivers that are used at run-time by Open MPI. To learn more about the MCA parameters, see also:
* [https://www.open-mpi.org/faq/?category=tuning#mca-params The Open MPI FAQ about tuning parameters]
* [http://www.open-mpi.org/faq/?category=tcp#tcp-selection How do I tell Open MPI which IP interfaces / networks to use?]
* [http://www.open-mpi.org/faq/?category=openfabrics The Open MPI documentation] about [https://en.wikipedia.org/wiki/OpenFabrics_Alliance OpenFabrics] (ie: [https://en.wikipedia.org/wiki/InfiniBand Infiniband])
* [https://www.open-mpi.org/faq/?category=myrinet  The Open MPI documentation] about [https://en.wikipedia.org/wiki/Myrinet Myrinet]


{{Note|text=If you want to disable support for high performance networks, use '''--mca btl self,sm,tcp'''. This will switch to TCP, but if IP over InfiniBand is available, InfiniBand will still be used. To also disable IP emulation of the high performance interconnect, use '''--mca btl_tcp_if_exclude ib0,lo,myri0''' or select a specific interface with '''--mca btl_tcp_if_include eth1'''.}}
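For example, the options above can be combined on the <code class=command>mpirun</code> command line (an illustrative combination, reusing the sample program from this tutorial) to force plain TCP over Ethernet only:
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca orte_rsh_agent "oarsh" --mca btl self,sm,tcp --mca btl_tcp_if_exclude ib0,lo,myri0 -machinefile $OAR_NODEFILE ~/mpi/tp}}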


We will be using [http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz NetPIPE] to check the performance of these high performance interconnects.


To download, extract and compile NetPIPE, do:
{{Term|location=frontend|cmd=<code class="command">cd</code> ~/mpi}}
{{Term|location=frontend|cmd=<code class="command">wget</code> http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz}}
{{Note|text=Remember to configure the proxy if <code class="command">wget</code> freezes on "connecting"; see the [https://www.grid5000.fr/mediawiki/index.php/Getting_Started#Customizing_nodes_and_accessing_the_Internet relevant part of Getting Started].}}
{{Term|location=frontend|cmd=<code class="command">tar</code> -xf NetPIPE-3.7.1.tar.gz}}
{{Term|location=frontend|cmd=<code class="command">cd</code> NetPIPE-3.7.1}}
{{Term|location=frontend|cmd=<code class="command">make</code> mpi}}


As NetPIPE only works between two MPI processes, we will reserve one core on each of two distinct nodes. If your reservation includes more resources, you will have to create an MPI machinefile ('''--machinefile''') with only two entries, as follows:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l nodes=2}}
{{Term|location=node|cmd=<code class="command">uniq $OAR_NODEFILE &#124; head -n 2 > /tmp/machinefile</code>}}
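The resulting <code class="file">/tmp/machinefile</code> simply lists one hostname per line. With hypothetical node names, it would look like:
  node-1.site.grid5000.fr
  node-2.site.grid5000.fr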

InfiniBand hardware is available on several sites. For example, you will find clusters with an InfiniBand interconnect at Rennes (20G), Nancy (20G) and Grenoble (20G & 40G). Myrinet hardware is available at Lille (10G) (see the [https://www.grid5000.fr/mediawiki/index.php/Special:G5KHardware#High_performance_network_families Hardware page]).


To reserve one core on each of two distinct nodes, with:
* a 20G InfiniBand interconnect:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1  -p "ib20g='YES'"}}
* a 40G InfiniBand interconnect:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1 -p "ib40g='YES'"}}
* a 10G Myrinet interconnect:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1 -p "myri10g='YES'"}}
To test the network:
{{Term|location=node|cmd=<code class="command">cd</code> ~/mpi/NetPIPE-3.7.1}}
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca  orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi}}
 
To check if support for InfiniBand is available in Open MPI, run:
{{Term|location=node|cmd=<code class="command">ompi_info</code> &#124; grep openib}}
you should see something like this:
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)
To check if support for Myrinet is available in Open MPI, run:
{{Term|location=node|cmd=<code class="command">ompi_info </code> &#124; grep mx}}
(if the output is empty, there is no built-in MX support).


Without a high performance interconnect, the results look like this:
    0:        1 bytes  4080 times -->      0.31 Mbps in      24.40 usec
    1:        2 bytes  4097 times -->      0.63 Mbps in      24.36 usec
    ...
    122: 8388608 bytes      3 times -->    896.14 Mbps in  71417.13 usec
    123: 8388611 bytes      3 times -->    896.17 Mbps in  71414.83 usec
The latency is given by the last column for a 1 byte message; the maximum throughput is given by the last line (896.17 Mbps in this case).
 
With a Myrinet 2G network, a typical result looks like this:


This time we have:
  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec
  ...
  122: 8388608 bytes      3 times -->  1773.88 Mbps in  36079.17 usec
  123: 8388611 bytes      3 times -->  1773.56 Mbps in  36085.69 usec
In this example, we have 3.77 usec of latency and almost 1.8 Gbit/s of bandwidth.

With InfiniBand 40G (QDR), you should get much better performance than with Ethernet, Myrinet 2G or InfiniBand 20G:
  0:       1 bytes  30716 times -->      4.53 Mbps in       1.68 usec
  1:       2 bytes  59389 times -->      9.10 Mbps in       1.68 usec
  ...
  121: 8388605 bytes     17 times -->  25829.13 Mbps in    2477.82 usec
  122: 8388608 bytes     20 times -->  25841.35 Mbps in    2476.65 usec
  123: 8388611 bytes     20 times -->  25823.40 Mbps in    2478.37 usec

Less than 2 usec of latency and almost 26 Gbit/s of bandwidth!
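For the curious, the figures NetPIPE reports can be approximated with a simple MPI ping-pong loop. The following sketch (an illustration only, not part of NetPIPE) estimates the one-way latency for 1-byte messages between two processes using <code class=command>MPI_Wtime</code>:
<syntaxhighlight lang="c">
/* Ping-pong latency sketch: run with exactly two MPI processes. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, i, reps = 10000;
    char byte = 0;
    double t0, t1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);          /* start both processes together */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency is half of the average round-trip time */
        printf("approximate one-way latency: %.2f usec\n", (t1 - t0) / reps / 2 * 1e6);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>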


= More advanced use cases =

== Running MPI on several sites at once ==


In this tutorial, we use the following sites: rennes, sophia and grenoble. For making reservations on multiple sites, we will be using oargrid. See the [[Grid_jobs_management]] tutorial for more information.


{{Warning|text=There is still a problem when using lille and luxembourg nodes simultaneously.}}


{{Warning|text=Open MPI tries to figure out the best network interface at run time, and it also assumes that some networks are not routed between sites. To avoid this kind of problem, we must add the option '''--mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0''' to mpirun}}


{{Note|text=For multiple sites, we should only use TCP, as there is no MX or InfiniBand network between sites. Therefore, add this option to mpirun: '''--mca btl self,sm,tcp'''}}


The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do this with '''rsync'''. Suppose that you are connected at sophia and that you want to copy sophia's '''mpi/''' directory to grenoble and rennes:
{{Term|location=fsophia|cmd=<code class="command">rsync</code> -avz ~/mpi/ rennes.grid5000.fr:mpi/}}
{{Term|location=fsophia|cmd=<code class="command">rsync</code> -avz ~/mpi/ grenoble.grid5000.fr:mpi/}}
(you can also add the '''--delete''' option to remove extraneous files from the mpi directories of rennes and grenoble).

Reserve nodes on each site from any frontend with oargridsub (you can also add options to reserve nodes from specific clusters if you want to):
{{Term|location=frontend|cmd=<code class="command">oargridsub</code> -w 02:00:00 <code class="replace">rennes</code>:rdef="nodes=2",<code class="replace">grenoble</code>:rdef="nodes=2",<code class="replace">sophia</code>:rdef="nodes=2" > oargrid.out}}
Get the oargrid Id and Job key from the output of oargridsub:
{{Term|location=frontend|cmd=<code class="command">export</code> OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out &#124; cut -f2 -d: &#124; tr -d " ")}}
{{Term|location=frontend|cmd=<code class="command">export</code> OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out &#124; cut -f2 -d=)}}
Get the node list using oargridstat and copy the list to the first node:
{{Term|location=frontend|cmd=<code class="command">oargridstat</code> -w -l $OARGRID_JOB_ID  &#124; grep -v ^$ > ~/gridnodes}}
{{Term|location=frontend|cmd=<code class="command">oarcp</code> ~/gridnodes $(head -1 ~/gridnodes):}}
Connect to the first node:
{{Term|location=frontend|cmd=<code class="command">oarsh</code> $(head -1 ~/gridnodes)}}
And run your MPI application:
{{Term|location=node|cmd=<code class="command">cd</code> ~/mpi/}}
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile ~/gridnodes --mca orte_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp}}


== Compilation of Open MPI ==

If you want to use a custom version of Open MPI, you can compile it in your home directory. Make an interactive reservation and compile Open MPI from a node; this prevents overloading the site frontend:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I}}


Get Open MPI from the [http://www.open-mpi.org/software/ompi/ official website]:
{{Term|location=node|cmd=cd /tmp/}}
{{Term|location=node|cmd=<code class="command">wget</code> http://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2}}
{{Term|location=node|cmd=tar -xf openmpi-1.10.2.tar.bz2}}
{{Term|location=node|cmd=cd openmpi-1.10.2}}

Run '''configure''':
{{Term|location=node|cmd=<code class="command">./configure</code> --prefix=$HOME/openmpi/  --with-memory-manager=none}}


Compile:
{{Term|location=node|cmd=<code class="command">make</code> -j8}}

Install it in your home directory (in $HOME/openmpi/):
{{Term|location=node|cmd=<code class="command">make install</code>}}


To use this version of Open MPI, use <code class='command'>$HOME/openmpi/bin/mpicc</code> and <code class='command'>$HOME/openmpi/bin/mpirun</code>, or add the following to your configuration:
{{Term|location=node|cmd=<code class="command">export PATH=$HOME/openmpi/bin/:$PATH</code>}}

You should recompile your program before trying to use the new runtime environment.
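For instance, you could then rebuild and rerun the sample program from the beginning of this tutorial with the freshly installed toolchain (same options as before, only the paths change):
{{Term|location=node|cmd=<code class="command">$HOME/openmpi/bin/mpicc</code> ~/mpi/tp.c -o ~/mpi/tp}}
{{Term|location=node|cmd=<code class="command">$HOME/openmpi/bin/mpirun</code> --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp}}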


== Setting up and starting Open MPI on a kadeploy image ==
{{Warning|text=This part of the tutorial is [https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=5771 known to be buggy]. It is strongly recommended that you skip it, unless you have an important need for Myrinet networking -- however, be prepared to fix it if that is the case.}}
=== Building a kadeploy image ===
The default Open MPI version available in Debian-based distributions is not compiled with libraries for high performance networks like Myrinet/MX, therefore we must recompile Open MPI from source if we want to use Myrinet networks. Fortunately, every default image (jessie-x64-XXX) but the min variant includes the libraries for high performance interconnects, and Open MPI will find them at compile time.


We will create a kadeploy image based on an existing one.
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -l nodes=1,walltime=2 }}
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f $OAR_NODEFILE -e jessie-x64-base -k }}
Connect to the first node as root, and install Open MPI:
{{Term|location=frontend|cmd=<code class="command">ssh root@</code>$(head -1 $OAR_NODEFILE)}}
Download Open MPI:
{{Term|location=node|cmd=<code class="command">cd</code> /tmp/}}
{{Term|location=node|cmd=<code class="command">wget</code> http://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2}}
{{Term|location=node|cmd=tar -xf openmpi-1.10.2.tar.bz2}}
{{Term|location=node|cmd=cd openmpi-1.10.2}}
Install g++, make, gfortran, f2c and the BLAS library:
{{Term|location=node|cmd=<code class="command">apt-get</code> -y install g++ make gfortran f2c libblas-dev}}
Configure and compile:
{{Term|location=node|cmd=./configure --libdir=/usr/local/lib64 --with-memory-manager=none}}
{{Term|location=node|cmd=make -j8}}
{{Term|location=node|cmd=make install}}


To run MPI applications, we will create a dedicated user named mpi. We add it to the group rdma (for InfiniBand). We also copy the ~root/.ssh/authorized_keys file so that we can log in as user '''mpi''' from the frontend, and we create an SSH key for identifying the '''mpi''' user (needed by Open MPI).
 
<pre class="brush: bash">
<pre class="brush: bash">
useradd -m -g rdma mpi -d /var/mpi
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf
mkdir ~mpi/.ssh
cp ~root/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo "        StrictHostKeyChecking no" >> ~/.ssh/config
exit # exit the session as user mpi
exit # exit the root connection to the node
# You can then copy your files from the frontend to the mpi user's home directory:
rsync -avz ~/mpi/ mpi@$(head -1 $OAR_NODEFILE):mpi/ # copy the tutorial
</pre>

You can save the newly created disk image by using tgz-g5k:
{{Term|location=frontend|cmd=ssh root@$(head -1 $OAR_NODEFILE)}}
{{Term|location=node|cmd=<code class="command">tgz-g5k</code> /dev/shm/image.tgz}}
Disconnect from the node (exit). From the frontend, copy the image to the public directory:
{{Term|location=frontend|cmd=<code class="command">mkdir</code> -p ~/public}}
{{Term|location=frontend|cmd=<code class="command">scp</code> root@$(head -1 $OAR_NODEFILE):/dev/shm/image.tgz $HOME/public/jessie-openmpi.tgz}}
Copy the description file of jessie-x64-base:
{{Term|location=frontend|cmd=grep -v visibility /grid5000/descriptions/jessie-x64-base-2016011914.dsc > $HOME/public/jessie-openmpi.dsc}}
Change the image name in the description file; we will use an http URL for multi-site deployment:
<pre class="brush: bash">perl -i -pe "s@server:///grid5000/images/jessie-x64-base-2016011914.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/jessie-openmpi.tgz@" $HOME/public/jessie-openmpi.dsc
</pre>
Now you can terminate the job.

=== Using the image ===
==== Single site ====
{{Term|location=frontend|cmd=oarsub -I -t deploy -l /nodes=3}}
{{Term|location=frontend|cmd=kadeploy3 -a $HOME/public/jessie-openmpi.dsc -f $OAR_NODEFILE -k}}

{{Term|location=frontend|cmd=<code class="command">scp</code> $OAR_NODEFILE  mpi@$(head -1 $OAR_NODEFILE):nodes}}
Connect to the first node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> mpi@$(head -1 $OAR_NODEFILE)}}
{{Term|location=node|cmd=<code class="command">cd</code> ~/mpi/}}
{{Term|location=node|cmd=<code class="command">/usr/local/bin/mpicc</code> tp.c -o tp}}
{{Term|location=node|cmd=<code class="command">/usr/local/bin/mpirun</code> -machinefile ~/nodes ./tp}}
==== Single site with Myrinet hardware ====


{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -t deploy -l /nodes=2 -p "myri<code class="replace">10</code>g='YES'"}}
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -k -a ~/public/jessie-openmpi.dsc -f $OAR_NODEFILE}}

Create a nodefile with a single entry per node:
{{Term|location=frontend|cmd=<code class="command">uniq</code> $OAR_NODEFILE > nodes}}
Copy it to the first node:
{{Term|location=frontend|cmd=<code class="command">scp</code> nodes mpi@$(head -1 nodes):}}
Connect to the first node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> mpi@$(head -1 nodes)}}
{{Term|location=node|cmd=<code class="command">cd</code> ~/mpi/NetPIPE-3.7.1}}
{{Term|location=node|cmd=<code class="command">/usr/local/bin/mpirun</code> -machinefile ~/nodes NPmpi}}



==== Multiple sites ====
Choose three clusters from 3 different sites.
{{Term|location=frontend|cmd=<code class="command">oargridsub</code> -t deploy -w 02:00:00 <code class="replace">cluster1</code>:rdef="nodes=2",<code class="replace">cluster2</code>:rdef="nodes=2",<code class="replace">cluster3</code>:rdef="nodes=2" > oargrid.out}}
{{Term|location=frontend|cmd=<code class="command">export</code> OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out &#124; cut -f2 -d=)}}
Get the node list using oargridstat:
{{Term|location=frontend|cmd=<code class="command">oargridstat</code> -w -l $OARGRID_JOB_ID &#124;grep grid > ~/gridnodes}}


Deploy on all sites using the --multi-server option:
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f gridnodes -a $HOME/public/jessie-openmpi.dsc -k --multi-server -o ~/nodes.deployed}}
{{Term|location=frontend|cmd=<code class="command">scp</code> ~/nodes.deployed mpi@$(head -1 ~/nodes.deployed):}}
Connect to the first node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> mpi@$(head -1 ~/nodes.deployed)}}
{{Term|location=node|cmd=<code class="command">cd</code> ~/mpi/}}
{{Term|location=node|cmd=<code class="command">/usr/local/bin/mpirun</code> -machinefile ~/nodes.deployed --mca btl self,sm,tcp --mca opal_net_private_ipv4  "192.168.7.0/24\;192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18\;192.168.200.0/23" tp}}


== MPICH2 ==


If you want to use MPICH2, you can use a script like this to launch mpd/mpirun:
  NODES=$(uniq < $OAR_NODEFILE | wc -l | tr -d ' ')
  NPROCS=$(wc -l < $OAR_NODEFILE | tr -d ' ')
  mpdboot --rsh=oarsh --totalnum=$NODES --file=$OAR_NODEFILE
  sleep 1
  mpirun -n $NPROCS <code class="replace">mpich2binary</code>

Revision as of 11:02, 28 January 2016

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

MPI is a programming interface that enables the communication between processes of a distributed memory system. This tutorial focus on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorials on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.

Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial) . For the second part of this tutorial, you should also know the basics about OARGRID (see the Advanced OAR tutorial) and Kadeploy (see the Getting Started tutorial).

Running MPI on Grid'5000

When attempting to run MPI on Grid'5000 you'll be faced with a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:

  • Setting up and starting Open MPI on a default environment using oarsh.
  • Setting up and starting Open MPI on a default environment using a allow_classic_ssh.
  • Setting up and starting Open MPI to use high performance interconnect.
  • Setting up and starting Open MPI to run on several sites using oargridsub.
  • Setting up and starting Open MPI on a kadeploy image.

Using Open MPI on a default environment

The default Grid'5000 environment provides Open MPI 1.6.5 (see ompi_info).

Creating a sample MPI program

For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes and the value he received from the process 0.

On your home directory, create a file ~/mpi/tp.c and copy the source code:

Terminal.png frontend:
mkdir ~/mpi
Terminal.png frontend:
vi ~/mpi/tp.c
#include <stdio.h>
#include <mpi.h>
#include <time.h> /* for the work function only */

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int i, pid;
       int bcast_value = 1;

       gethostname(hostname, sizeof hostname);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast(&bcast_value,1 ,MPI_INT, 0, MPI_COMM_WORLD );
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize();
       return 0;
}

You can then compile your code:

Terminal.png frontend:
mpicc ~/mpi/tp.c -o ~/mpi/tp

Setting up and starting Open MPI on a default environment using oarsh

Submit a job:

Terminal.png frontend:
oarsub -I -l nodes=3

You can connect to the reserved nodes using oarsh which is a wrapper around the ssh command that handle the configuration of the SSH environment. As Open MPI defaults to using ssh for remote startup of processes, you need to add the option --mca orte_rsh_agent "oarsh" to your mpirun command line.

Note.png Note

For Debian Wheezy, uses plm_rsh_agent instead of orte_rsh_agent

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp

You can also set an environment variable (usually in your .bashrc):

Terminal.png bashrc:
export OMPI_MCA_orte_rsh_agent=oarsh
Terminal.png node:
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Open MPI also provides a configuration file for --mca parameters. In your home directory, create a file as ~/.openmpi/mca-params.conf

orte_rsh_agent=oarsh
filem_rsh_agent=oarcp

You should have something like:

helios-52       - 4 - 12 - 42
helios-51       - 0 - 12 - 42
helios-52       - 5 - 12 - 42
helios-51       - 2 - 12 - 42
helios-52       - 6 - 12 - 42
helios-51       - 1 - 12 - 42
helios-51       - 3 - 12 - 42
helios-52       - 7 - 12 - 42
helios-53       - 8 - 12 - 42
helios-53       - 9 - 12 - 42
helios-53       - 10 - 12 - 42
helios-53       - 11 - 12 - 42

You may have (lot's of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important as we will learn how to select clusters with high performance interconnect in greater details below. Error messages might look like this:

[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: helios-8.sophia.grid5000.fr

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc

or like this:

[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...

You could use FAQ#How_to_use_MPI_in_Grid5000.3F to avoid this warnings.

Setting up and starting Open MPI on a default environment using allow_classic_ssh

Submit a job with the allow_classic_ssh type:

Terminal.png frontend:
oarsub -I -t allow_classic_ssh -l nodes=3

Launch your parallel job:

Terminal.png node:
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Setting up and starting Open MPI to use high performance interconnect

By default, Open MPI tries to use any high performance interconnect it can find. But it works only if the related libraries were found during the compilation of Open Mpi (not during the compilation of your application). It should work if you built Open MPI on a jessie-x64 environment, and it also works correctly on the default environment.

Options can be used to either select or disable an interconnect.

MCA parameters (--mca) can be used to select the drivers that are used at run-time by Open MPI. To learn more about the MCA parameters, see also:

Note.png Note

If you want to disable support for high performance networks, use --mca btl self,sm,tcp. This will switch to TCP but if IPoverIB is available, Infiniband will still be used. To also disable IP emulation of high performance interconnect, use --mca btl_tcp_if_exclude ib0,lo,myri0 or select a specific interface with --mca btl_tcp_if_include eth1.

We will be using NetPIPE to check the performances of high performance interconnects.

To download, extract and compile NetPIPE, do:

Terminal.png frontend:
cd ~/mpi
Terminal.png frontend:
tar -xf NetPIPE-3.7.1.tar.gz
Terminal.png frontend:
cd NetPIPE-3.7.1
Terminal.png frontend:
make mpi

As NetPipe only works between two MPI processes, we will reserve one core on two distinct nodes. If your reservation includes more resources, you will have to create a MPI machinefile file (--machinefile) with only two entries as follow:

Terminal.png frontend:
oarsub -I -l nodes=2
Terminal.png node:
uniq $OAR_NODEFILE | head -n 2 > /tmp/machinefile

Infiniband hardware is available on several sites. For example, you will find clusters with Infiniband interconnect at Rennes (20G), Nancy (20G) and Grenoble (20G & 40G). Myrinet hardware is available at Lille (10G) (see Hardware page).

To reserve two core on two distinct nodes with:

  • a 20G InfiniBand interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"
  • a 40G InfiniBand interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "ib40g='YES'"
  • a 10G Myrinet interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "myri10g='YES'"

To test the network:

Terminal.png node:
cd ~/mpi/NetPIPE-3.7.1
Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi

To check if the support for InfiniBand is available in Open MPI, run:

Terminal.png node:
ompi_info | grep openib

you should see something like this:

                MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)

To check if the support for Myrinet is available in Open MPI, run:

Terminal.png node:
ompi_info | grep mx

(if the output is empty, there is no builtin mx support).

Without high performance interconnect, results looks like this:

    0:         1 bytes   4080 times -->      0.31 Mbps in      24.40 usec     
    1:         2 bytes   4097 times -->      0.63 Mbps in      24.36 usec     
    ...
    122: 8388608 bytes      3 times -->    896.14 Mbps in   71417.13 usec
    123: 8388611 bytes      3 times -->    896.17 Mbps in   71414.83 usec

The latency is given by the last column for a 1 byte message; the maximum throughput is given by the last line (896.17 Mbps in this case).

With a Myrinet2G network, typical result looks like this:

This time we have:

  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec     
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec     
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec

In this example, we have 3.77 ms of latency and almost 1.8 Gbit/s of bandwitdh.

With Infiniband 40G (QDR), you should have much better performance that using Ethernet or Myrinet 2G or Infiniband 20G:

 0:       1 bytes  30716 times -->      4.53 Mbps in       1.68 usec
 1:       2 bytes  59389 times -->      9.10 Mbps in       1.68 usec
...
121: 8388605 bytes     17 times -->  25829.13 Mbps in    2477.82 usec
122: 8388608 bytes     20 times -->  25841.35 Mbps in    2476.65 usec
123: 8388611 bytes     20 times -->  25823.40 Mbps in    2478.37 usec

Less than 2 ms of latency and almost 26 Gbit/s of bandwitdh !

More advanced use cases

Running MPI on several sites at once

In this tutorial, we use the following sites: rennes, sophia and grenoble. For making reservation on multiple sites, we will be using oargrid. See the Grid_jobs_management tutorial for more information.

Warning.png Warning

There is still a problem when using lille and luxembourg nodes simultaneously.

Warning.png Warning

Open MPI tries to figures out the best network interface at run time, and it also assumes that some networks are not routed between sites. To avoid this kind of problems, we must add the option --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 to mpirun

Note.png Note

For multiple sites, we should only use TCP, and there is no MX or InfiniBand network between sites. Therefore, add this option to mpirun: --mca btl self,sm,tcp

The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do this with rsync. Suppose that you are connected at sophia and that you want to copy the sophia's mpi/ directoy to grenoble and rennes.

Terminal.png fsophia:
rsync -avz ~/mpi/ rennes.grid5000.fr:mpi/
Terminal.png fsophia:
rsync -avz ~/mpi/ grenoble.grid5000.fr:mpi/

(you can also add the --delete option to remove extraneous files from the mpi directories at rennes and grenoble; see the example below).
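
For example, a possible invocation with --delete for one of the destinations (same paths as above):
Terminal.png fsophia:
rsync -avz --delete ~/mpi/ rennes.grid5000.fr:mpi/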

Reserve nodes on each site from any frontend with oargridsub (you can also add options to reserve nodes on specific clusters if you want):

Terminal.png frontend:
oargridsub -w 02:00:00 rennes:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out

Get the oargrid reservation ID and the job key from the output of oargridsub:

Terminal.png frontend:
export OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " ")
Terminal.png frontend:
export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)

Get the node list using oargridstat and copy the list to the first node:

Terminal.png frontend:
oargridstat -w -l $OARGRID_JOB_ID | grep -v ^$ > ~/gridnodes
Terminal.png frontend:
oarcp ~/gridnodes $(head -1 ~/gridnodes):

Connect to the first node:

Terminal.png frontend:
oarsh $(head -1 ~/gridnodes)

And run your MPI application:

Terminal.png node:
cd ~/mpi/
Terminal.png node:
mpirun -machinefile ~/gridnodes --mca orte_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp

== Compilation of Open MPI ==

If you want to use a custom version of Open MPI, you can compile it in your home directory. Make an interactive reservation and compile Open MPI on a node, to avoid overloading the site frontend:

Terminal.png frontend:
oarsub -I

Get Open MPI from the official website:

Terminal.png node:
cd /tmp/
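
If the tarball is not already present in /tmp, download it first; the URL below is an assumption, so check the Open MPI download page for the current link:
Terminal.png node:
wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2  # URL is an assumption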
Terminal.png node:
tar -xf openmpi-1.10.2.tar.bz2
Terminal.png node:
cd openmpi-1.10.2

Run configure:

Terminal.png node:
./configure --prefix=$HOME/openmpi/ --with-memory-manager=none

Compile:

Terminal.png node:
make -j8

Install it in your home directory (in $HOME/openmpi/):

Terminal.png node:
make install

To use this version of Open MPI, either call $HOME/openmpi/bin/mpicc and $HOME/openmpi/bin/mpirun explicitly, or add the following to your environment:

Terminal.png node:
export PATH=$HOME/openmpi/bin/:$PATH
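
Depending on how your program is linked, you may also need to point the dynamic loader to the matching libraries (the path assumes the prefix used above):
Terminal.png node:
export LD_LIBRARY_PATH=$HOME/openmpi/lib/:$LD_LIBRARY_PATH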

You should recompile your program before trying to use the new runtime environment.
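
For example, assuming the tp.c example used elsewhere in this tutorial:
Terminal.png node:
$HOME/openmpi/bin/mpicc tp.c -o tp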

== Setting up and starting Open MPI on a kadeploy image ==

Warning.png Warning

This part of the tutorial is known to be buggy. It is strongly recommended that you skip it unless you really need Myrinet networking; in that case, be prepared to fix problems yourself.

=== Building a kadeploy image ===

The default Open MPI packages in Debian-based distributions are not compiled against the libraries for high performance networks such as Myrinet/MX, so we must recompile Open MPI from source to use a Myrinet network. Fortunately, every default image (jessie-x64-XXX) except the min variant includes the libraries for high performance interconnects, and Open MPI will find them at compile time.

We will create a kadeploy image based on an existing one.

Terminal.png frontend:
oarsub -I -t deploy -l nodes=1,walltime=2
Terminal.png frontend:
kadeploy3 -f $OAR_NODEFILE -e jessie-x64-base -k

Connect to the first node as root, and install Open MPI:

Terminal.png frontend:
ssh root@$(head -1 $OAR_NODEFILE)

Download Open MPI:

Terminal.png node:
cd /tmp/
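
As in the previous section, download the tarball into /tmp if it is not already there; the URL below is an assumption, so check the Open MPI download page for the current link:
Terminal.png node:
wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2  # URL is an assumption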
Terminal.png node:
tar -xf openmpi-1.10.2.tar.bz2
Terminal.png node:
cd openmpi-1.10.2

Install g++, make, gfortran, f2c and the BLAS library:

Terminal.png node:
apt-get -y install g++ make gfortran f2c libblas-dev

Configure and compile:

Terminal.png node:
./configure --libdir=/usr/local/lib64 --with-memory-manager=none
Terminal.png node:
make -j8
Terminal.png node:
make install

To run MPI applications, we will create a dedicated user named mpi. We add it to the rdma group for InfiniBand. We also copy the ~root/.ssh/authorized_keys file so that we can log in as the mpi user from the frontend, and we create an SSH key for the mpi user (needed by Open MPI).

# create the mpi user, in the rdma group (for InfiniBand)
useradd -m -g rdma mpi -d /var/mpi
# allow unlimited locked memory (required for InfiniBand)
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf

# allow login as mpi with the same keys as root, and create an SSH key for the mpi user
mkdir ~mpi/.ssh
cp ~root/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo "        StrictHostKeyChecking no" >> ~/.ssh/config
exit # exit session as MPI user
exit # exit the root connection to the node

You can then copy your files (e.g. the tutorial's mpi/ directory) from the frontend to the mpi user's home directory:

Terminal.png frontend:
rsync -avz ~/mpi/ mpi@$(head -1 $OAR_NODEFILE):mpi/

You can save the newly created disk image by using tgz-g5k:

Terminal.png frontend:
ssh root@$(head -1 $OAR_NODEFILE)
Terminal.png node:
tgz-g5k /dev/shm/image.tgz

Disconnect from the node (exit). From the frontend, copy the image to the public directory:

Terminal.png frontend:
mkdir ~/public
Terminal.png frontend:
scp root@$(head -1 $OAR_NODEFILE):/dev/shm/image.tgz $HOME/public/jessie-openmpi.tgz

Copy the description file of jessie-x64-base:

Terminal.png frontend:
grep -v visibility /grid5000/descriptions/jessie-x64-base-2016011914.dsc > $HOME/public/jessie-openmpi.dsc

Change the image name in the description file; we will use an http URL for multi-site deployment:

Terminal.png frontend:
perl -i -pe "s@server:///grid5000/images/jessie-x64-base-2016011914.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/jessie-openmpi.tgz@" $HOME/public/jessie-openmpi.dsc
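
As a quick sanity check, you can verify that the description file now points to the http URL:
Terminal.png frontend:
grep http $HOME/public/jessie-openmpi.dsc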

Now you can terminate the job:

Terminal.png frontend:
oardel $OAR_JOB_ID

=== Using a kadeploy image ===

==== Single site ====

Terminal.png frontend:
oarsub -I -t deploy -l /nodes=3
Terminal.png frontend:
kadeploy3 -a $HOME/public/jessie-openmpi.dsc -f $OAR_NODEFILE -k
Terminal.png frontend:
scp $OAR_NODEFILE mpi@$(head -1 $OAR_NODEFILE):nodes

Connect to the first node:

Terminal.png frontend:
ssh mpi@$(head -1 $OAR_NODEFILE)
Terminal.png node:
cd ~/mpi/
Terminal.png node:
/usr/local/bin/mpicc tp.c -o tp
Terminal.png node:
/usr/local/bin/mpirun -machinefile ~/nodes ./tp

==== Single site with Myrinet hardware ====

Terminal.png frontend:
oarsub -I -t deploy -l /nodes=2 -p "myri10g='YES'"
Terminal.png frontend:
kadeploy3 -k -a ~/public/jessie-openmpi.dsc -f $OAR_NODEFILE

Create a nodefile with a single entry per node:

Terminal.png frontend:
uniq $OAR_NODEFILE > nodes

Copy it to the first node:

Terminal.png frontend:
scp nodes mpi@$(head -1 nodes):

Connect to the first node:

Terminal.png frontend:
ssh mpi@$(head -1 nodes)
Terminal.png node:
cd ~/mpi/NetPIPE-3.7.1
Terminal.png node:
/usr/local/bin/mpirun -machinefile ~/nodes NPmpi

This time, the results should look like this:

  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec     
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec     
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec

We get a latency of 3.77 µs, which is good, and a bandwidth of almost 1.8 Gbit/s: we are indeed using the Myrinet interconnect!

==== Multiple sites ====

Choose three clusters from three different sites.

Terminal.png frontend:
oargridsub -t deploy -w 02:00:00 cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2" > oargrid.out
Terminal.png frontend:
export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)

Get the node list using oargridstat:

Terminal.png frontend:
oargridstat -w -l $OARGRID_JOB_ID |grep grid > ~/gridnodes


Deploy on all sites using the --multi-server option:

Terminal.png frontend:
kadeploy3 -f ~/gridnodes -a $HOME/public/jessie-openmpi.dsc -k --multi-server -o ~/nodes.deployed
Terminal.png frontend:
scp ~/nodes.deployed mpi@$(head -1 ~/nodes.deployed):

Connect to the first node:

Terminal.png frontend:
ssh mpi@$(head -1 ~/nodes.deployed)
Terminal.png node:
cd ~/mpi/
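
If the tp binary has not been compiled against the Open MPI installed on this image yet, build it first, as in the single-site case:
Terminal.png node:
/usr/local/bin/mpicc tp.c -o tp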
Terminal.png node:
/usr/local/bin/mpirun -machinefile ~/nodes.deployed --mca btl self,sm,tcp --mca opal_net_private_ipv4 "192.168.7.0/24\;192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18\;192.168.200.0/23" tp

= MPICH2 =

Warning.png Warning

This documentation covers MPICH2 with the MPD process manager, but the default process manager for recent MPICH2 releases is now Hydra. See also the MPICH documentation.

If you want or need to use MPICH2 on Grid'5000, proceed as follows.

First, do this once on each site:

Terminal.png frontend:
echo "MPD_SECRETWORD=secret" > $HOME/.mpd.conf
Terminal.png frontend:
chmod 600 $HOME/.mpd.conf

Then you can use a script like this to launch mpd/mpirun:

# number of distinct nodes and total number of MPI processes
NODES=$(uniq < $OAR_NODEFILE | wc -l | tr -d ' ')
NPROCS=$(wc -l < $OAR_NODEFILE | tr -d ' ')
# start one MPD daemon per node, using oarsh as the remote shell
mpdboot --rsh=oarsh --totalnum=$NODES --file=$OAR_NODEFILE
sleep 1
# launch the MPI program (replace mpich2binary with your executable)
mpirun -n $NPROCS mpich2binary
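
Once the run has finished, you may want to shut down the MPD daemons (mpdallexit is part of the MPD tool set):

mpdallexit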