Run MPI On Grid'5000
{{Portal|User}}
{{Portal|Tutorial}}
{{Portal|HPC}}
{{TutorialHeader}}
= Introduction =


[https://en.wikipedia.org/wiki/Message_Passing_Interface MPI] is a programming interface that enables the communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the [http://www.idris.fr/formations/mpi/ IDRIS course on MPI]. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the [http://www.open-mpi.org Open MPI] implementation.


Before following this tutorial you should already have some basic knowledge of OAR (see the [[Getting Started]] tutorial). For the second part of this tutorial, you should also know the basics about Multi-site reservation (see the [[Advanced OAR]] tutorial).


== Running MPI on Grid'5000 ==
When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:
 
* Setting up and starting Open MPI on a default environment.
* Setting up and starting Open MPI to use high performance interconnect.
* Setting up and starting the latest Open MPI library version.
* Setting up and starting Open MPI to run on several sites using <code class='command'>funk</code>.


= Using Open MPI on a default environment =


The default Grid'5000 environment provides Open MPI 4.1.0 (see <code class='command'>ompi_info</code>).
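As a quick check, <code class='command'>ompi_info</code> prints the version and the list of compiled-in MCA components; for example:
{{Term|location=frontend|cmd=<code class="command">ompi_info</code> &#124; head}}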


== Creating a sample MPI program ==


For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes and the value it received from process 0.
In your home directory, create a file <code class="file">~/mpi/tp.c</code> and copy the source code:

{{Term|location=frontend|cmd=<code class="command">mkdir</code> ~/mpi<br>
<code class="command">vi</code> ~/mpi/tp.c}}
<syntaxhighlight lang="c">#include <stdio.h>
<syntaxhighlight lang="c">#include <stdio.h>
#include <mpi.h>
#include <mpi.h>
#include <time.h> /* for the work function only */
#include <time.h> /* for the work function only */
#include <unistd.h>


int main (int argc, char *argv []) {
int main (int argc, char *argv []) {
       char hostname[257];
       char hostname[257];
       int size, rank;
       int size, rank;
      int i, pid;
       int bcast_value = 1;
       int bcast_value = 1;


       gethostname (hostname, sizeof hostname);
       gethostname(hostname, sizeof hostname);
       MPI_Init (&argc, &argv);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank (MPI_COMM_WORLD, &rank);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size (MPI_COMM_WORLD, &size);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
       if (!rank) {
             bcast_value = 42;
             bcast_value = 42;
       }
       }
       MPI_Bcast (&bcast_value,1 ,MPI_INT, 0, MPI_COMM_WORLD );
       MPI_Bcast(&bcast_value,1 ,MPI_INT, 0, MPI_COMM_WORLD );
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);
       fflush(stdout);


       MPI_Barrier (MPI_COMM_WORLD);
       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize ();
       MPI_Finalize();
       return 0;
       return 0;
}
}
</syntaxhighlight>
</syntaxhighlight>


You can then compile your code:
{{Term|location=frontend|cmd=<code class="command">mpicc</code> ~/mpi/tp.c -o ~/mpi/tp}}

== Setting up and starting Open MPI on a default environment ==


Submit a job:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l nodes=3}}

OAR provides the $OAR_NODEFILE file, containing the list of nodes of the job (one line per CPU core). You can use it to launch your parallel job:
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile $OAR_NODEFILE ~/mpi/tp}}


You should have something like:
 helios-52       - 4 - 12 - 42
 helios-51       - 0 - 12 - 42
 helios-52       - 5 - 12 - 42
 helios-51       - 2 - 12 - 42
 helios-52       - 6 - 12 - 42
 helios-51       - 1 - 12 - 42
 helios-51       - 3 - 12 - 42
 helios-52       - 7 - 12 - 42
 helios-53       - 8 - 12 - 42
 helios-53       - 9 - 12 - 42
 helios-53       - 10 - 12 - 42
 helios-53       - 11 - 12 - 42


You may have (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will learn how to select clusters with high performance interconnects in greater detail below. Error messages might look like this:
  [1637577186.512697] [taurus-10:2765 :0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
  [1637577186.512752] [taurus-10:2765 :0]    ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
  [taurus-10.lyon.grid5000.fr:02765] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273 Error: Failed to create UCP worker

To tell Open MPI not to try to use high performance hardware and avoid those warning messages, use the following options:

{{Term|location=node|cmd=<code class="command"> mpirun --mca pml ^ucx -machinefile $OAR_NODEFILE  $HOME/mpi_program</code>}}
 
{{Note|text=In case you submitted a job that does not reserve entire nodes (all CPU cores of all nodes), you have to use <code class=command>oarsh</code> as the MPI remote shell connector. To do so with Open MPI, you can add <code class=command>--mca orte_rsh_agent "oarsh"</code> to your <code class=command>mpirun</code> command line. Open MPI will then use <code class=command>oarsh</code> in place of <code class=command>ssh</code>.
 
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp}}
 
You can also set an environment variable (usually in your .bashrc):
{{Term|location=bashrc|cmd=<code class="command">export</code> OMPI_MCA_orte_rsh_agent=oarsh}}
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile $OAR_NODEFILE ~/mpi/tp}}
 
Open MPI also provides a configuration file for <code class=command>--mca</code> parameters. In your home directory, create the file <code class="file">~/.openmpi/mca-params.conf</code> with the following content:
<pre class="brush: bash">
orte_rsh_agent=oarsh
filem_rsh_agent=oarcp
</pre>
}}


== Setting up and starting Open MPI to use high performance interconnect ==

Open MPI provides several alternative ''components'' to use High Performance Interconnect hardware (such as Infiniband or Omni-Path).

MCA parameters ('''--mca''') can be used to select the components that are used at run-time by Open MPI. To learn more about the MCA parameters, see also:
* [https://www.open-mpi.org/faq/?category=tuning#mca-params The Open MPI FAQ about tuning parameters]
* [http://www.open-mpi.org/faq/?category=tcp#tcp-selection How do I tell Open MPI which IP interfaces / networks to use?]
* [http://www.open-mpi.org/faq/?category=openfabrics The Open MPI documentation] about [https://en.wikipedia.org/wiki/OpenFabrics_Alliance OpenFabrics] (ie: [https://en.wikipedia.org/wiki/InfiniBand Infiniband])
* [https://www.open-mpi.org/faq/?category=opa The Open MPI documentation] about Omni-Path. [https://www.intel.com/content/www/us/en/support/articles/000016242/network-and-i-o/fabric-products.html Intel documentation about Omni-Path tools].

Open MPI packaged in Grid'5000 debian11 includes [https://openucx.org/ ''ucx''], [https://ofiwg.github.io/libfabric/ ''ofi''] and ''openib'' components to make use of Infiniband networks, and [https://github.com/cornelisnetworks/opa-psm2 ''psm2''] and ''ofi'' to make use of Omni-Path networks.

If you want to disable some of these components, you can for example use '''--mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib'''. This disables all the high performance components mentioned above and forces Open MPI to use its TCP backend.
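For instance, a complete command line forcing the plain TCP backend for the sample program of this tutorial could look like the following (a sketch only, to be adapted to your own program):
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib -machinefile $OAR_NODEFILE ~/mpi/tp}}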
 


=== Infiniband network ===


By default, Open MPI tries to use the Infiniband high performance interconnect using the [https://openucx.org/ UCX] component. When an Infiniband network is available, it will provide the best results in most cases (UCX even uses multiple interfaces at a time when available).
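If you want to check that UCX is indeed selected, you can request it explicitly (a minimal sketch; when forced this way, mpirun will fail instead of silently falling back to TCP if UCX cannot be used):
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca pml ucx -machinefile $OAR_NODEFILE ~/mpi/tp}}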


{{Term|location=frontend|cmd=<code class="command">wget</code> http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz }}
=== Omni-Path network ===

For Open MPI to work with Omni-Path network hardware, the PSM2 component must be used. It is recommended to explicitly disable the other components (otherwise, Open MPI will try to load them, which may produce error messages or provide lower performance):

{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile $OAR_NODEFILE -mca mtl psm2 -mca pml ^ucx,ofi -mca btl ^ofi,openib  ~/mpi/tp}}


{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1 -p "ib10g='YES'"}}
=== IP over Infiniband or Omni-Path ===
or for 20G:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1  -p "ib20g='YES'"}}


To test the network:
Nodes with Infiniband or Omni-Path network interfaces also provide an '''IP over Infiniband''' interface (these interfaces are named '''ibX'''). The TCP backend of Open MPI will try to use them by default.
{{Term|location=node|cmd=<code class="command">cd</code> $HOME/src/mpi/NetPIPE-3.7.1}}
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca  plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi}}


To check if the support for infiniband is available in Open MPI, run:
You can explicitely select interfaces used by the TCP backend using for instance '''--mca btl_tcp_if_exclude ib0,lo''' (to avoid using IP over Infiniband and local interfaces) or '''--mca btl_tcp_if_include eno2''' (to force using the 'regular' Ethernet interface eno2).
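As an illustration, the interface filtering can be combined with the component selection seen above (a sketch, assuming you want to force the TCP backend over the regular Ethernet network):
{{Term|location=node|cmd=<code class="command">mpirun</code> --mca pml ^ucx --mca btl self,vader,tcp --mca btl_tcp_if_exclude ib0,lo -machinefile $OAR_NODEFILE ~/mpi/tp}}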


{{Term|location=node|cmd=<code class="command">ompi_info</code> &#124; grep openib}}
you should see something like this:
                MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)


With Infiniband 40G (QDR), you should have much better performance that using ethernet or Myrinet 2G:
== Benckmarking ==


  0:       1 bytes  30716 times -->      4.53 Mbps in      1.68 usec
We will be using [http://mvapich.cse.ohio-state.edu/benchmarks/ OSU micro benchmark] to check the performances of high performance interconnects.
  1:      2 bytes  59389 times -->      9.10 Mbps in      1.68 usec
...
121: 8388605 bytes    17 times -->  25829.13 Mbps in    2477.82 usec
122: 8388608 bytes    20 times -->  25841.35 Mbps in    2476.65 usec
123: 8388611 bytes    20 times -->  25823.40 Mbps in    2478.37 usec


Less than 2 microsec in latency and almost 26Gbit/s in bandwitdh !
To download, extract and compile our benchmark, do:


{{Term|location=frontend|cmd=<code class="command">cd</code> ~/mpi<br>
<code class="command">wget</code> https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.8.tgz<br>
<code class="command">tar</code> xf osu-micro-benchmarks-5.8.tgz<br>
<code class="command">cd</code> osu-micro-benchmarks-5.8/<br>
<code class="command">./configure </code> CC=$(which mpicc) CXX=$(which mpicxx)<br>
<code class="command">make</code>}}


As we will benchmark two MPI processes running on two distinct nodes, reserve two nodes:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l nodes=2}}


To start the network benchmark, use:
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile $OAR_NODEFILE -npernode 1 ~/mpi/osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency}}


The option <code>-npernode 1</code> tells mpirun to spawn only one process on each node, as the benchmark requires only two processes to communicate.


You can then try to compare the performance of the various network hardware available on Grid'5000. See for instance the Network section of the [[Hardware#Network_interface_models|Hardware page]].


OAR can select nodes according to properties related to network performance. For example:
* To reserve one core of two distinct nodes with a 56Gbps InfiniBand interconnect:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1 -p "ib_rate=56"}}
* To reserve one core of two distinct nodes with a 100Gbps Omni-Path interconnect:
{{Term|location=frontend|cmd=<code class="command">oarsub</code> -I -l /nodes=2/core=1 -p "opa_rate=100"}}


== Use a newer Open MPI version using modules ==

Newer Open MPI versions are available as modules. To list the available versions, use this command:


{{Term|location=frontend|cmd=<code class="command">module av openmpi/</code>}}
<pre>
[...]
openmpi/4.1.4_gcc-10.4.0
[...]
</pre>


This will load the chosen Open MPI version into your environment:
 
{{Term|location=frontend|cmd=<code class="command">module load openmpi/4.1.4_gcc-10.4.0</code><br>
<code class="command">mpirun --version</code>}}


You must recompile the simple MPI example on the frontend with this new version:


{{Term|location=frontend|cmd=<code class="command">mpicc ~/mpi/tp.c -o ~/mpi/tp</code>}}


From your job, you must ensure the same Open MPI version is used on every node:


{{Term|location=frontend|cmd=<code class="command">oarsub -I -l nodes=3</code>}}


{{Term|location=node|cmd=<code class="command">module load openmpi/4.1.4_gcc-10.4.0</code><br>
<code class="command">$(which mpirun) -machinefile $OAR_NODEFILE ~/mpi/tp</code>}}


Note that the <code>$(which mpirun)</code> command is used in this last step to ensure that the mpirun from the module environment is used.
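An alternative sketch (not required if you load the module in your job as shown above) is to give mpirun the installation prefix of the module explicitly, using Open MPI's <code class='command'>--prefix</code> option:
{{Term|location=node|cmd=<code class="command">mpirun</code> --prefix $(dirname $(dirname $(which mpirun))) -machinefile $OAR_NODEFILE ~/mpi/tp}}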


== Installing Open MPI in a conda environment ==


Here is an example of installing Open MPI in a conda environment, built with UCX to take advantage of the cluster's high-bandwidth, low-latency interconnect.
{{Term|location=frontend|cmd=<code class="command">oargridsub</code>  -w 02:00:00 <code class="replace">rennes</code>:rdef="nodes=2",<code class="replace">grenoble</code>:rdef="nodes=2",<code class="replace">sophia</code>:rdef="nodes=2" > oargrid.out}}
Get the oargrid Id and Job key from the output of oargridsub:
{{Term|location=frontend|cmd=<code class="command">export</code> OAR_JOB_KEY_FILE=`grep "SSH KEY" oargrid.out &#124; cut -f2 -d: &#124; tr -d " "`}}
{{Term|location=frontend|cmd=<code class="command">export</code> OARGRID_JOB_ID=`grep "Grid reservation id" oargrid.out &#124; cut -f2 -d=`}}
Get the node list using oargridstat and copy the list to the first node:
{{Term|location=frontend|cmd=<code class="command">oargridstat</code> -w -l $OARGRID_JOB_ID  &#124; grep -v ^$ > ~/gridnodes}}
{{Term|location=frontend|cmd=<code class="command">oarcp</code> ~/gridnodes `head -1 ~/gridnodes`:}}
Connect to the first node:
{{Term|location=frontend|cmd=<code class="command">oarsh</code> `head -1 ~/gridnodes`}}
And run your MPI application:
{{Term|location=node|cmd=<code class="command">cd</code> $HOME/src/mpi/}}
{{Term|location=node|cmd=<code class="command">mpirun</code> -machinefile ~/gridnodes --mca plm_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp}}


In this example, we work on the Grenoble site.


UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.

; Installation

* Install Open MPI, ucx, and GCC from the '''conda-forge''' channel in the <code><env_name></code> environment
{{Term|location=fgrenoble|cmd=<code class="command">conda create -y -n <env_name></code><br>
<code class="command">conda activate <env_name></code><br>
<code class="command">conda install -c conda-forge gcc_linux-64 openmpi ucx</code>}}
Note: do not forget to create a dedicated environment before.

; Test installation via NetPIPE

* Install NetPIPE to test latency and network throughput (NetPIPE is not available as a conda package)
{{Term|location=fgrenoble|cmd=<code class="command">cd $HOME ; mkdir SRC && cd SRC</code><br>
<code class="command">wget https://src.fedoraproject.org/lookaside/pkgs/NetPIPE/NetPIPE-3.7.2.tar.gz/653071f785404bb68f8aaeff89fb1f33/NetPIPE-3.7.2.tar.gz</code><br>
<code class="command">tar zvxf NetPIPE-3.7.2.tar.gz</code><br>
<code class="command">cd NetPIPE-3.7.2/ && make mpi</code>}}
* Reserve 2 cores on 2 separate nodes and enter an interactive session (e.g. cluster dahu in Grenoble):
{{Term|location=fgrenoble|cmd=<code class="command">oarsub -I -p dahu -l /nodes=2/core=1</code>}}


Note: choose an appropriate cluster with an Omni-Path or InfiniBand network connection to compare performance between two nodes using or not using the ucx driver. See the [https://www.grid5000.fr/w/Hardware Grid'5000 Hardware Documentation].

* On node dahu-X: load conda, activate your conda environment and modify $PATH
{{Term|location=dahu|cmd=<code class="command">module load conda</code><br>
<code class="command">conda activate <env_name></code><br>
<code class="command">export PATH=~/SRC/NetPIPE-3.7.2:$PATH</code>}}

* Run MPI without ucx (use standard network):
{{Term|location=dahu|cmd=<code class="command">mpirun -np 2 --machinefile $OAR_NODEFILE --prefix $CONDA_PREFIX --mca plm_rsh_agent oarsh  NPmpi</code>}}
<pre>
0: dahu-3
1: dahu-30
Now starting the main loop
  0:       1 bytes   6400 times -->      0.54 Mbps in      14.21 usec
  1:       2 bytes   7035 times -->      1.07 Mbps in      14.20 usec
  2:       3 bytes   7043 times -->      1.61 Mbps in      14.22 usec
...
116: 4194304 bytes     12 times -->   8207.76 Mbps in    3898.75 usec
117: 4194307 bytes     12 times -->   8161.45 Mbps in    3920.87 usec
...
</pre>
{{Term|location=node|cmd=<code class="command">tgz-g5k</code> /dev/shm/image.tgz}}
* Run MPI with ucx (use rapid network):
Disconnect from the node (exit). From the frontend, copy the image to the public directory:
{{Term|location=dahu|cmd=<code class="command">mpirun -np 2 --machinefile $OAR_NODEFILE --prefix $CONDA_PREFIX --mca plm_rsh_agent oarsh --mca pml ucx --mca osc ucx NPmpi</code>}}
{{Term|location=frontend|cmd=<code class="command">mkdir</code> -p $HOME/public}}
<pre>
{{Term|location=frontend|cmd=<code class="command">scp</code> root@`head -1 $OAR_NODEFILE`:/dev/shm/image.tgz $HOME/public/wheezy-openmpi.tgz}}
0: dahu-3
Copy the description file of wheezy-x64-base:
1: dahu-30
{{Term|location=frontend|cmd=grep -v visibility /grid5000/descriptions/wheezy-x64-base-1.4.dsc > $HOME/public/wheezy-openmpi.dsc}}
Now starting the main loop
Change the image name in the description file; we will use an http URL for multi-site deploiement:
  0:       1 bytes  19082 times -->      1.69 Mbps in      4.50 usec
<pre class="brush: bash">perl -i -pe "s@server:///grid5000/images/wheezy-x64-base-1.4.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/wheezy-openmpi.tgz@" $HOME/public/wheezy-openmpi.dsc
  1:      2 bytes  22201 times -->     3.08 Mbps in       4.95 usec
  2:       3 bytes  20212 times -->      4.46 Mbps in      5.13 usec
...
116: 4194304 bytes    46 times -->  30015.10 Mbps in    1066.13 usec
117: 4194307 bytes    46 times -->  30023.66 Mbps in    1065.83 usec
...
</pre>
</pre>
Now you can terminate the job:
{{Term|location=frontend|cmd=<code class="command">oardel</code> $OAR_JOB_ID}}


= Using other MPI implementations =

Both MPICH and MVAPICH are available as modules on Grid'5000.
Both have the same architecture when it comes to managing networks: they can both be compiled with UCX ''or'' with OFI, and therefore there are two versions of the modules available:
* mpich/{version...}-ofi, which should be used on Omni-Path interconnects.
* mpich/{version...}-ucx, which should be used on InfiniBand interconnects.

When in doubt, prefer the OFI version, as it offers better performance when using the TCP modules.
The MPI implementation should automatically select the "best" network available and use it, but if at some point you want to explicitly set which transport layer to use, there are ways to do so for both OFI and UCX.
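For example, to list the installed MPICH modules and load one of them (a sketch; the exact version string is an assumption and depends on what is currently deployed):
{{Term|location=frontend|cmd=<code class="command">module av mpich/</code><br>
<code class="command">module load</code> <code class="replace">mpich/x.y.z-ofi</code><br>
<code class="command">mpicc</code> ~/mpi/tp.c -o ~/mpi/tp}}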
 
== Choosing the transport layer with OFI ==


OFI uses libfabric to handle the underlying networks, and you can use a couple of variables to filter/select the transport layer to use.
The first one is <code class='command'>FI_PROVIDER</code>.

The command <code class='command'>fi_info -l</code> should provide enough information about the available fabrics.
Typically, you can set it to <code class='command'>psm2</code> to force using Omni-Path, or <code class='command'>tcp</code> to force communication to go through the tcp module (either over IP or over Omni-Path, depending on the interface used).

You can select the interface(s) to use for the tcp module by setting the <code class='command'>FI_TCP_IFACE</code> variable.
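For instance (a minimal sketch; the interface name is an assumption, and depending on the process launcher you may also need to forward these variables explicitly to the remote nodes), you could force the tcp provider on a given interface before launching your program:
{{Term|location=node|cmd=<code class="command">export</code> FI_PROVIDER=tcp<br>
<code class="command">export</code> FI_TCP_IFACE=<code class="replace">eno1</code><br>
<code class="command">mpirun</code> -machinefile $OAR_NODEFILE ~/mpi/tp}}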
{{Term|location=frontend|cmd=<code class="command">uniq</code> $OAR_NODEFILE > nodes}}
Copy it to the first node:
{{Term|location=frontend|cmd=<code class="command">scp</code> nodes mpi@`head -1 nodes`:}}
connect to the first node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> mpi@`head -1 nodes`}}
{{Term|location=node|cmd=<code class="command">cd</code> $HOME/src/mpi/NetPIPE-3.7.1}}
{{Term|location=node|cmd=<code class="command">/usr/local/bin/mpirun</code> -machinefile ~/nodes NPmpi}}


== Choosing the transport layer with UCX ==

Similarly to OFI, you can use an environment variable to tell UCX to use a specific transport layer: <code class='command'>UCX_TLS</code>.

The command <code class='command'>ucx_info -d</code> gives information about the available transport layers.


You can select the interface(s) to use by setting the <code class='command'>UCX_NET_DEVICES</code> variable.
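A quick sketch (the transport and device names below are assumptions; check the output of <code class='command'>ucx_info -d</code> on your nodes before using them):
{{Term|location=node|cmd=<code class="command">export</code> UCX_TLS=tcp,self,sm<br>
<code class="command">export</code> UCX_NET_DEVICES=<code class="replace">eno1</code><br>
<code class="command">mpirun</code> -machinefile $OAR_NODEFILE ~/mpi/tp}}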
= More advanced use cases=
== Running MPI on several sites at once ==
In this section, we are going to execute an MPI application over several Grid'5000 sites. In this example we will use the following sites: Nancy, Sophia and Grenoble, using funk to make the reservation (see the [https://www.grid5000.fr/mediawiki/index.php/Advanced_OAR#Multi-site_jobs_with_OARGrid Advanced OAR] tutorial for more information).
{{Warning|text=Open MPI tries to figure out the best network interface to use at run time. However, the selected network is not always the "production" Grid'5000 network, which is the one routed between sites. In addition, only the TCP implementation will work between sites, as high performance networks are only available from the inside of a site. To ensure the correct network is selected, add the option '''--mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo --mca btl self,vader,tcp --mca pml ^ucx,cm''' to mpirun}}
{{Warning|text=By default, Open MPI may use only the short names of the nodes specified in the nodefile; but to reach Grid'5000 nodes located on different sites, we must use their FQDN names. For Open MPI to correctly use the FQDN names of the nodes, you must add the following option to mpirun: '''--mca orte_keep_fqdn_hostnames t'''}}
The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with '''rsync'''. Suppose that you are connected in Sophia and that you want to copy Sophia's mpi/ directory to Nancy and Grenoble.
{{Term|location=fsophia|cmd=<code class="command">rsync</code> -avz ~/mpi/ nancy.grid5000.fr:mpi/<br>
<code class="command">rsync</code> -avz ~/mpi/ grenoble.grid5000.fr:mpi/}}
(you can also add the ''--delete'' option to remove extraneous files from the mpi directory of Nancy and Grenoble).
Reserve nodes in each site from any frontend with Funk (you can also add options to reserve nodes from specific clusters if you want to):
{{Term|location=frontend|cmd=<code class="command">funk</code> -w 02:00:00 <code class="replace">nancy</code>:2,<code class="replace">grenoble</code>:2,<code class="replace">sophia</code>:2 --no-oargrid -y > funk.out}}
Get the node list using oarstat and copy the list to the first node:
{{Term|location=frontend|cmd=( for s in <code class="replace">nancy</code> <code class="replace">grenoble</code> <code class="replace">sophia</code>;<br/> do<br/><code class="command">ssh</code> $s "<code class="command">oarstat</code> -J -u" &#124; <code class="command">jq</code> '[.[] &#124; select(.state == "Running") &#124; .assigned_network_address[]]';<br/>done;<br/>) &#124; <code class="command">jq</code> -s 'flatten &#124; .[]' &#124; <code class="command">tr</code> -d \" > ~/gridnodes<br>
<code class="command">scp</code> ~/gridnodes $(head -1 ~/gridnodes):}}
Connect to the first node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> $(head -1 ~/gridnodes)}}
And run your MPI application:
{{Term|location=node|cmd=<code class="command">cd</code> ~/mpi/<br>
<code class="command">mpirun</code> -machinefile ~/gridnodes --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo --mca btl self,vader,tcp --mca pml ^ucx,cm --mca orte_keep_fqdn_hostnames t tp}}
== MPI and GPU to GPU communications ==
Direct GPU to GPU exchange of data (without having to transit through system RAM) is available with Open MPI. On Grid'5000, this feature is currently only supported for AMD GPUs on the same node (see {{Bug|13559}} and {{Bug|13540}} for progress on other use cases).
This feature is available using the [[#Use_a_newer_Open_MPI_version_using_modules|module version of OpenMPI]]. We will use the OSU benchmark to demonstrate GPU to GPU communication:
<pre>
module load hip
module load openmpi
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xf osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9/
./configure --enable-rocm --with-rocm=$(hipconfig -p) CC=mpicc CXX=mpicxx LDFLAGS="-L$(ompi_info | grep Prefix: | cut -d': ' -f2)/lib/ -lmpi -L$(hipconfig -p)/lib/ $(hipconfig -C) -lamdhip64" CPPFLAGS="-std=c++11"
make
</pre>
Then run OSU benchmark:
<pre>
mpirun -np 2 --mca pml ucx -x UCX_TLS=self,sm,rocm mpi/pt2pt/osu_bw -d rocm D D
</pre>


The benchmark will report GPU to GPU bandwidth.


See also https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI
{{Term|location=frontend|cmd=<code class="command">kadeploy3</code> -f gridnodes -a $HOME/public/wheezy-openmpi.dsc -k --multi-server -o ~/nodes.deployed}}
{{Term|location=frontend|cmd=<code class="command">scp</code> ~/nodes.deployed mpi@`head -1 ~/nodes.deployed`:}}
connect to the first node:
{{Term|location=frontend|cmd=<code class="command">ssh</code> mpi@`head -1 ~/nodes.deployed`}}
{{Term|location=node|cmd=<code class="command">cd</code> $HOME/src/mpi/}}
{{Term|location=node|cmd=<code class="command">/usr/local/bin/mpirun</code> -machinefile ~/nodes.deployed --mca btl self,sm,tcp --mca opal_net_private_ipv4  "192.168.7.0/24\;192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18\;192.168.200.0/23" tp}}


= FAQ =

== Passing environment variables to nodes ==

While some batch schedulers (e.g. Slurm) transparently pass environment variables from the head node shell to all execution nodes given to <code class=command>mpirun</code>, OAR does not (OAR provides no more than what OpenSSH does, be it when using <code class=command>oarsh</code> or <code class=command>ssh</code> as the connector). OAR thus leaves the responsibility of passing environment variables to mpirun.


Therefore, in order to have more than the default environment variables (<code class=replace>OMPI_*</code> variables) passed/set on the execution nodes, one has several options:
; use the <code class=command>-x </code><code class=replace>VAR</code> option of <code class=command>mpirun</code>, possibly for each variable to pass: Example:
mpirun -machinefile $OAR_NODE_FILE -x MY_ENV1 -x MY_ENV2 -x MY_ENV3="value3" ~/bin/mpi_test
; use the <code class=command>--mca mca_base_env_list "ENV[;...]"</code> option of <code class=command>mpirun</code>: Example:
mpirun -machinefile $OAR_NODE_FILE --mca mca_base_env_list "MY_ENV1;MY_ENV2;MY_ENV3=value3" ~/bin/mpi_test
; set the <code class=command>mca_base_env_list "ENV[;...]"</code> option in the <code class=file>~/.openmpi/mca-params.conf</code> file: This way, passing variables becomes transparent to the <code class=command>mpirun</code> command line, which becomes:
mpirun -machinefile $OAR_NODE_FILE ~/bin/mpi_test
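For reference, a minimal sketch of the corresponding line in <code class=file>~/.openmpi/mca-params.conf</code> (the variable names are just examples, as above):
 mca_base_env_list=MY_ENV1;MY_ENV2;MY_ENV3=value3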


; Note: <code class=command>-x </code> and <code class=command>--mca mca_base_env_list</code> cannot coexist.


This could especially be useful to pass OpenMP variables, such as ''OMP_NUM_THREADS''.
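For instance, reusing the <code class=command>-x</code> option shown above (a sketch; the thread count is arbitrary):
 mpirun -machinefile $OAR_NODE_FILE -x OMP_NUM_THREADS=4 ~/bin/mpi_test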


More info [https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php#toc22 in OpenMPI manual pages].




{{Pages|HPC}}

Latest revision as of 15:40, 5 August 2024

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

MPI is a programming interface that enables the communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorials on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.

Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial). For the second part of this tutorial, you should also know the basics about Multi-site reservation (see the Advanced OAR tutorial).

Running MPI on Grid'5000

When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:

  • Setting up and starting Open MPI on a default environment.
  • Setting up and starting Open MPI to use high performance interconnect.
  • Setting up and starting latest Open MPI library version.
  • Setting up and starting Open MPI to run on several sites using funk.

Using Open MPI on a default environment

The default Grid'5000 environment provides Open MPI 4.1.0 (see ompi_info).

Creating a sample MPI program

For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes and the value it received from the process 0.

In your home directory, create a file ~/mpi/tp.c and copy the source code:

Terminal.png frontend:
mkdir ~/mpi
vi ~/mpi/tp.c
#include <stdio.h>
#include <mpi.h>
#include <time.h> /* for the work function only */
#include <unistd.h>

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int bcast_value = 1;

       gethostname(hostname, sizeof hostname);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast(&bcast_value,1 ,MPI_INT, 0, MPI_COMM_WORLD );
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize();
       return 0;
}

You can then compile your code:

Terminal.png frontend:
mpicc ~/mpi/tp.c -o ~/mpi/tp

Setting up and starting Open MPI on a default environment

Submit a job:

Terminal.png frontend:
oarsub -I -l nodes=3

OAR provides the $OAR_NODEFILE file, containing the list of nodes of the job (one line per CPU core). You can use it to launch your parallel job:

Terminal.png node:
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

You should have something like:

helios-52       - 4 - 12 - 42
helios-51       - 0 - 12 - 42
helios-52       - 5 - 12 - 42
helios-51       - 2 - 12 - 42
helios-52       - 6 - 12 - 42
helios-51       - 1 - 12 - 42
helios-51       - 3 - 12 - 42
helios-52       - 7 - 12 - 42
helios-53       - 8 - 12 - 42
helios-53       - 9 - 12 - 42
helios-53       - 10 - 12 - 42
helios-53       - 11 - 12 - 42

You may have (lot's of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important as we will learn how to select clusters with high performance interconnect in greater details below. Error messages might look like this:

[1637577186.512697] [taurus-10:2765 :0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
[1637577186.512752] [taurus-10:2765 :0]     ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
[taurus-10.lyon.grid5000.fr:02765] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273  Error: Failed to create UCP worker

To tell OpenMPI not try to use high performance hardware and avoid those warnings message, use the following options:

Terminal.png node:
mpirun --mca pml ^ucx -machinefile $OAR_NODEFILE $HOME/mpi_programm
Note.png Note

In case you submitted a job that does not reserve entire nodes (all CPU cores of all nodes), you have to use oarsh as MPI remote shell connector. To do so with Open MPI, you can add --mca orte_rsh_agent "oarsh" to your mpirun command line. Open MPI will then use oarsh in place of ssh.

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp

You can also set an environment variable (usually in your .bashrc):

Terminal.png bashrc:
export OMPI_MCA_orte_rsh_agent=oarsh
Terminal.png node:
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Open MPI also provides a configuration file for --mca parameters. In your home directory, create a file as ~/.openmpi/mca-params.conf

orte_rsh_agent=oarsh
filem_rsh_agent=oarcp

Setting up and starting Open MPI to use high performance interconnect

Open MPI provides several alternative compononents to use High Performance Interconnect hardware (such as Infiniband or Omni-Path).

MCA parameters (--mca) can be used to select the component that are used at run-time by OpenMPI. To learn more about the MCA parameters, see also:

Open MPI packaged in Grid'5000 debian11 includes ucx, ofi and openib components to make use of Infiniband network, and psm2 and ofi to make use of Omnipath network.

If you want some of these components, you can for example use --mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib. This disables all high performance components mentionned above and will force Open MPI to use its TCP backend.


Infiniband network

By default, OpenMPI tries to use Infiniband high performance interconnect using the UCX component. When Infiniband network is available, it will provide best results in most of cases (UCX even uses multiple interfaces at a time when available).

Omni-Path network

For Open MPI to work with Omni-Path network hardware, PSM2 component must be used. It is recommended to explicitly disable other components (otherwise, OpenMPI will try to load other components which may produce error messages or provide lower performance):

Terminal.png node:
mpirun -machinefile $OAR_NODEFILE -mca mtl psm2 -mca pml ^ucx,ofi -mca btl ^ofi,openib ~/mpi/tp

IP over Infiniband or Omni-Path

Nodes with Infiniband or Omni-Path network interfaces also provide an IP over Infiniband interface (these interfaces are named ibX). The TCP backend of Open MPI will try to use them by default.

You can explicitely select interfaces used by the TCP backend using for instance --mca btl_tcp_if_exclude ib0,lo (to avoid using IP over Infiniband and local interfaces) or --mca btl_tcp_if_include eno2 (to force using the 'regular' Ethernet interface eno2).


Benckmarking

We will be using OSU micro benchmark to check the performances of high performance interconnects.

To download, extract and compile our benchmark, do:

Terminal.png frontend:
cd ~/mpi

wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.8.tgz
tar xf osu-micro-benchmarks-5.8.tgz
cd osu-micro-benchmarks-5.8/
./configure CC=$(which mpicc) CXX=$(which mpicxx)

make

As we will benchmark two MPI processes, reserve only one core in two distinct nodes.

Terminal.png frontend:
oarsub -I -l nodes=2

To start the network benchmark, use:

Terminal.png node:
mpirunmpirun -machinefile $OAR_NODEFILE -npernode 1 ~/mpi/osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency

The option `-npernode 1` tells to only spawn one process on each node, as the benchmark requires only two processes to communicate.

You can then try to compare performance of various network hardware available on Grid'5000. See for instance Network section of Hardware page.

OAR can select nodes according to properties related to network performance. For example:

  • To reserve one core of two distinct nodes with a 56Gbps InfiniBand interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "ib_rate=56"
  • To reserve one core of two distinct nodes with a 100Gbps Omni-Path interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "opa_rate=100"

Use a newer Open MPI version using modules

Newer Open MPI versions are available as module. To list available versions, use this command.

Terminal.png frontend:
module av openmpi/
[...]
openmpi/4.1.4_gcc-10.4.0
[...]

This will load chosen Open MPI to execution your environment:

Terminal.png frontend:
module load openmpi/4.1.4_gcc-10.4.0
mpirun --version

You must recompile simple MPI example on the frontend with this new version.

Terminal.png frontend:
mpicc ~/mpi/tp.c -o ~/mpi/tp

From your job, you must ensure that the same Open MPI version is used on every node:

Terminal.png frontend:
oarsub -I -l nodes=3
Terminal.png node:
module load openmpi/4.1.4_gcc-10.4.0
$(which mpirun) -machinefile $OAR_NODEFILE ~/mpi/tp

Note that the $(which mpirun) command is used in this last step to ensure that the mpirun binary from the module environment is used.

Installing Open MPI in a conda environment

Here is an example of installing Open MPI in a conda environment, built with UCX so as to take advantage of the cluster's high-bandwidth, low-latency interconnect.

This example is run on the Grenoble site.

UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.

Installation
  • Install OpenMPI, ucx, and GCC from conda-forge channel in the <env_name> environment
Terminal.png fgrenoble:
conda create -y -n <env_name>

conda activate <env_name>

conda install -c conda-forge gcc_linux-64 openmpi ucx

Note: do not forget to create the dedicated environment first, before installing the packages into it.

Test installation via NetPIPE
  • Install NetPIPE to test latency and network throughput (NetPIPE is not available as a conda package)
Terminal.png fgrenoble:
cd $HOME ; mkdir SRC && cd SRC

wget https://src.fedoraproject.org/lookaside/pkgs/NetPIPE/NetPIPE-3.7.2.tar.gz/653071f785404bb68f8aaeff89fb1f33/NetPIPE-3.7.2.tar.gz
tar zvxf NetPIPE-3.7.2.tar.gz

cd NetPIPE-3.7.2/ && make mpi
  • Reserve one core on each of two separate nodes and enter an interactive session (e.g. on the dahu cluster in Grenoble):
Terminal.png fgrenoble:
oarsub -I -p dahu -l /nodes=2/core=1

Note: choose an appropriate cluster with an Omni-Path or InfiniBand network connection, so that you can compare the performance between two nodes with and without the UCX driver. See the Grid'5000 Hardware documentation.

  • On node dahu-X: load conda, activate your conda environment, and add NetPIPE to your $PATH
Terminal.png dahu:
module load conda

conda activate <env_name>

export PATH=~/SRC/NetPIPE-3.7.2:$PATH
  • Run MPI without UCX (using the standard network):
Terminal.png dahu:
mpirun -np 2 --machinefile $OAR_NODEFILE --prefix $CONDA_PREFIX --mca plm_rsh_agent oarsh NPmpi
0: dahu-3
1: dahu-30
Now starting the main loop
  0:       1 bytes   6400 times -->      0.54 Mbps in      14.21 usec
  1:       2 bytes   7035 times -->      1.07 Mbps in      14.20 usec
  2:       3 bytes   7043 times -->      1.61 Mbps in      14.22 usec
...
116: 4194304 bytes     12 times -->   8207.76 Mbps in    3898.75 usec
117: 4194307 bytes     12 times -->   8161.45 Mbps in    3920.87 usec
...
  • Run MPI with UCX (using the fast network):
Terminal.png dahu:
mpirun -np 2 --machinefile $OAR_NODEFILE --prefix $CONDA_PREFIX --mca plm_rsh_agent oarsh --mca pml ucx --mca osc ucx NPmpi
0: dahu-3
1: dahu-30
Now starting the main loop
  0:       1 bytes  19082 times -->      1.69 Mbps in       4.50 usec
  1:       2 bytes  22201 times -->      3.08 Mbps in       4.95 usec
  2:       3 bytes  20212 times -->      4.46 Mbps in       5.13 usec
...
116: 4194304 bytes     46 times -->  30015.10 Mbps in    1066.13 usec
117: 4194307 bytes     46 times -->  30023.66 Mbps in    1065.83 usec
...


Using other MPI implementations

Both MPICH and MVAPICH are available as modules on Grid'5000. They share the same architecture when it comes to managing networks: both can be compiled with UCX or with OFI, and therefore two versions of each module are available:

 - mpich/{version...}-ofi, which should be used on Omni-Path interconnects.
 - mpich/{version...}-ucx, which should be used on InfiniBand interconnects.

When in doubt, prefer the OFI version, as it offers better performance when using the TCP module. The MPI implementation should automatically select the "best" available network and use it, but if at some point you want to explicitly choose which transport layer to use, there are ways to do so for both OFI and UCX.
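For example, you can list the MPICH and MVAPICH modules available on the frontend before loading the variant that matches your interconnect (the version numbers shown by module av will differ over time):

Terminal.png frontend:
module av mpich mvapich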

Choosing the transport layer with OFI

OFI relies on libfabric to handle the underlying networks, and you can use a couple of environment variables to filter or select the transport layer to use. The first one is FI_PROVIDER.

The fi_info -l command should provide enough information about the available fabrics. Typically, you can set FI_PROVIDER to psm2 to force the use of Omni-Path, or to tcp to force communication to go through the tcp provider (either over IP or over Omni-Path, depending on the interface used).

You can select the interface(s) to use for the tcp module by setting the FI_TCP_IFACE variable.
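As a sketch (assuming MPICH's Hydra launcher, where -f takes a host file and -genv sets an environment variable for all ranks; the provider and interface names here are illustrative):

Terminal.png node:
mpiexec -f $OAR_NODEFILE -genv FI_PROVIDER tcp -genv FI_TCP_IFACE eno2 ~/mpi/tp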

Choosing the transport layer with UCX

Similarly to OFI, you can use an environment variable, UCX_TLS, to tell UCX to use a specific transport layer.

The ucx_info -d command gives information about the available transport layers.

You can select the interface(s) to use by setting the UCX_NET_DEVICES variable.
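For instance (again a sketch assuming the Hydra launcher; adapt the transport list and device name to the output of ucx_info -d):

Terminal.png node:
mpiexec -f $OAR_NODEFILE -genv UCX_TLS rc,sm,self -genv UCX_NET_DEVICES mlx5_0:1 ~/mpi/tp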

More advanced use cases

Running MPI on several sites at once

In this section, we are going to run an MPI application over several Grid'5000 sites. In this example we will use the following sites: Nancy, Grenoble and Sophia, using funk to make the reservation (see the Advanced OAR tutorial for more information).

Warning.png Warning

Open MPI tries to figure out the best network interface to use at run time. However, the selected networks are not always the "production" Grid'5000 network, which is the network routed between sites. In addition, only the TCP implementation will work between sites, as high performance networks are only available from inside a site. To ensure the correct network is selected, add the following options to mpirun: --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo --mca btl self,vader,tcp --mca pml ^ucx,cm

Warning.png Warning

By default, Open MPI may use only the short names of the nodes listed in the node file; but to reach Grid'5000 nodes located on different sites, the FQDNs must be used. For Open MPI to correctly use the FQDNs of the nodes, you must add the following option to mpirun: --mca orte_keep_fqdn_hostnames t

The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with rsync. Suppose that you are connected in Sophia and that you want to copy Sophia's mpi/ directory to Nancy and Grenoble.

Terminal.png fsophia:
rsync -avz ~/mpi/ nancy.grid5000.fr:mpi/
rsync -avz ~/mpi/ grenoble.grid5000.fr:mpi/

(you can also add the --delete option to remove extraneous files from the mpi directories of Nancy and Grenoble).

Reserve nodes in each site from any frontend with Funk (you can also add options to reserve nodes from specific clusters if you want to):

Terminal.png frontend:
funk -w 02:00:00 nancy:2,grenoble:2,sophia:2 --no-oargrid -y > funk.out

Get the node list using oarstat and copy the list to the first node:

Terminal.png frontend:
( for s in nancy grenoble sophia;
do
ssh $s "oarstat -J -u" | jq '[.[] | select(.state == "Running") | .assigned_network_address[]]';
done;
) | jq -s 'flatten | .[]' | tr -d \" > ~/gridnodes
scp ~/gridnodes $(head -1 ~/gridnodes):

Connect to the first node:

Terminal.png frontend:
ssh $(head -1 ~/gridnodes)

And run your MPI application:

Terminal.png node:
cd ~/mpi/
mpirun -machinefile ~/gridnodes --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo --mca btl self,vader,tcp --mca pml ^ucx,cm --mca orte_keep_fqdn_hostnames t tp

MPI and GPU to GPU communications

Direct GPU to GPU exchange of data (without having to transit through system RAM) is available with Open MPI. On Grid'5000, this feature is currently only supported for AMD GPUs on the same node (see bug #13559 and bug #13540 for progress on other use cases).

This feature is available using the module version of Open MPI. We will use the OSU benchmark to demonstrate GPU to GPU communication:

module load hip
module load openmpi

wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xf osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9/

./configure --enable-rocm --with-rocm=$(hipconfig -p) CC=mpicc CXX=mpicxx LDFLAGS="-L$(ompi_info | grep Prefix: | cut -d': ' -f2)/lib/ -lmpi -L$(hipconfig -p)/lib/ $(hipconfig -C) -lamdhip64" CPPFLAGS="-std=c++11"
make

Then run OSU benchmark:

mpirun -np 2 --mca pml ucx -x UCX_TLS=self,sm,rocm mpi/pt2pt/osu_bw -d rocm D D

The benchmark will report GPU to GPU bandwidth.

See also https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI

FAQ

Passing environment variables to nodes

While some batch schedulers (e.g. Slurm) transparently pass environment variables from the head node shell to all the execution nodes given to mpirun, OAR does not (OAR provides no more than what OpenSSH does, whether oarsh or ssh is used as the connector). OAR therefore leaves the responsibility of passing environment variables to mpirun.

Therefore, in order to have more than the default environment variables (the OMPI_* variables) passed to or set on the execution nodes, there are several options:

  • Use the -x VAR option of mpirun, possibly repeated for each variable to pass. Example:
mpirun -machinefile $OAR_NODE_FILE -x MY_ENV1 -x MY_ENV2 -x MY_ENV3="value3" ~/bin/mpi_test
  • Use the --mca mca_base_env_list "ENV[;...]" option of mpirun. Example:
mpirun -machinefile $OAR_NODE_FILE --mca mca_base_env_list "MY_ENV1;MY_ENV2;MY_ENV3=value3" ~/bin/mpi_test
  • Set the mca_base_env_list "ENV[;...]" option in the ~/.openmpi/mca-params.conf file. This way, passing variables becomes transparent to the mpirun command line, which becomes:
mpirun -machinefile $OAR_NODE_FILE ~/bin/mpi_test
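For instance, the corresponding line in ~/.openmpi/mca-params.conf could look like this (a sketch using the same illustrative variable names as above):
mca_base_env_list = MY_ENV1;MY_ENV2;MY_ENV3=value3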
Note: -x and --mca mca_base_env_list cannot coexist.

This can be especially useful to pass OpenMP variables, such as OMP_NUM_THREADS.
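For example, a hybrid MPI+OpenMP run could forward the thread count to every node like this (reusing the illustrative mpi_test binary from the examples above):
mpirun -machinefile $OAR_NODE_FILE -x OMP_NUM_THREADS=4 ~/bin/mpi_test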

More information can be found in the Open MPI manual pages.