Run MPI On Grid'5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Introduction
MPI is a programming interface that enables communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.
Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial). For the second part of this tutorial, you should also know the basics about OARGRID (see the Advanced OAR tutorial).
Running MPI on Grid'5000
When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:
- Setting up and starting Open MPI on a default environment using oarsh.
- Setting up and starting Open MPI on a default environment using the allow_classic_ssh option.
- Setting up and starting Open MPI to use high performance interconnects.
- Setting up and starting the latest Open MPI library version.
- Setting up and starting Open MPI to run on several sites using oargridsub.
Using Open MPI on a default environment
The default Grid'5000 environment provides Open MPI 3.1.3 (see ompi_info).
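For instance, you can check the installed version from a frontend or node with something like (the exact output layout may differ):

frontend : ompi_info | grep "Open MPI:"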
Creating a sample MPI program
For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes, and the value it received from process 0.
In your home directory, create a file ~/mpi/tp.c and copy the source code below:
#include <stdio.h>
#include <mpi.h>
#include <time.h> /* for the work function only */
#include <unistd.h>

int main (int argc, char *argv[]) {
  char hostname[257];
  int size, rank;
  int bcast_value = 1;

  gethostname(hostname, sizeof hostname);
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
  MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
  if (!rank) {
    bcast_value = 42;                    /* only rank 0 sets the value to broadcast */
  }
  MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
  fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
You can then compile your code:
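For example, using the mpicc wrapper of the default Open MPI installation (~/mpi/tp is just a chosen output path):

frontend : mpicc ~/mpi/tp.c -o ~/mpi/tp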
Setting up and starting Open MPI on a default environment using oarsh
Submit a job:
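For example, an interactive job on three nodes (adjust the resource request to your needs):

frontend : oarsub -I -l nodes=3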
oarsh is the remote shell connector of the OAR batch scheduler. It is a wrapper around the ssh command that handles the configuration of the SSH environment. You can connect to the reserved nodes using oarsh from the submission frontend of the cluster or from any node. As Open MPI defaults to using ssh for remote startup of processes, you need to add the option --mca orte_rsh_agent "oarsh" to your mpirun command line. Open MPI will then use oarsh in place of ssh.
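A typical launch line, reusing the machine file provided by OAR and the tp binary compiled above, could look like:

node : mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp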
You can also set an environment variable (usually in your .bashrc):
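Open MPI reads MCA parameters from OMPI_MCA_* environment variables, so a line like the following should have the same effect (a sketch; behaviour may vary across Open MPI versions):

export OMPI_MCA_orte_rsh_agent=oarsh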
Open MPI also provides a configuration file for --mca parameters. In your home directory, create a file ~/.openmpi/mca-params.conf with the following content:
orte_rsh_agent=oarsh
filem_rsh_agent=oarcp
Whichever method you use, running the program should produce output like:
helios-52 - 4 - 12 - 42
helios-51 - 0 - 12 - 42
helios-52 - 5 - 12 - 42
helios-51 - 2 - 12 - 42
helios-52 - 6 - 12 - 42
helios-51 - 1 - 12 - 42
helios-51 - 3 - 12 - 42
helios-52 - 7 - 12 - 42
helios-53 - 8 - 12 - 42
helios-53 - 9 - 12 - 42
helios-53 - 10 - 12 - 42
helios-53 - 11 - 12 - 42
You may get (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will learn how to select clusters with high performance interconnects in greater detail below. Error messages might look like this:
[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: helios-8.sophia.grid5000.fr

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
or like this:
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...
You can avoid those warnings by using the following options:
node : mpirun --mca orte_rsh_agent "oarsh" --mca btl openib,sm,self --mca pml ^cm -machinefile $OAR_NODEFILE $HOME/mpi_program
- For other clusters, you may use the following options:
  - --mca pml ob1 --mca btl tcp,self
  - --mca btl ^openib
  - --mca btl ^mx
Setting up and starting Open MPI on a default environment using allow_classic_ssh
If you prefer using ssh as a connector instead of oarsh, submit a job with the allow_classic_ssh type:
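For example, an interactive reservation of three nodes with this job type could look like:

frontend : oarsub -I -t allow_classic_ssh -l nodes=3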
Launch your parallel job:
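With the classic ssh connector, no oarsh-specific option is needed; a launch line could simply be:

node : mpirun -machinefile $OAR_NODEFILE ~/mpi/tp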
Setting up and starting Open MPI to use high performance interconnect
By default, Open MPI tries to use any high performance interconnect (e.g. Infiniband, Omni-Path) it can find. Options are available to either select or disable an interconnect:
MCA parameters (--mca) can be used to select the drivers that are used at run-time by Open MPI. To learn more about the MCA parameters, see also:
- The Open MPI FAQ about tuning parameters
- How do I tell Open MPI which IP interfaces / networks to use?
- The Open MPI documentation about OpenFabrics (ie: Infiniband)
To learn more about specific Omni-Path tools, refer to this page.
If you want to disable native support for high performance networks, use --mca btl self,sm,tcp --mca mtl ^psm2,ofi. The first part restricts the point-to-point BTLs to self, shared memory and TCP (which disables the openib backend), and the second part disables the psm2 and ofi MTLs (used by Omni-Path). Open MPI will then fall back to its TCP backend.
Nodes with InfiniBand or Omni-Path interfaces also provide an IP over InfiniBand interface (these interfaces are named ibX), which can still be used by the TCP backend. To also disable their use, add --mca btl_tcp_if_exclude ib0,lo, or select a specific interface with --mca btl_tcp_if_include eno2, to make sure that the regular Ethernet interface is used.
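Putting the pieces together, a launch line that sticks to plain TCP over the regular Ethernet interface could look like this (a sketch; interface names such as eno2 or ib0 depend on the cluster):

node : mpirun --mca orte_rsh_agent "oarsh" --mca btl self,sm,tcp --mca mtl ^psm2,ofi --mca btl_tcp_if_exclude ib0,lo -machinefile $OAR_NODEFILE ~/mpi/tp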
We will use the OSU micro-benchmarks to check the performance of high performance interconnects.
To download, extract and compile our benchmark, do:
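A possible sequence, assuming version 5.3.2 of the OSU micro-benchmarks (check the OSU website for the current release and download URL):

wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz
tar xzf osu-micro-benchmarks-5.3.2.tar.gz
cd osu-micro-benchmarks-5.3.2
./configure CC=mpicc CXX=mpicxx && make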
As we will benchmark two MPI processes, reserve only one core on each of two distinct nodes. If your reservation includes more resources, you will have to create an MPI machine file with only two entries, as follows:
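For instance, the following keeps one entry per node and only the first two nodes (~/machinefile is just a chosen file name for this tutorial):

node : uniq $OAR_NODEFILE | head -n 2 > ~/machinefile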
Infiniband hardware is available on several sites. For example, you will find clusters with Infiniband interconnect in Rennes (20G), Nancy (20G) and Grenoble (20G & 40G).
To reserve one core on each of two distinct nodes with one of the following interconnects (an example oarsub command is sketched after this list):
- a 20G InfiniBand interconnect (DDR, Double Data Rate):
- a 40G InfiniBand interconnect (QDR, Quad Data Rate):
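The exact command depends on the OAR properties exposed by each site; as a sketch, assuming an ib_rate property giving the InfiniBand rate in Gbit/s (check the Hardware pages or the reference API for the properties actually available), a 20G reservation could look like:

frontend : oarsub -I -l /nodes=2/core=1 -p "ib_rate='20'"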
To check if the support for InfiniBand is available in Open MPI, run:
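For example:

node : ompi_info | grep openib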
You should see something like this:
MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v3.1.3)
To start the network benchmark, use:
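A possible invocation, reusing the two-entry ~/machinefile above and the osu_latency binary from the build tree (the exact path depends on the benchmark version):

node : mpirun --mca orte_rsh_agent "oarsh" -machinefile ~/machinefile ~/osu-micro-benchmarks-5.3.2/mpi/pt2pt/osu_latency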
Specificity of the uva and uvb clusters in Sophia
uva and uvb are connected to a 40G QDR InfiniBand network. This network uses a partition key (PKEY).
As a result, if you want to use MPI over this InfiniBand network, you have to specify the partition key (which is 0x8100) with the following option: --mca btl_openib_pkey "0x8100"
Use a newer Open MPI version using modules
If you need a more recent Open MPI version, you should use the module command.
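To list the Open MPI versions available through modules, you can run:

frontend : module av openmpi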
----------------------- /grid5000/spack/share/spack/modules/linux-debian9-x86_64 -----------------------
[...]
openmpi/3.1.3_gcc-6.4.0
openmpi/4.0.1_gcc-6.4.0
[...]
You must then recompile the simple MPI example on the frontend with this new version:
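For example, with the 4.0.1 module listed above (a sketch; adapt the module name to what module av reports):

frontend : module load openmpi/4.0.1_gcc-6.4.0
frontend : mpicc ~/mpi/tp.c -o ~/mpi/tp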
You must then submit a job and load the same Open MPI library version on the node.
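For instance:

frontend : oarsub -I -l nodes=3
node : module load openmpi/4.0.1_gcc-6.4.0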
The last step is to run the simple example. As the module environment is lost through an SSH connection, we use the $(which mpirun) command so that the freshly loaded mpirun binary is used.
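A launch line could then look like:

node : $(which mpirun) --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp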
More advanced use cases
Running MPI on several sites at once
In this section, we are going to run an MPI application over several Grid'5000 sites. In this example we will use the following sites: Nancy, Grenoble and Sophia, using oargrid to make the reservation (see the Advanced OAR tutorial for more information).
The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with rsync. Suppose that you are connected to Sophia's frontend and that you want to copy Sophia's mpi/ directory to Nancy and Grenoble.
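Assuming the other frontends are reachable as nancy and grenoble from Sophia's frontend, the copy could look like:

frontend : rsync -avz ~/mpi/ nancy:mpi/
frontend : rsync -avz ~/mpi/ grenoble:mpi/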
(you can also add the --delete option to remove extraneous files from the mpi directory of Nancy and Grenoble).
Reserve nodes in each site from any frontend with oargridsub (you can also add options to reserve nodes from specific clusters if you want to):
frontend : oargridsub -w 02:00:00 nancy:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out
Get the oargrid Id and Job key from the output of oargridsub:
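The exact output format of oargridsub may vary, but as a sketch the two values can be extracted with something like:

frontend : export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)
frontend : export OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " ")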
Get the node list using oargridstat and copy the list to the first node:
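A possible way to do it, assuming oargridstat -w -l prints the node list for the grid job and using ~/gridnodes as a chosen file name:

frontend : oargridstat -w -l $OARGRID_JOB_ID | sed '/^$/d' > ~/gridnodes
frontend : oarcp ~/gridnodes $(head -1 ~/gridnodes):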
Connect to the first node:
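For example:

frontend : oarsh $(head -1 ~/gridnodes)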
And run your MPI application:
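Using the node list copied above, the launch could look like:

node : mpirun -machinefile ~/gridnodes --mca orte_rsh_agent "oarsh" ~/mpi/tp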
FAQ
Passing environment variables to nodes
While some batch schedulers (e.g. Slurm) do some tricks to transparently pass environment variables from the head node shell to all execution nodes given to mpirun, OAR does not (OAR provides no more than what OpenSSH does, whether it is used directly when oarsub is called with -t allow_classic_ssh or through the oarsh wrapper). It is therefore left to mpirun to do the job, and it knows how to.
Therefore, in order to have more than the default environment variables (OMPI_* variables) passed/set on the execution nodes, one has several options:
- Use the -x VAR option of mpirun, possibly repeated for each variable to pass (warning: -x is deprecated). Example:
  mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" -x MY_ENV1 -x MY_ENV2 -x MY_ENV3="value3" ~/bin/mpi_test
- Use the --mca mca_base_env_list "ENV[;...]" option of mpirun. Example:
  mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" --mca mca_base_env_list "MY_ENV1;MY_ENV2;MY_ENV3=value3" ~/bin/mpi_test
- Set the mca_base_env_list "ENV[;...]" option in the ~/.openmpi/mca-params.conf file. This way, passing variables becomes transparent to the mpirun command line, which becomes:
  mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" ~/bin/mpi_test

Notes:
- orte_rsh_agent="oarsh" can be set in the ~/.openmpi/mca-params.conf configuration file as well (but only if using oarsh as the connector).
- -x and --mca mca_base_env_list cannot coexist.
This can be especially useful to pass OpenMP variables, such as OMP_NUM_THREADS.
More information can be found in the Open MPI manual pages.