Run MPI On Grid'5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Introduction
MPI is a programming interface that enables communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.
Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial). For the second part of this tutorial, you should also know the basics about Multi-site reservation (see the Advanced OAR tutorial).
Running MPI on Grid'5000
When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:
- Setting up and starting Open MPI on a default environment.
- Setting up and starting Open MPI to use a high performance interconnect.
- Setting up and starting the latest Open MPI library version.
- Setting up and starting Open MPI to run on several sites using funk.
Using Open MPI on a default environment
The default Grid'5000 environment provides Open MPI 4.1.0 (see `ompi_info`).
Creating a sample MPI program
For the purposes of this tutorial, we create a simple MPI program in which the process of rank 0 broadcasts an integer (42) to all the other processes. Each process then prints its rank, the total number of processes, and the value it received from process 0.
In your home directory, create a file `~/mpi/tp.c` and copy the source code:
#include <stdio.h>
#include <mpi.h>
#include <time.h> /* for the work function only */
#include <unistd.h>

int main (int argc, char *argv []) {
    char hostname[257];
    int size, rank;
    int bcast_value = 1;

    gethostname(hostname, sizeof hostname);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (!rank) {
        bcast_value = 42;
    }
    MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
You can then compile your code:
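Assuming the standard `mpicc` compiler wrapper provided by Open MPI, the compilation can be done as follows (the output path `~/mpi/tp` matches the binary used by the `mpirun` commands later on this page):

```shell
# Build the example with Open MPI's C compiler wrapper
mpicc ~/mpi/tp.c -o ~/mpi/tp
```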
Setting up and starting Open MPI on a default environment
Submit a job:
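For instance, an interactive job spanning three nodes (as in the sample output shown below) can be obtained with:

```shell
# Interactive job on 3 nodes of the current site
oarsub -I -l nodes=3
```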
OAR provides the $OAR_NODEFILE file, containing the list of nodes of the job (one line per CPU core). You can use it to launch your parallel job:
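A minimal launch command then looks like:

```shell
# Start one MPI process per core listed in $OAR_NODEFILE
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp
```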
You should have something like:
helios-52  - 4 - 12 - 42
helios-51  - 0 - 12 - 42
helios-52  - 5 - 12 - 42
helios-51  - 2 - 12 - 42
helios-52  - 6 - 12 - 42
helios-51  - 1 - 12 - 42
helios-51  - 3 - 12 - 42
helios-52  - 7 - 12 - 42
helios-53  - 8 - 12 - 42
helios-53  - 9 - 12 - 42
helios-53  - 10 - 12 - 42
helios-53  - 11 - 12 - 42
You may see (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will learn how to select clusters with a high performance interconnect in greater detail below. Error messages might look like this:
[1637577186.512697] [taurus-10:2765 :0] rdmacm_cm.c:638 UCX ERROR rdma_create_event_channel failed: No such device
[1637577186.512752] [taurus-10:2765 :0] ucp_worker.c:1432 UCX ERROR failed to open CM on component rdmacm with status Input/output error
[taurus-10.lyon.grid5000.fr:02765] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273 Error: Failed to create UCP worker
To tell Open MPI not to try to use the high performance hardware and to avoid those warning messages, use the following options:
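Using the component exclusions detailed in the next section, such a command looks like:

```shell
# Disable the high performance interconnect components; fall back to TCP
mpirun -machinefile $OAR_NODEFILE --mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib ~/mpi/tp
```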
Setting up and starting Open MPI to use high performance interconnect
Open MPI provides several alternative components to use High Performance Interconnect hardware (such as InfiniBand or Omni-Path).
MCA parameters (`--mca`) can be used to select the components that are used at run-time by Open MPI. To learn more about MCA parameters, see also:
- The Open MPI FAQ about tuning parameters
- How do I tell Open MPI which IP interfaces / networks to use?
- The Open MPI documentation about OpenFabrics (i.e. InfiniBand)
- The Open MPI documentation about Omni-Path, and the Intel documentation about Omni-Path tools.
The Open MPI packaged in Grid'5000's debian11 environment includes the ucx, ofi and openib components to make use of an InfiniBand network, and psm2 and ofi to make use of an Omni-Path network.
If you want to disable some of these components, you can for example use `--mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib`. This disables all the high performance components mentioned above and forces Open MPI to use its TCP backend.
Infiniband network
By default, Open MPI tries to use the InfiniBand high performance interconnect through the UCX component. When an InfiniBand network is available, it provides the best results in most cases (UCX even uses multiple interfaces at a time when available).
Omni-Path network
For Open MPI to work with Omni-Path network hardware, the PSM2 component must be used. It is recommended to explicitly disable the other components (otherwise, Open MPI will try to load them, which may produce error messages or degrade performance):
node:
mpirun -machinefile $OAR_NODEFILE -mca mtl psm2 -mca pml ^ucx,ofi -mca btl ^ofi,openib ~/mpi/tp
IP over Infiniband or Omni-Path
Nodes with InfiniBand or Omni-Path network interfaces also provide an IP over InfiniBand interface (these interfaces are named `ibX`). The TCP backend of Open MPI will try to use them by default.
You can explicitly select the interfaces used by the TCP backend, using for instance `--mca btl_tcp_if_exclude ib0,lo` (to avoid using IP over InfiniBand and local interfaces) or `--mca btl_tcp_if_include eno2` (to force using the 'regular' Ethernet interface eno2).
Benchmarking
We will use the OSU micro benchmarks to measure the performance of high performance interconnects.
To download, extract and compile our benchmark, do:
frontend:
cd ~/mpi
make
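The download and extraction steps are elided above; a plausible full sequence is sketched below (the tarball URL is an assumption, patterned after the osu-micro-benchmarks 5.9 URL used in the GPU section of this page; the 5.8 directory name matches the benchmark path used below):

```shell
# Fetch and build the OSU micro benchmarks (URL is an assumption)
cd ~/mpi
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.8.tar.gz
tar xf osu-micro-benchmarks-5.8.tar.gz
cd osu-micro-benchmarks-5.8/
./configure CC=mpicc CXX=mpicxx
make
```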
As we will benchmark two MPI processes, reserve only one core on each of two distinct nodes.
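With OAR, such a reservation can be expressed as:

```shell
# One core on each of two distinct nodes, interactive session
oarsub -I -l /nodes=2/core=1
```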
To start the network benchmark, use:
node:
mpirun -machinefile $OAR_NODEFILE -npernode 1 ~/mpi/osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency
The option `-npernode 1` tells mpirun to spawn only one process on each node, as the benchmark requires exactly two communicating processes.
You can then compare the performance of the various network hardware available on Grid'5000. See for instance the Network section of the Hardware page.
OAR can select nodes according to properties related to network performance. For example:
- To reserve one core of two distinct nodes with a 56Gbps InfiniBand interconnect:
- To reserve one core of two distinct nodes with a 100Gbps Omni-Path interconnect:
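Sketches of both reservations; the OAR property names (`ib_rate`, `opa_rate`) are assumptions and should be checked against the properties documented on the Hardware page:

```shell
# 56Gbps InfiniBand nodes (property name is an assumption)
oarsub -I -l /nodes=2/core=1 -p "ib_rate=56"

# 100Gbps Omni-Path nodes (property name is an assumption)
oarsub -I -l /nodes=2/core=1 -p "opa_rate=100"
```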
Use a newer Open MPI version using modules
Newer Open MPI versions are available as modules. To list the available versions, use this command:
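Assuming the standard environment-modules tooling available on Grid'5000 frontends:

```shell
# List the available Open MPI modules
module av openmpi
```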
[...] openmpi/4.1.4_gcc-10.4.0 [...]
This will load the chosen Open MPI into your execution environment:
You must recompile the simple MPI example on the frontend with this new version.
From your job, you must ensure that the same Open MPI version is used on every node:
Note that the `$(which mpirun)` command is used in this last step to ensure that the mpirun from the module environment is used.
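Put together, the whole sequence might look like this sketch (module version taken from the listing above):

```shell
# On the frontend: load the module and recompile the example
module load openmpi/4.1.4_gcc-10.4.0
mpicc ~/mpi/tp.c -o ~/mpi/tp

# From the job's head node: load the same module, then launch using the
# module's full mpirun path so every node runs the same version
module load openmpi/4.1.4_gcc-10.4.0
$(which mpirun) -machinefile $OAR_NODEFILE ~/mpi/tp
```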
Installing Open MPI in a conda environment
Here's an example of installing Open MPI in a conda environment, optimized with ucx to use the cluster's high-bandwidth, low-latency interconnect.
We work on the Grenoble site.
UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.
- Installation
- Install Open MPI, ucx, and GCC from the conda-forge channel in the `<env_name>` environment
fgrenoble:
conda create -y -n <env_name>
conda install -y -n <env_name> -c conda-forge gcc_linux-64 openmpi ucx
Note: do not forget to create the dedicated environment beforehand.
- Test installation via NetPIPE
- Install NetPIPE to test latency and network throughput (NetPIPE is not available as a conda package)
fgrenoble:
cd $HOME ; mkdir SRC && cd SRC
cd NetPIPE-3.7.2/ && make mpi
- Reserve 2 cores on 2 separate nodes and enter an interactive session (e.g. cluster dahu in Grenoble):
Note: choose an appropriate cluster with an Omni-Path or InfiniBand network connection to compare performance between two nodes with and without the ucx driver. See the Grid'5000 Hardware Documentation.
- On node dahu-X: load conda, activate your conda environment, and modify $PATH
- Run MPI without ucx (use standard network):
dahu:
mpirun -np 2 --machinefile $OAR_NODEFILE --prefix $CONDA_PREFIX --mca plm_rsh_agent oarsh NPmpi
0: dahu-3
1: dahu-30
Now starting the main loop
  0:       1 bytes   6400 times -->     0.54 Mbps in      14.21 usec
  1:       2 bytes   7035 times -->     1.07 Mbps in      14.20 usec
  2:       3 bytes   7043 times -->     1.61 Mbps in      14.22 usec
...
116: 4194304 bytes     12 times -->  8207.76 Mbps in    3898.75 usec
117: 4194307 bytes     12 times -->  8161.45 Mbps in    3920.87 usec
...
- Run MPI with ucx (use rapid network):
dahu:
mpirun -np 2 --machinefile $OAR_NODEFILE --prefix $CONDA_PREFIX --mca plm_rsh_agent oarsh --mca pml ucx --mca osc ucx NPmpi
0: dahu-3
1: dahu-30
Now starting the main loop
  0:       1 bytes  19082 times -->     1.69 Mbps in       4.50 usec
  1:       2 bytes  22201 times -->     3.08 Mbps in       4.95 usec
  2:       3 bytes  20212 times -->     4.46 Mbps in       5.13 usec
...
116: 4194304 bytes     46 times --> 30015.10 Mbps in    1066.13 usec
117: 4194307 bytes     46 times --> 30023.66 Mbps in    1065.83 usec
...
Using other MPI implementations
Both MPICH and MVAPICH are available as modules on Grid'5000. Both have the same architecture when it comes to managing networks: they can be compiled with UCX *or* with OFI, and therefore two versions of each module are available:
- mpich/{version...}-ofi, which should be used on Omni-Path interconnects.
- mpich/{version...}-ucx, which should be used on InfiniBand interconnects.
When in doubt, prefer the OFI version, as it offers better performance when using the TCP modules. The MPI implementation should automatically select the "best" available network and use it, but if at some point you want to explicitly set which transport layer to use, there are ways to do so for both OFI and UCX.
Choosing the transport layer with OFI
OFI uses libfabric to handle the underlying networks, and you can use a couple of environment variables to filter or select the transport layer to use.
The first one is `FI_PROVIDER`. The command `fi_info -l` should provide enough information about the available fabrics. Typically, you can set it to `psm2` to force using Omni-Path, or `tcp` to force communication to go through the tcp module (either over IP or over Omni-Path, depending on the interface used).
You can select the interface(s) to use for the tcp module by setting the `FI_TCP_IFACE` variable.
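For example, a sketch forcing the tcp provider onto the regular Ethernet interface (the interface name `eno2` is reused from the example earlier on this page; check yours with `ip a`):

```shell
# Force libfabric's tcp provider on a specific interface
FI_PROVIDER=tcp FI_TCP_IFACE=eno2 mpirun -machinefile $OAR_NODEFILE ~/mpi/tp
```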
Choosing the transport layer with UCX
Similarly to OFI, you can use an environment variable to tell UCX to use a specific transport layer: `UCX_TLS`. The command `ucx_info -d` gives information about the available transport layers. You can select the interface(s) to use by setting the `UCX_NET_DEVICES` variable.
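For example (the values are illustrative; list the valid transports and devices with `ucx_info -d`):

```shell
# Restrict UCX to TCP transports on a specific device
UCX_TLS=tcp UCX_NET_DEVICES=eno2 mpirun -machinefile $OAR_NODEFILE ~/mpi/tp
```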
More advanced use cases
Running MPI on several sites at once
In this section, we are going to execute an MPI application over several Grid'5000 sites. In this example we will use the following sites: Rennes, Sophia and Grenoble, using funk for making the reservation (see the Advanced OAR tutorial for more information).
The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with rsync. Suppose that you are connected to Sophia and that you want to copy Sophia's mpi/ directory to Grenoble and Rennes.
(you can also add the --delete option to remove extraneous files from the mpi directories of Rennes and Grenoble).
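A sketch of those copies from the Sophia frontend (inside Grid'5000, the other sites' frontends are reachable as `<site>.grid5000.fr`):

```shell
# Mirror the mpi/ directory to the Rennes and Grenoble home directories
rsync -avz ~/mpi/ rennes.grid5000.fr:mpi/
rsync -avz ~/mpi/ grenoble.grid5000.fr:mpi/
```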
Reserve nodes in each site from any frontend with Funk (you can also add options to reserve nodes from specific clusters if you want to):
Get the node list using oarstat and copy the list to the first node:
Connect to the first node:
And run your MPI application:
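The launch step typically needs `oarsh` as the connector, as in the conda examples earlier on this page; a sketch, assuming the merged node list was copied to `~/machinefile` on the first node (that file name is illustrative):

```shell
# Run across the three sites, using oarsh to reach the nodes
mpirun -machinefile ~/machinefile --mca plm_rsh_agent oarsh ~/mpi/tp
```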
MPI and GPU to GPU communications
Direct GPU to GPU exchange of data (without having to transit on system RAM) is available with OpenMPI. On Grid'5000, this feature is currently only supported for AMD GPUs on the same node (see bug #13559 and bug #13540 for progress on other use cases).
This feature is available using the module version of Open MPI. We will use the OSU benchmark to demonstrate GPU to GPU communication:
module load hip
module load openmpi
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xf osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9/
./configure --enable-rocm --with-rocm=$(hipconfig -p) CC=mpicc CXX=mpicxx LDFLAGS="-L$(ompi_info | grep Prefix: | cut -d': ' -f2)/lib/ -lmpi -L$(hipconfig -p)/lib/ $(hipconfig -C) -lamdhip64" CPPFLAGS="-std=c++11"
make
Then run OSU benchmark:
mpirun -np 2 --mca pml ucx -x UCX_TLS=self,sm,rocm mpi/pt2pt/osu_bw -d rocm D D
The benchmark will report GPU to GPU bandwidth.
See also https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI
FAQ
Passing environment variables to nodes
While some batch schedulers (e.g. Slurm) transparently pass environment variables from the head node shell to all execution nodes given to `mpirun`, OAR does not (OAR provides no more than what OpenSSH does, whether `oarsh` or `ssh` is used as the connector). Thus OAR leaves the responsibility of passing environment variables to mpirun.
Therefore, in order to have more than the default environment variables (the `OMPI_*` variables) passed/set on the execution nodes, there are several options:
- Use the `-x VAR` option of `mpirun`, possibly repeated for each variable to pass. Example:
mpirun -machinefile $OAR_NODE_FILE -x MY_ENV1 -x MY_ENV2 -x MY_ENV3="value3" ~/bin/mpi_test
- Use the `--mca mca_base_env_list "ENV[;...]"` option of `mpirun`. Example:
mpirun -machinefile $OAR_NODE_FILE --mca mca_base_env_list "MY_ENV1;MY_ENV2;MY_ENV3=value3" ~/bin/mpi_test
- Set the `mca_base_env_list "ENV[;...]"` option in the `~/.openmpi/mca-params.conf` file. This way, passing variables becomes transparent to the `mpirun` command line, which becomes:
mpirun -machinefile $OAR_NODE_FILE ~/bin/mpi_test
- Note: `-x` and `--mca mca_base_env_list` cannot coexist.
This can be especially useful to pass OpenMP variables, such as `OMP_NUM_THREADS`.
More info in the Open MPI manual pages.