Run MPI On Grid'5000

Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

MPI is a programming interface that enables the communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.

Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial). For the second part of this tutorial, you should also know the basics about OARGRID (see the Advanced OAR tutorial) and Kadeploy (see the Getting Started tutorial).

Running MPI on Grid'5000

When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:

  • Setting up and starting Open MPI on a default environment using oarsh.
  • Setting up and starting Open MPI on a default environment using the allow_classic_ssh option.
  • Setting up and starting Open MPI to use high performance interconnect.
  • Setting up and starting Open MPI to run on several sites using oargridsub.
  • Setting up and starting your own Open MPI library inside a kadeploy image.

Using Open MPI on a default environment

The default Grid'5000 environment provides Open MPI 1.6.5 (see ompi_info).
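
If you want to double-check the exact version available on a frontend or node, a quick way (the output format may vary slightly between Open MPI versions) is:

frontend: ompi_info | grep "Open MPI:"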

Creating a sample MPI program

For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes and the value it received from process 0.

In your home directory, create a file ~/mpi/tp.c and copy the source code:

frontend: mkdir ~/mpi
frontend: vi ~/mpi/tp.c
#include <stdio.h>
#include <unistd.h> /* for gethostname */
#include <mpi.h>

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int i, pid;
       int bcast_value = 1;

       gethostname(hostname, sizeof hostname);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast(&bcast_value,1 ,MPI_INT, 0, MPI_COMM_WORLD );
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize();
       return 0;
}

You can then compile your code:

frontend: mpicc ~/mpi/tp.c -o ~/mpi/tp

Setting up and starting Open MPI on a default environment using oarsh

Submit a job:

frontend: oarsub -I -l nodes=3

oarsh is the remote shell connector of the OAR batch scheduler. It is a wrapper around the ssh command that handles the configuration of the SSH environment. You can connect to the reserved nodes using oarsh from the submission frontend of the cluster or from any node. As Open MPI defaults to using ssh for remote startup of processes, you need to add the option --mca orte_rsh_agent "oarsh" to your mpirun command line. Open MPI will then use oarsh in place of ssh.

Note: Debian Wheezy uses plm_rsh_agent instead of orte_rsh_agent.

node: mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp

You can also set an environment variable (usually in your .bashrc):

bashrc: export OMPI_MCA_orte_rsh_agent=oarsh
node: mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Open MPI also provides a configuration file for --mca parameters. In your home directory, create a file ~/.openmpi/mca-params.conf with the following content:

orte_rsh_agent=oarsh
filem_rsh_agent=oarcp

You should have something like:

helios-52       - 4 - 12 - 42
helios-51       - 0 - 12 - 42
helios-52       - 5 - 12 - 42
helios-51       - 2 - 12 - 42
helios-52       - 6 - 12 - 42
helios-51       - 1 - 12 - 42
helios-51       - 3 - 12 - 42
helios-52       - 7 - 12 - 42
helios-53       - 8 - 12 - 42
helios-53       - 9 - 12 - 42
helios-53       - 10 - 12 - 42
helios-53       - 11 - 12 - 42

You may get (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will learn how to select clusters with a high performance interconnect in greater detail below. Error messages might look like this:

[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: helios-8.sophia.grid5000.fr

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc

or like this:

[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...

You can use FAQ#How_to_use_MPI_in_Grid5000.3F to avoid these warnings.

Setting up and starting Open MPI on a default environment using allow_classic_ssh

If you prefer using ssh as a connector instead of oarsh, submit a job with the allow_classic_ssh type:

frontend: oarsub -I -t allow_classic_ssh -l nodes=3

Launch your parallel job:

node: mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Note: Using the allow_classic_ssh option bypasses the OAR resource confinement mechanism (cpuset), which restricts jobs to their assigned resources. Therefore, allow_classic_ssh cannot be used with jobs that share nodes between users (i.e. for reservations at the core level).

Setting up and starting Open MPI to use high performance interconnect

By default, Open MPI tries to use any high performance interconnect (e.g. Infiniband) it can find. Options are available to either select or disable an interconnect.

MCA parameters (--mca) can be used to select the drivers that are used at run-time by Open MPI. To learn more about the MCA parameters, see also:

  • The Open MPI documentation about Myrinet: https://www.open-mpi.org/faq/?category=myrinet

If you want to disable native support for high performance networks, use --mca btl self,sm,tcp. This will switch to the TCP backend of Open MPI.

Nodes with Infiniband interfaces also provide an IP over Infiniband interface (these interfaces are named ibX), which can still be used by the TCP backend. To also disable its use, add --mca btl_tcp_if_exclude ib0,lo,myri0, or select a specific interface with --mca btl_tcp_if_include eth1. This ensures that the 'regular' Ethernet interface is used.
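
For example, to force the sample program from the first section onto plain TCP over the Ethernet interface, these options can be combined on a single command line (a sketch that reuses the job, machinefile and binary from earlier in this tutorial):

node: mpirun --mca orte_rsh_agent "oarsh" --mca btl self,sm,tcp --mca btl_tcp_if_exclude ib0,lo,myri0 -machinefile $OAR_NODEFILE ~/mpi/tp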

We will be using NetPIPE to check the performance of high performance interconnects.

To download, extract and compile NetPIPE, do:

frontend: cd ~/mpi
frontend: wget http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz
frontend: tar -xf NetPIPE-3.7.1.tar.gz
frontend: cd NetPIPE-3.7.1
frontend: make mpi

As NetPIPE only works between two MPI processes, reserve only one core on each of two distinct nodes. If your reservation includes more resources, you will have to create an MPI machinefile with only two entries, as follows:

frontend: oarsub -I -l nodes=2
node: uniq $OAR_NODEFILE | head -n 2 > /tmp/machinefile
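
The resulting /tmp/machinefile simply lists one hostname per line, for example (the hostnames below are only illustrative):

griffon-1.nancy.grid5000.fr
griffon-2.nancy.grid5000.fr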

Infiniband hardware is available on several sites. For example, you will find clusters with Infiniband interconnect in Rennes (20G), Nancy (20G) and Grenoble (20G & 40G). Myrinet hardware (10G) is available at Lille and Sophia, but is not supported by the current version of the operating system used on Grid'5000 (Debian Jessie) (see the Hardware page).

To reserve one core on each of two distinct nodes with:

  • a 20G InfiniBand interconnect (DDR, Double Data Rate):
frontend: oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"
  • a 40G InfiniBand interconnect (QDR, Quad Data Rate):
frontend: oarsub -I -l /nodes=2/core=1 -p "ib40g='YES'"

To check if the support for InfiniBand is available in Open MPI, run:

node: ompi_info | grep openib

you should see something like this:

                MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)

To start the network benchmark, use:

node: cd ~/mpi/NetPIPE-3.7.1
node: mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi

Without a high performance interconnect, the results look like this:

    0:         1 bytes   4080 times -->      0.31 Mbps in      24.40 usec     
    1:         2 bytes   4097 times -->      0.63 Mbps in      24.36 usec     
    ...
    122: 8388608 bytes      3 times -->    896.14 Mbps in   71417.13 usec
    123: 8388611 bytes      3 times -->    896.17 Mbps in   71414.83 usec

The latency is given by the last column for a 1 byte message; the maximum throughput is given by the last line (896.17 Mbps in that case).


With Infiniband 40G (QDR), you should get much better performance than with Ethernet:

 0:       1 bytes  30716 times -->      4.53 Mbps in       1.68 usec
 1:       2 bytes  59389 times -->      9.10 Mbps in       1.68 usec
...
121: 8388605 bytes     17 times -->  25829.13 Mbps in    2477.82 usec
122: 8388608 bytes     20 times -->  25841.35 Mbps in    2476.65 usec
123: 8388611 bytes     20 times -->  25823.40 Mbps in    2478.37 usec

Less than 2 µs of latency and almost 26 Gbit/s of bandwidth!

More advanced use cases

Running MPI on several sites at once

In this section, we are going to run an MPI application over several Grid'5000 sites. In this example we will use the following sites: Rennes, Sophia and Grenoble, using oargrid to make the reservation (see the Grid_jobs_management tutorial for more information).

Warning: Open MPI tries to figure out the best network interface to use at run time. However, the selected network is not always the "production" Grid'5000 network, which is the one routed between sites. In addition, only the TCP implementation will work between sites, as high performance networks are only available within a site. To ensure the correct network is selected, add the options --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp to mpirun.

The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with rsync. Suppose that you are connected at Sophia and that you want to copy Sophia's mpi/ directory to Grenoble and Rennes.

fsophia: rsync -avz ~/mpi/ rennes.grid5000.fr:mpi/
fsophia: rsync -avz ~/mpi/ grenoble.grid5000.fr:mpi/

(you can also add the --delete option to remove extraneous files from the mpi directory of Rennes and Grenoble).
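
For example, the same command as above with --delete added:

fsophia: rsync -avz --delete ~/mpi/ rennes.grid5000.fr:mpi/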

Reserve nodes in each site from any frontend with oargridsub (you can also add options to reserve nodes from specific clusters if you want to):

frontend: oargridsub -w 02:00:00 rennes:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out

Get the oargrid Id and Job key from the output of oargridsub:

frontend: export OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " ")
frontend: export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)

Get the node list using oargridstat and copy the list to the first node:

frontend: oargridstat -w -l $OARGRID_JOB_ID | grep -v ^$ > ~/gridnodes
frontend: oarcp ~/gridnodes $(head -1 ~/gridnodes):

Connect to the first node:

frontend: oarsh $(head -1 ~/gridnodes)

And run your MPI application:

node: cd ~/mpi/
node: mpirun -machinefile ~/gridnodes --mca orte_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp


Make your own kadeploy image with the latest Open MPI version

Building a kadeploy image

If you need the latest Open MPI version, or want to build Open MPI with specific compilation options (like Myrinet/MX support), you must recompile Open MPI from sources. In this section we are going to build a kadeploy image that includes the latest version of Open MPI built from sources. Note that you could also build and install this custom Open MPI in your home directory, without deploying at all (i.e., using ./configure --prefix=$HOME/openmpi/), as sketched below.
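
As a rough sketch of that home-directory alternative (no deployment needed; the tarball and configure options are the same ones used below, and the final PATH line is an assumption you may need to adapt to your shell setup):

cd /tmp/
wget http://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2
tar -xf openmpi-1.10.2.tar.bz2
cd openmpi-1.10.2
# install under $HOME/openmpi/ instead of the system directories
./configure --prefix=$HOME/openmpi/ --with-memory-manager=none
make -j8
make install
# assumption: make the new mpicc/mpirun take precedence (add to ~/.bashrc to make it permanent)
export PATH=$HOME/openmpi/bin:$PATH
# recompile your program with this mpicc before running it with the new runtime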

This image will be based on jessie-x64-base. Let's deploy it:

frontend: oarsub -I -t deploy -l nodes=1,walltime=2
frontend: kadeploy3 -f $OAR_NODEFILE -e jessie-x64-base -k

Connect to the reserved node as root:

frontend: ssh root@$(head -1 $OAR_NODEFILE)

Download the Open MPI sources from the official website and extract them:

node: cd /tmp/
node: wget http://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2
node: tar -xf openmpi-1.10.2.tar.bz2
node: cd openmpi-1.10.2

Install build dependencies:

node: apt-get -y install g++ make gfortran f2c libblas-dev

Configure, compile and install:

node: ./configure --libdir=/usr/local/lib64 --with-memory-manager=none
node: make -j8
node: make install

To run our MPI applications, we create a dedicated user named mpi. We add it to the rdma group to allow access to the Infiniband hardware. We also copy the ~root/.ssh/authorized_keys file so that we can log in as user mpi from the frontend, and we create an SSH key for identifying the mpi user (needed by Open MPI).

useradd -m -g rdma mpi -d /var/mpi
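# allow unlimited locked memory, needed for Infiniband (RDMA) communication buffers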
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf

mkdir ~mpi/.ssh
cp ~root/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo "        StrictHostKeyChecking no" >> ~/.ssh/config
exit # exit session as MPI user
exit # exit the root connection to the node

# You can then copy your file from the frontend to the mpi home directory:
rsync -avz ~/mpi/ mpi@$(head -1 $OAR_NODEFILE):mpi/ # copy the tutorial

Save the newly created image by using tgz-g5k (run from the frontend):

frontend: ssh root@$(head -1 $OAR_NODEFILE) tgz-g5k > $HOME/public/jessie-openmpi.tgz

Copy the description file of jessie-x64-base:

frontend: kaenv3 -p jessie-x64-base | grep -v visibility > $HOME/public/jessie-openmpi.dsc

Change the image name in the description file; we will use an http URL for multi-site deployment:

frontend: sed -i "s,server://.*images/jessie-x64-base.*,http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/jessie-openmpi.tgz," $HOME/public/jessie-openmpi.dsc

Release your job:

frontend: oardel $OAR_JOB_ID

Using your Kadeploy image

Single site

Reserve some nodes and deploy them:

frontend: oarsub -I -t deploy -l /nodes=3
frontend: kadeploy3 -a $HOME/public/jessie-openmpi.dsc -f $OAR_NODEFILE -k

Copy the machines file and connect to the first node:

frontend: scp $OAR_NODEFILE mpi@$(head -1 $OAR_NODEFILE):nodes
frontend: ssh mpi@$(head -1 $OAR_NODEFILE)

Copy your MPI application to the other nodes and run it:

node: cd ~/mpi/
node: /usr/local/bin/mpicc tp.c -o tp
node: for node in $(uniq ~/nodes | grep -v $(hostname)); do scp ~/mpi/tp $node:~/mpi/tp; done
node: /usr/local/bin/mpirun -machinefile ~/nodes ./tp

Multiple sites

Choose three clusters from three different sites.

frontend: oargridsub -t deploy -w 02:00:00 cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2" > oargrid.out
frontend: export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)

Get the node list using oargridstat:

frontend: oargridstat -w -l $OARGRID_JOB_ID | grep grid > ~/gridnodes

Deploy on all sites using the --multi-server option:

frontend: kadeploy3 -f gridnodes -a $HOME/public/jessie-openmpi.dsc -k --multi-server

Copy the machines file and connect to the first node:

frontend: scp ~/gridnodes mpi@$(head -1 ~/gridnodes):
frontend: ssh mpi@$(head -1 ~/gridnodes)

Copy your MPI application to the other nodes and run it:

node: cd ~/mpi/
node: /usr/local/bin/mpicc tp.c -o tp
node: for node in $(uniq ~/gridnodes | grep -v $(hostname)); do scp ~/mpi/tp $node:~/mpi/tp; done
node: /usr/local/bin/mpirun -machinefile ~/gridnodes --mca btl self,sm,tcp --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca orte_keep_fqdn_hostnames 1 tp