Run MPI On Grid'5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Running MPI on Grid'5000
When attempting to run MPI on Grid'5000 you'll be faced with a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:
- Setting up and starting Open MPI on a default environment using oarsh.
- Setting up and starting Open MPI to use high performance interconnect.
- Setting up and starting Open MPI to run on several sites using oargridsub.
- Setting up and starting Open MPI on a default environment using allow_classic_ssh.
- Setting up and starting Open MPI on a kadeploy image.
Several implementations of MPI exist: Open MPI, MPICH2, MPICH, LAM, etc.
In this practical session, we will focus on Open MPI.
Prerequisites
- Basic knowledge of MPI; if you don't know MPI, you can read: Grid_computation. For more comprehensive tutorials on MPI, see the IDRIS courses on MPI.
- Knowledge of OAR (Getting Started tutorial), and for the second part of this tutorial, basic knowledge of OARGRID (Advanced OAR tutorial) and Kadeploy (Getting Started tutorial).
Overview
Since June 2010, the same default environment is available on every site; therefore, you can use the default MPI library available in this environment (Open MPI 1.4.5).
Using Open MPI on a default environment
Create a sample MPI program
- We will use a very basic MPI program to test OAR/MPI. Create a file $HOME/src/mpi/tp.c and copy the following source code into it:

frontend: mkdir -p $HOME/src/mpi
frontend: vi $HOME/src/mpi/tp.c

The source code:
#include <stdio.h>
#include <unistd.h> /* for gethostname */
#include <mpi.h>
#include <time.h>   /* for the work function only */

int main(int argc, char *argv[]) {
    char hostname[257];
    int size, rank;
    int i, pid;
    int bcast_value = 1;

    gethostname(hostname, sizeof hostname);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (!rank) {
        bcast_value = 42;
    }
    MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
This program uses MPI to communicate between processes; the MPI process of rank 0 will broadcast an integer (value 42) to all the other processes. Then, each process prints its rank, the total number of processes, and the value it got from process 0.
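If you are curious how long the broadcast itself takes, you can time it with MPI_Wtime. The program below is only a sketch (the timing code and variable names are not part of the tutorial's tp.c), but it compiles and runs with the same mpicc and mpirun commands:

#include <stdio.h>
#include <mpi.h>

/* Sketch: time a one-integer broadcast from rank 0 to all other ranks. */
int main(int argc, char *argv[]) {
    int rank, bcast_value = 1;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (!rank)
        bcast_value = 42;

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronize before timing */
    t0 = MPI_Wtime();
    MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (!rank)
        printf("broadcast of one int took %g seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}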
Setting up and starting Open MPI on a default environment using oarsh
Submit a job:

frontend: oarsub -I -l nodes=3

Compile your code:

node: mpicc src/mpi/tp.c -o src/mpi/tp

oarsh is the default connector used when you reserve a node. To be able to use this connector, you need to add the option --mca plm_rsh_agent "oarsh" to mpirun:

node: mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE $HOME/src/mpi/tp
You can also set an environment variable (usually in your .bashrc):

export OMPI_MCA_plm_rsh_agent=oarsh
Open MPI also provides a configuration file solution. In your home directory, create the file ~/.openmpi/mca-params.conf with the following content:

plm_rsh_agent=oarsh
filem_rsh_agent=oarcp
When you run tp, you should get an output like this:
helios-52 - 4 - 12 - 42
helios-51 - 0 - 12 - 42
helios-52 - 5 - 12 - 42
helios-51 - 2 - 12 - 42
helios-52 - 6 - 12 - 42
helios-51 - 1 - 12 - 42
helios-51 - 3 - 12 - 42
helios-52 - 7 - 12 - 42
helios-53 - 8 - 12 - 42
helios-53 - 9 - 12 - 42
helios-53 - 10 - 12 - 42
helios-53 - 11 - 12 - 42
You may get (lots of) warning messages if Open MPI doesn't find high performance hardware. Don't worry, this is normal, but you can follow FAQ#How_to_use_MPI_in_Grid5000.3F to avoid them. The warnings can look like this:
[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: helios-8.sophia.grid5000.fr
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
or like this:
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...
Exit the job.
Setting up and starting Open MPI to use high performance interconnect
By default, Open MPI tries to use any high performance interconnect it can find. This only works if the related libraries were found when Open MPI itself was compiled (not when your application was compiled). It should work if you built Open MPI on a wheezy-x64 environment, and it also works correctly on the default environment.

Note: If you want to disable support for high performance networks, use --mca btl self,sm,tcp, but beware: Open MPI will then use all available TCP networks, so IP over InfiniBand will also be used if present; add --mca btl_tcp_if_exclude ib0,lo,myri0 to also disable IP emulation of the high performance interconnect.
We will use the NetPIPE tool to check whether the high performance interconnect is really used. Download it from this URL: http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz

frontend: wget http://pkgs.fedoraproject.org/repo/pkgs/NetPIPE/NetPIPE-3.7.1.tar.gz/5f720541387be065afdefc81d438b712/NetPIPE-3.7.1.tar.gz
Note: Remember to configure the proxy if wget freezes on "connecting"; see the corresponding section of the Getting Started tutorial (https://www.grid5000.fr/mediawiki/index.php/Getting_Started#Customizing_nodes_and_accessing_the_Internet).

Reserve a node and go to your source directory:

frontend: oarsub -I
node: cd $HOME/src/mpi
Unarchive NetPIPE:

node: tar zvxf ~/NetPIPE-3.7.1.tar.gz
node: cd NetPIPE-3.7.1

Compile it:

node: make mpi
Infiniband hardware

Infiniband hardware is available on several sites (see the Hardware page):
- Rennes (20G)
- Nancy (20G)
- Grenoble (20G & 40G)
To reserve one core on two nodes with a 10G Infiniband interconnect:

frontend: oarsub -I -l /nodes=2/core=1 -p "ib10g='YES'"

or for 20G:

frontend: oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"
To test the network:

node: cd $HOME/src/mpi/NetPIPE-3.7.1
node: mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi
To check if support for Infiniband is available in Open MPI, run:

node: ompi_info | grep openib

You should see something like this:
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)
With Infiniband 40G (QDR), you should get much better performance than with Ethernet or Myrinet 2G:
  0:       1 bytes  30716 times -->      4.53 Mbps in       1.68 usec
  1:       2 bytes  59389 times -->      9.10 Mbps in       1.68 usec
...
121: 8388605 bytes     17 times -->  25829.13 Mbps in    2477.82 usec
122: 8388608 bytes     20 times -->  25841.35 Mbps in    2476.65 usec
123: 8388611 bytes     20 times -->  25823.40 Mbps in    2478.37 usec
Less than 2 microseconds of latency and almost 26 Gbit/s of bandwidth!
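NetPIPE obtains these numbers by bouncing messages of increasing size between two MPI processes. As a rough illustration of the principle (a simplified sketch, not part of NetPIPE or of the tutorial's original code), a minimal ping-pong between ranks 0 and 1 looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: rank 0 sends a buffer to rank 1, which echoes it back;
 * rank 0 reports the average round-trip time. Run it with two processes. */
int main(int argc, char *argv[]) {
    int rank, i, iters = 1000, size = 1024;
    char *buf = malloc(size);
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d-byte round trip: %g usec on average\n",
               size, (t1 - t0) / iters * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it exactly like NPmpi (two processes, one per node) to compare the latency you measure with the NetPIPE output.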
Myrinet hardware

Myrinet hardware is available on several sites (see the Hardware page):
- Lille (10G)
- Sophia (10G)
To reserve one core on two nodes with a 10G Myrinet interconnect:

frontend: oarsub -I -l /nodes=2/core=1 -p "myri10g='YES'"

To test the network:

node: cd $HOME/src/mpi/NetPIPE-3.7.1
node: mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE NPmpi
You should see something like this:
  0:       1 bytes   4080 times -->      0.31 Mbps in      24.40 usec
  1:       2 bytes   4097 times -->      0.63 Mbps in      24.36 usec
...
122: 8388608 bytes      3 times -->    896.14 Mbps in   71417.13 usec
123: 8388611 bytes      3 times -->    896.17 Mbps in   71414.83 usec
The minimum latency is given by the last column of the 1-byte line; the maximum throughput is given by the last line, 896.17 Mbps in this case. Here a latency of 24 usec is very high, so Myrinet was not used as expected. This can happen if Open MPI did not find the MX libraries during its compilation. You can check this with:

node: ompi_info | grep mx
If the output is empty, there is no built-in MX support.
With a Myrinet 2G network, a typical result looks like this:
  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec
In this example, we get 3.77 usec of latency and almost 1.8 Gbit/s of bandwidth.
More advanced use cases
Running MPI on several sites at once
In this tutorial, we use the following sites: rennes, sophia and grenoble, but you can use any sites. To run MPI applications on several sites, we will be using oargrid. See the Grid_jobs_management tutorial for more information.

Warning: Open MPI tries to figure out the best network interface at run time, and it also assumes that some networks are not routed between sites. To avoid this kind of problem, we must add the options --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 to mpirun.
Note: For multiple sites, we may want to use only TCP, and not native MX or native InfiniBand; to do this, add this option to mpirun: --mca btl self,sm,tcp
Synchronize the src/mpi directory from the frontend to the two other sites (the tp binary must be available on all sites). Here we suppose we are connected to sophia and want to synchronize to grenoble and rennes:

frontend.sophia: ssh rennes mkdir -p src/mpi/
frontend.sophia: rsync --delete -avz ~/src/mpi/ rennes.grid5000.fr:src/mpi/
frontend.sophia: ssh grenoble mkdir -p src/mpi/
frontend.sophia: rsync --delete -avz ~/src/mpi/ grenoble.grid5000.fr:src/mpi/
Reserve nodes on the 3 sites with oargridsub (you can reserve nodes from specific clusters if you want to).
frontend: oargridsub -w 02:00:00 rennes:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out
Get the oargrid Id and job key from the output of oargridsub:

frontend: export OAR_JOB_KEY_FILE=`grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " "`
frontend: export OARGRID_JOB_ID=`grep "Grid reservation id" oargrid.out | cut -f2 -d=`
Get the node list using oargridstat and copy the list to the first node:

frontend: oargridstat -w -l $OARGRID_JOB_ID | grep -v ^$ > ~/gridnodes
frontend: oarcp ~/gridnodes `head -1 ~/gridnodes`:
Connect to the first node:

frontend: oarsh `head -1 ~/gridnodes`
And run your MPI application:

node: cd $HOME/src/mpi/
node: mpirun -machinefile ~/gridnodes --mca plm_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp
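tp already prints one line per process, but with a dozen ranks spread over three sites the output is interleaved and hard to read. If you want a single ordered report of where each rank runs, a small variation (a sketch, not part of the original tutorial code) can gather the processor names on rank 0:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: each rank reports its processor name; rank 0 gathers and prints
 * them, which makes it easy to check that the job really spans several sites. */
int main(int argc, char *argv[]) {
    int rank, size, len, i;
    char name[MPI_MAX_PROCESSOR_NAME] = "";
    char *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    if (rank == 0)
        all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);

    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++)
            printf("rank %d runs on %s\n", i, all + (size_t)i * MPI_MAX_PROCESSOR_NAME);
        free(all);
    }

    MPI_Finalize();
    return 0;
}

Compile it with mpicc as before and launch it with the same mpirun options as above.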
Compilation [optional]
If you want to use a custom version of Open MPI, you can compile it in your home directory.
- Make an interactive reservation and compile Open MPI on a node:

frontend: oarsub -I
node: cd /tmp/

- Get Open MPI (other versions are available at http://www.open-mpi.org/software/ompi/v1.4/):

frontend: export http_proxy=http://proxy:3128/
frontend: wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.gz
Unarchive Open MPI:

node: tar -xf ~/openmpi-1.4.3.tar.gz
node: cd openmpi-1.4.3

Configure it (wait ~1mn30s):

node: ./configure --prefix=$HOME/openmpi/ --with-memory-manager=none

The source tarball includes a small patch to make Open MPI work on Grid'5000 across several sites simultaneously. For the curious, the patch is:
--- ompi/mca/btl/tcp/btl_tcp_proc.c.orig    2010-03-23 14:01:28.000000000 +0100
+++ ompi/mca/btl/tcp/btl_tcp_proc.c         2010-03-23 14:01:50.000000000 +0100
@@ -496,7 +496,7 @@
                 local_interfaces[i]->ipv4_netmask)) {
                 weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
             } else {
-                weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
+                weights[i][j] = CQ_NO_CONNECTION;
             }
             best_addr[i][j] = peer_interfaces[j]->ipv4_endpoint_addr;
         }
Compile it (wait ~2mn30s):

node: make -j4

- Install it in your home directory (in $HOME/openmpi/):

node: make install
Then you can follow the same steps as before, but using $HOME/openmpi/bin/mpicc and $HOME/openmpi/bin/mpirun.
Setting up and starting Open MPI on a kadeploy image
Building a kadeploy image
The default Open MPI version available in Debian-based distributions is not compiled with high performance libraries like Myrinet/MX, therefore we must recompile Open MPI from source. Fortunately, every default image (wheezy-x64-XXX) except the min variant includes the libraries for high performance interconnects, and Open MPI will find them at compile time.
We will create a kadeploy image based on an existing one.

frontend: oarsub -I -t deploy -l nodes=1,walltime=2
frontend: kadeploy3 -f $OAR_NODEFILE -e wheezy-x64-base -k
Download Open MPI tarball if you don't already have it:
frontend: export http_proxy=http://proxy:3128/
frontend: wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.gz
Copy the Open MPI tarball to the first node:

frontend: scp openmpi-1.4.3.tar.gz root@`head -1 $OAR_NODEFILE`:/tmp
Then connect to the deployed node as root and install Open MPI:

frontend: ssh root@`head -1 $OAR_NODEFILE`
Unarchive Open MPI:

node: cd /tmp/
node: tar -xf openmpi-1.4.3.tar.gz
node: cd openmpi-1.4.3
Install gfortran, f2c and the BLAS library:

node: apt-get -y install gfortran f2c libblas-dev
Configure and compile:

node: ./configure --libdir=/usr/local/lib64 --with-memory-manager=none
node: make -j4
node: make install
Create a dedicated user named mpi, in the group rdma (for Infiniband):

useradd -m -g rdma mpi -d /var/mpi
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf
mkdir ~mpi/.ssh
cp ~/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
mkdir src
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo " StrictHostKeyChecking no" >> ~/.ssh/config
exit
exit
rsync -avz ~/src/mpi/ mpi@`head -1 $OAR_NODEFILE`:src/mpi/
ssh root@`head -1 $OAR_NODEFILE`
Create the image using tgz-g5k:

node: tgz-g5k /dev/shm/image.tgz
Disconnect from the node (exit). From the frontend, copy the image to the public directory:

frontend: mkdir -p $HOME/public
frontend: scp root@`head -1 $OAR_NODEFILE`:/dev/shm/image.tgz $HOME/public/wheezy-openmpi.tgz
Copy the description file of wheezy-x64-base:
frontend: grep -v visibility /grid5000/descriptions/wheezy-x64-base-1.4.dsc > $HOME/public/wheezy-openmpi.dsc
Change the image name in the description file; we will use an http URL for multi-site deployment:
perl -i -pe "s@server:///grid5000/images/wheezy-x64-base-1.4.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/wheezy-openmpi.tgz@" $HOME/public/wheezy-openmpi.dsc
Now you can terminate the job:
Using a kadeploy image
Single site

frontend: oarsub -I -t deploy -l /nodes=3
frontend: kadeploy3 -a $HOME/public/wheezy-openmpi.dsc -f $OAR_NODEFILE -k
frontend: scp $OAR_NODEFILE mpi@`head -1 $OAR_NODEFILE`:nodes

Connect to the first node and run the program:

frontend: ssh mpi@`head -1 $OAR_NODEFILE`
node: cd $HOME/src/mpi/
node: /usr/local/bin/mpicc tp.c -o tp
node: /usr/local/bin/mpirun -machinefile ~/nodes ./tp
Single site with Myrinet hardware

frontend: oarsub -I -t deploy -l /nodes=2 -p "myri10g='YES'"
frontend: kadeploy3 -k -a ~/public/wheezy-openmpi.dsc -f $OAR_NODEFILE
Create a nodefile with a single entry per node:

frontend: uniq $OAR_NODEFILE > nodes
Copy it to the first node:

frontend: scp nodes mpi@`head -1 nodes`:
Connect to the first node and run NetPIPE:

frontend: ssh mpi@`head -1 nodes`
node: cd $HOME/src/mpi/NetPIPE-3.7.1
node: /usr/local/bin/mpirun -machinefile ~/nodes NPmpi
This time we have:
  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec
This time we have 3.77 usec, which is good, and almost 1.8 Gbit/s: we are using the Myrinet interconnect!
Multiple sites
Choose three clusters from 3 different sites.
frontend: oargridsub -t deploy -w 02:00:00 cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2" > oargrid.out
frontend: export OARGRID_JOB_ID=`grep "Grid reservation id" oargrid.out | cut -f2 -d=`
Get the node list using oargridstat:

frontend: oargridstat -w -l $OARGRID_JOB_ID | grep grid > ~/gridnodes
Deploy on all sites using the --multi-server option:
frontend: kadeploy3 -f gridnodes -a $HOME/public/wheezy-openmpi.dsc -k --multi-server -o ~/nodes.deployed
frontend: scp ~/nodes.deployed mpi@`head -1 ~/nodes.deployed`:
Connect to the first node and run the program:

frontend: ssh mpi@`head -1 ~/nodes.deployed`
node: cd $HOME/src/mpi/
node: /usr/local/bin/mpirun -machinefile ~/nodes.deployed --mca btl self,sm,tcp --mca opal_net_private_ipv4 "192.168.7.0/24\;192.168.162.0/24\;192.168.160.0/24\;172.24.192.0/18\;172.24.128.0/18\;192.168.200.0/23" tp
Setting up and starting Open MPI on a default environment using allow_classic_ssh
Submit a job with the allow_classic_ssh type:

frontend: oarsub -I -t allow_classic_ssh -l nodes=3
Launch your parallel job:

node: mpirun -machinefile $OAR_NODEFILE $HOME/src/mpi/tp
MPICH2
Warning: This documentation is about using MPICH2 with the MPD process manager. But the default process manager for MPICH2 is now Hydra. See also: the MPICH documentation.
If you want/need to use MPICH2 on Grid'5000, you should do this:
First, you have to do this once (on each site)
Then you can use a script like this to launch mpd/mpirun:
NODES=`uniq < $OAR_NODEFILE | wc -l | tr -d ' '`
NPROCS=`wc -l < $OAR_NODEFILE | tr -d ' '`
mpdboot --rsh=oarsh --totalnum=$NODES --file=$OAR_NODEFILE
sleep 1
mpirun -n $NPROCS mpich2binary