GNU Parallel: Difference between revisions
Revision as of 10:17, 7 April 2020
This page describes the use of GNU Parallel on Grid'5000.
Quoting the GNU Parallel website:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
For more general and complete information, see the GNU Parallel website.
This page details the Grid'5000-specific information you need to benefit from the tool on the platform.
About the GNU Parallel version installed in Grid'5000
The version of GNU Parallel installed on Grid'5000 nodes comes from Debian's official packaging.
It is a rather old version, but it seems sufficient.
If you need a more recent version, you can get the tarball provided at http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2 and install it in your home directory. This is straightforward (e.g. ./configure --prefix=$HOME/parallel && make install).
(An environment module could be provided if requested by some users.)
Benefits of using GNU Parallel on Grid'5000
While OAR is the Resource and Job Management System of Grid'5000 and supports the management of batches of jobs, its use may be overkill to handle SPMD parallel executions of small tasks within a larger reservation. In concrete terms, a user may create a first OAR job in order to book a large set of resources for some time (e.g. for the night), and then have to submit a batch of many small tasks (e.g. each using only one core) within that first job.
For that purpose, using an OAR container job for the first job and OAR inner jobs for the small tasks is overkill. (Note, however, that container and inner jobs make sense when the jobs do not all belong to the same user, for instance in tutorials.)
We strongly advise using GNU Parallel to handle the execution of the small tasks within the initial OAR reservation of resources. That means creating only one OAR job to book the large set of resources (not using the container job type), then using GNU Parallel within this job.
Note that when GNU Parallel handles the small tasks, OAR restrictions (e.g. a maximum of 200 jobs in the queue) do not apply.
How to use GNU Parallel in Grid'5000
GNU Parallel must be used within an OAR job: GNU Parallel does not book resources, it only manages the concurrent/parallel execution of tasks on already reserved resources.
- Single node
Within an OAR job of only 1 node (host), there is nothing specific to Grid'5000 to know about using GNU Parallel to exploit all the cores of the node. Just run the parallel command in the job shell. See the GNU Parallel documentation or manual page for more information.
- Multiple nodes
Within an OAR job of many nodes (hosts), the user needs to tell GNU Parallel how to remotely execute the tasks on the nodes reserved in the OAR job, from the head node (where the parallel command is to be run).
- The user has to provide the list of target nodes to execute on, passed to the GNU Parallel --slf (--sshloginfile) option. The OAR node file $OAR_NODEFILE can be used for this. Note that this file contains one line with the node name per core of that node. These duplicates are not an issue for GNU Parallel, as it preprocesses the file with a uniq.
- The user has to use the oarsh connector (unless the -t allow_classic_ssh OAR job type was used), by passing it to the GNU Parallel --ssh option.
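The layout of the node file described above can be checked with standard tools. The following is a minimal sketch, not part of the original page: the helper names unique_nodes and cores_per_node are hypothetical, and the sample file with made-up host names is only there so that the sketch also runs outside an OAR job.

```shell
#!/bin/bash
# Sketch: inspect an OAR node file (one line per reserved core).
# Outside an OAR job, fall back to an illustrative sample file.
nodefile="${OAR_NODEFILE:-}"
if [ -z "$nodefile" ]; then
    nodefile="$(mktemp)"
    printf '%s\n' node-1 node-1 node-2 node-2 > "$nodefile"  # hypothetical hosts
fi

# List each reserved node once.
unique_nodes() {
    sort -u "$1"
}

# Print "<line count> <node>" for each node, i.e. its number of reserved cores.
cores_per_node() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

unique_nodes "$nodefile"
cores_per_node "$nodefile"
```

Since GNU Parallel is stated above to handle the duplicates itself, this is only useful for checking what was reserved.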
Typical usage and some examples
Typical coupling of GNU Parallel and OAR in a multi-node reservation
- Create an OAR job of 10 nodes
We create an interactive job for this example, so that the commands below are executed in the opened job shell.
Note, however, that all this can be scripted and passed to the oarsub command of a non-interactive OAR job.
- Run parallel with the --ssh and --sshloginfile options
Note: GNU Parallel has many, many, many features to pass
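As a hedged illustration of the two steps above, the sketch below assembles the GNU Parallel command line for a multi-node job. It assumes an interactive job shell opened with something like "oarsub -I -l nodes=10". The helper build_parallel_cmd, the placeholder node file path and the echoed task are illustrative and do not come from the original page. The command is only printed here. Running it requires being on the head node of an OAR job.

```shell
#!/bin/bash
# Sketch: build the GNU Parallel command line for a multi-node OAR job.
build_parallel_cmd() {
    local nodefile="$1"; shift
    # --ssh oarsh: connect to the nodes through the OAR connector
    # --sshloginfile: file listing the target nodes
    PARALLEL_CMD=(parallel --ssh oarsh --sshloginfile "$nodefile" "$@")
}

# Illustrative task: print the argument and the name of the executing node.
build_parallel_cmd "${OAR_NODEFILE:-/path/to/nodefile}" 'echo task {} on $(hostname)' ::: 1 2 3

# Show the command that would be executed.
printf '%s ' "${PARALLEL_CMD[@]}"; echo
# Inside the OAR job, run it with: "${PARALLEL_CMD[@]}"
```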
Example 1: illustrating the execution of the GNU Parallel tasks in a Gantt chart
In this example we will build a Gantt diagram showing a timeline of the execution of GNU Parallel tasks within an OAR job.
Here is an example of such a diagram. It illustrates how GNU Parallel dispatches the tasks on the resources previously reserved in an OAR job. This complements the OAR drawgantt display by showing what is happening inside the OAR job.
The steps to build such a diagram are given below.
- Initiate your workspace
Our workspace is on the nantes frontend (any other site would do as well), in our ~/public directory, because we will open the Gantt chart in a web browser at the end.
mkdir -p ~/public/parallel
cd ~/public/parallel
- Create the GNU Parallel task script "task.sh"
This is where the actual user program (e.g. a numerical simulation) is to be placed. The parameters of the user program ($@) are passed by GNU Parallel.
#!/bin/bash
DURATION=$(($RANDOM % 22 + 4))
echo "Running $0 with params $* on $(hostname)... (task will run for $DURATION s)."
sleep "$DURATION"
Here, for the example, we just basically sleep for a random time, from 4 to 25s.
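The 4 to 25 s range follows from the arithmetic: $RANDOM % 22 yields a value from 0 to 21, and adding 4 shifts it to 4..25. The short sketch below (an illustration only, with an arbitrary sample count of 1000) draws durations with the same formula and tracks the observed bounds.

```shell
#!/bin/bash
# Sketch: check the range produced by the DURATION formula of task.sh.
min=25
max=4
for _ in $(seq 1000); do
    d=$(( RANDOM % 22 + 4 ))   # same formula as in task.sh
    if [ "$d" -lt "$min" ]; then min=$d; fi
    if [ "$d" -gt "$max" ]; then max=$d; fi
done
echo "observed durations between ${min}s and ${max}s"
```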
- Create the OAR script "oarjob.sh"
#!/bin/bash
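# Abort unless we are running inside an OAR job
[ -z "$OAR_NODEFILE" ] && echo "Error: Not in an OAR job" 1>&2 && exit 1
# Work from the directory containing this script
cd "${0%/*}" || exit 1
# Number of tasks to run (first argument, default 10)
TASK_COUNT=${1:-10}
seq "$TASK_COUNT" | parallel --joblog "$PWD/parallel.log" --bar --ssh oarsh --sshloginfile "$OAR_NODEFILE" --use-cpus-instead-of-cores "$PWD/task.sh"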
Here we simply pass a list of parameters to parallel on its standard input, using the seq command. The parameters are thus the integers from 1 to TASK_COUNT: TASK_COUNT tasks will be executed, each taking one of those integers as its parameter. See the GNU Parallel documentation for more details and other options.
We use the --use-cpus-instead-of-cores option of GNU Parallel so that it counts CPUs rather than cores, and thus runs only 2 tasks at a time on each of the bi-CPU nodes used below.
The script will start the GNU Parallel command from the head node of the OAR job, and generate the "parallel.log" file.
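The job log written by --joblog is a tab-separated text file with one header line, followed by one line per finished task (columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command). As a minimal sketch, it can be summarized per host with awk. The function name summarize_joblog and the sample log with made-up hosts and timings are illustrative assumptions, not real measurements.

```shell
#!/bin/bash
# Sketch: count tasks and sum runtimes per host from a GNU Parallel job log.
# Columns used: 2 = Host, 4 = JobRuntime (seconds); line 1 is the header.
summarize_joblog() {
    LC_ALL=C awk -F '\t' '
        NR > 1 { n[$2]++; t[$2] += $4 }
        END { for (h in n) printf "%s %d %.1f\n", h, n[h], t[h] }' "$1" | sort
}

# Illustrative sample log (hypothetical hosts and timings).
sample="$(mktemp)"
{
    printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n'
    printf '1\tnode-1\t1586250000.000\t4.0\t0\t60\t0\t0\t./task.sh 1\n'
    printf '2\tnode-2\t1586250000.100\t12.0\t0\t60\t0\t0\t./task.sh 2\n'
    printf '3\tnode-1\t1586250004.200\t6.5\t0\t60\t0\t0\t./task.sh 3\n'
} > "$sample"

# Output format: <host> <task count> <total runtime in seconds>
summarize_joblog "$sample"
```

With the real log, the call would be summarize_joblog parallel.log, run from the directory of the scripts.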
- Create the html/javascript code to render the gantt chart of the tasks in "ganttchart.html"
The parallel.log file is generated while GNU Parallel is executing the tasks. We create the ganttchart.html web page to render the Gantt chart from it.
<html>
<head>
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
google.charts.load("current", {packages:["timeline"]});
google.charts.setOnLoadCallback(getLogs);
function getLogs(){
var request = new XMLHttpRequest();
request.open('GET', './parallel.log?_=' + new Date().getTime(), true); //force cache miss
request.responseType = 'text';
request.send(null);
request.onreadystatechange = function () {
if (request.readyState === 4 && request.status === 200) {
var lines = request.responseText.split('\n').filter(function(line) { return line.length > 0; });
lines.shift();
var data = lines.map(function(line) {
var a = line.split('\t');
return [ a[1], "Task "+a[0], new Date(a[2]*1000), new Date(a[2]*1000+a[3]*1000) ];
});
data.sort((a, b) => (a[0] > b[0]) ? 1 : ((a[0] < b[0]) ? -1 : ((a[1] > b[1]) ? 1 : -1)));
var container = document.getElementById('gantt');
var chart = new google.visualization.Timeline(container);
var dataTable = new google.visualization.DataTable();
dataTable.addColumn({ type: 'string', id: 'Ressource' });
dataTable.addColumn({ type: 'string', id: 'JobId' });
dataTable.addColumn({ type: 'date', id: 'Start' });
dataTable.addColumn({ type: 'date', id: 'End' });
dataTable.addRows(data);
chart.draw(dataTable);
setTimeout(getLogs, 3000);
}
}
}
</script>
</head>
<body>
<div id="gantt" style="height: 100%;"></div>
</body>
</html>
- Submit the OAR job named OAR2parallel requesting 8 nodes of the ecotype cluster, and executing the oarjob.sh script.
oarsub -n OAR2parallel -l nodes=8 -p "cluster='ecotype'" "./oarjob.sh 80"
80 tasks to run: roughly 10 tasks per node, 5 per CPU (bi-CPU nodes), but let's compare this with the actual task dispatch in the Gantt chart below!
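The expected distribution is simple integer arithmetic, sketched below with the values of this example (80 tasks, 8 nodes, 2 CPUs per node). The variable names are illustrative.

```shell
#!/bin/bash
# Sketch: expected task distribution for the example job.
tasks=80
nodes=8
cpus_per_node=2   # bi-CPU nodes, one task slot per CPU

slots=$(( nodes * cpus_per_node ))   # tasks that can run at the same time
per_node=$(( tasks / nodes ))
per_cpu=$(( tasks / slots ))

echo "$slots tasks at a time, about $per_node tasks per node and $per_cpu per CPU"
```

Because task durations are random, the actual dispatch will generally deviate from this even split.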
- Open the ganttchart.html web page
Open the web page in a web browser on your workstation:
- either using the Grid'5000 VPN, so that you can access the internal URL at: http://public.nantes.grid5000.fr/~YOUR_LOGIN/parallel/ganttchart.html
- or using the API reverse proxy service, at: https://api.grid5000.fr/stable/grid5000/sites/nantes/public/YOUR_LOGIN/parallel/ganttchart.html
(Remember to replace YOUR_LOGIN with your actual Grid'5000 user login.)
The Gantt chart will show up and refresh every 3 seconds while GNU Parallel executes tasks. It should look very much like the diagram shown at the beginning of this section.
- Show the progress
GNU Parallel progress (--bar option) is written to the OAR job stderr file. We can look at it in a loop:
OAR_JOB_ID=$(oarstat -u | grep OAR2parallel | cut -f 1 -d ' ')
while oarstat -u | grep -q "$OAR_JOB_ID"; do
sleep 1 && [ -r "OAR.OAR2parallel.$OAR_JOB_ID.stderr" ] && cat "OAR.OAR2parallel.$OAR_JOB_ID.stderr"
done
28% 23:57=47s 39
Wait until it reaches 100% and the OAR job finishes.
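To print only the latest percentage rather than the whole file, a small filter can be used. This is a sketch under an assumption: the stderr file is taken to contain progress entries shaped like the sample line above, possibly separated by carriage returns. The real --bar output may contain additional terminal control sequences. The function name last_progress and the sample file are hypothetical.

```shell
#!/bin/bash
# Sketch: extract the most recent "NN%" from a progress file.
last_progress() {
    # Turn carriage returns into newlines, keep percentages, take the last one.
    tr '\r' '\n' < "$1" | grep -o '[0-9]\{1,3\}%' | tail -n 1
}

# Illustrative sample mimicking two successive progress updates.
sample="$(mktemp)"
printf '10%% 8:72=60s 8\r28%% 23:57=47s 39\n' > "$sample"
last_progress "$sample"
```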
Example 2: Finding the best kernel parameters of an SVM algorithm
Let's consider an ML problem in which we want to find the best kernel parameters of an SVM algorithm. To do that, we first need to build several training models configured with different kernel parameter values, and later compute precision and recall for all these models to find the best one. In this example, we only focus on building all the training models.
We consider a Python program training.py that takes as arguments the two kernel parameters for the SVM problem: --gamma and --c.
We want to generate the SVM models for:
- gamma values ranging in [0.1, 0.01, 0.001, 0.0001]
- C values ranging in [1, 10, 100, 1000]
The number of trainings corresponds to the cross-product of C values and gamma values. Considering the previous ranges, we need to build 16 SVM models. We can build the 16 SVM models iteratively but to speed up the process, especially for big data sets, it is better to parallelize and distribute the computation on multiple nodes.
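The cross-product can be enumerated with two nested loops. The sketch below only prints the 16 command lines and does not run any training. The array names are illustrative.

```shell
#!/bin/bash
# Sketch: enumerate the cross-product of gamma and C values.
gammas=(0.1 0.01 0.001 0.0001)
cs=(1 10 100 1000)

combos=()
for gamma in "${gammas[@]}"; do
    for c in "${cs[@]}"; do
        combos+=("--gamma $gamma --c $c")
    done
done

# One command line per training.
printf './training.py %s\n' "${combos[@]}"
echo "${#combos[@]} trainings"
```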
As our SVM algorithm is mono-threaded and the nodes we want to use have 4 cores, we need at least 4 nodes to distribute the 16 computations. To do so, we first make an OAR reservation to book the 4 nodes, and then we execute the following command to start 4 computations on each of the 4 nodes:
On node-1:
parallel --ssh oarsh --sshloginfile $OAR_FILE_NODES ./training.py --gamma {1} --c {2} ::: 0.1 0.01 0.001 0.0001 ::: 1 10 100 1000
As a result, this command line will execute (the order may be different):
- on node 1
./training.py --gamma 0.1 --c 1
- on node 1
./training.py --gamma 0.1 --c 10
- on node 1
./training.py --gamma 0.1 --c 100
- on node 1
./training.py --gamma 0.1 --c 1000
- on node 2
./training.py --gamma 0.01 --c 1
- on node 2
./training.py --gamma 0.01 --c 10
- on node 2
./training.py --gamma 0.01 --c 100
- on node 2
./training.py --gamma 0.01 --c 1000
- on node 3
./training.py --gamma 0.001 --c 1
- on node 3
./training.py --gamma 0.001 --c 10
- on node 3
./training.py --gamma 0.001 --c 100
- on node 3
./training.py --gamma 0.001 --c 1000
- on node 4
./training.py --gamma 0.0001 --c 1
- on node 4
./training.py --gamma 0.0001 --c 10
- on node 4
./training.py --gamma 0.0001 --c 100
- on node 4
./training.py --gamma 0.0001 --c 1000