Revision as of 15:01, 22 February 2023
Submitit
Submitit is a lightweight tool for submitting Python functions for computation on a Slurm cluster. It wraps job submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit lets you switch seamlessly between executing on Slurm and executing locally. Source code, issues and pull requests can be found here.
An OAR plugin is currently under development to ease switching between OAR-based and Slurm-based resource managers. Source code for the OAR plugin can be found here. For the first version, the supported and unsupported parameters are listed in the table below. The unsupported features are Slurm's notion of tasks, job memory management, checkpointing, job arrays and asynchronous jobs.
Comparison with Dask-jobqueue
The key difference is that Dask-jobqueue distributes jobs to a pool of Dask workers, while each Submitit job is submitted directly as a job on the cluster. In that sense Submitit is a lower-level interface than Dask-jobqueue: you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption or timeout. On the other hand, you should avoid submitting many small tasks with Submitit, since each one becomes an independent job and together they may overload the cluster, whereas Dask-jobqueue handles this case without any problem.
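One way to follow the advice about small tasks is to group them so that a single job processes a whole batch. The sketch below is a minimal illustration in plain Python; the helper names `chunked` and `run_batch` are hypothetical and are not part of Submitit.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def chunked(items: Sequence[Any], size: int) -> List[Sequence[Any]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batch(func: Callable[..., Any], arg_tuples: Iterable[Tuple[Any, ...]]) -> List[Any]:
    """Apply func to every argument tuple, all inside one function call."""
    return [func(*args) for args in arg_tuples]


def add(a, b):
    return a + b


pairs = [(i, i + 1) for i in range(10)]
batches = chunked(pairs, 4)  # 3 batches instead of 10 separate tasks

# Here the batches run in the current process. With Submitit, each batch
# could instead be submitted as one job: executor.submit(run_batch, add, batch)
results = [run_batch(add, batch) for batch in batches]
print(results)
```

The batch size is a trade-off: larger batches mean fewer jobs on the cluster, but less parallelism and more work lost if a job fails.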
Submitit installation
Using pip
pip can be used to install the stable release of submitit:
To use the latest version, which includes the OAR plugin, install from source:
frontend.site: pip install --user git+https://gitlab.inria.fr/moyens-de-calcul/submitit.git@master#egg=submitit
It is recommended to install Python dependencies in a virtual environment. To do so, create and activate the environment before running the pip command.
Using conda
conda can be used to install submitit from the conda-forge channel:
frontend.site: conda install -c conda-forge submitit
To use the latest version, which includes the OAR plugin, you can create a conda environment file (e.g. "conda-env-submitit.yml") as follows:
name: submitit
dependencies:
  - pip:
    - git+https://gitlab.inria.fr/moyens-de-calcul/submitit.git@master#egg=submitit
and then install the latest version of Submitit using this environment file:
frontend.site: conda env create --file conda-env-submitit.yml
frontend.site: source activate submitit
Basic usage
Performing an addition with Submitit
Here is an example Python script that runs an addition job on Slurm, OAR or locally.
import submitit

def add(a, b):
    return a + b

# logs are dumped in the folder
executor = submitit.AutoExecutor(folder="log_test")
job_addition = executor.submit(add, 5, 7)  # will compute add(5, 7)
output = job_addition.result()  # waits for completion and returns output
print('job_addition output: ', output)
assert output == 12
Launch the example script on the frontend; the addition job will be computed on the cluster. For each job, the working folder that you defined (e.g., folder="log_test") will contain a stdout log file jobId_log.out, a stderr log file jobId_log.err, a submission batch file jobId_submission.sh, a task file jobId_submitted.pkl and an output file jobId_result.pkl.
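To check which of these files exist for a given job, the naming scheme described above can be turned into paths. This is a minimal sketch: the helper `expected_job_files` and the job id "12345" are hypothetical and used only for illustration; only the file name patterns come from the description above.

```python
from pathlib import Path
from typing import Dict


def expected_job_files(folder: str, job_id: str) -> Dict[str, Path]:
    """Build the per-job file paths from the naming scheme described above."""
    base = Path(folder)
    return {
        "stdout": base / f"{job_id}_log.out",
        "stderr": base / f"{job_id}_log.err",
        "submission": base / f"{job_id}_submission.sh",
        "submitted": base / f"{job_id}_submitted.pkl",
        "result": base / f"{job_id}_result.pkl",
    }


# "12345" is a made-up job id; use the id reported for your own job.
files = expected_job_files("log_test", "12345")
for kind, path in files.items():
    status = "found" if path.exists() else "missing"
    print(f"{kind:>10}: {path} ({status})")
```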
Advanced usage
Parameters
Cluster parameters can be set with update_parameters(**kwargs).
The AutoExecutor shown in the basic usage example above is the common submission interface, for OAR/Slurm clusters and local jobs.
To use cluster-specific parameters with the AutoExecutor, they must be prefixed with the cluster name, e.g., slurm_partition="cpu_devel", oar_queue="default". These cluster-specific options are ignored on other clusters.
executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(slurm_partition="cpu_devel", oar_queue="default")
E.g. if both oar_walltime and timeout_min are provided, then:
- oar_walltime is used on the OAR cluster
- timeout_min is used on other clusters
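The selection rule above can be pictured with a small stand-alone function. This is only an illustration of the described behavior, not Submitit's internal code: the function `resolve_parameters`, the `CLUSTERS` tuple and the `SUPERSEDES` mapping are all assumptions made for the example.

```python
from typing import Any, Dict

# Assumed set of cluster name prefixes, for illustration only.
CLUSTERS = ("slurm", "oar", "local")

# Assumed mapping: the generic parameter that a cluster-specific one replaces.
SUPERSEDES = {
    ("oar", "walltime"): "timeout_min",
    ("slurm", "time"): "timeout_min",
}


def resolve_parameters(cluster: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Keep generic parameters and the ones prefixed for `cluster`."""
    generic: Dict[str, Any] = {}
    specific: Dict[str, Any] = {}
    for key, value in params.items():
        prefix, _, name = key.partition("_")
        if prefix in CLUSTERS and name:
            if prefix == cluster:
                specific[name] = value  # prefix stripped for the target cluster
            # options prefixed for another cluster are ignored
        else:
            generic[key] = value
    # A cluster-specific option wins over the generic one it replaces.
    for name in specific:
        generic.pop(SUPERSEDES.get((cluster, name)), None)
    return {**generic, **specific}


params = {"timeout_min": 5, "oar_walltime": "0:05:00", "slurm_partition": "cpu_devel"}
print(resolve_parameters("oar", params))
print(resolve_parameters("slurm", params))
```

On the hypothetical "oar" cluster only the walltime survives, while on "slurm" the generic timeout_min is kept alongside the partition.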
Cluster-specific parameters can also be used with the cluster-specific executors (e.g., SlurmExecutor, OarExecutor), in which case the cluster name prefix is dropped.
executor = submitit.OarExecutor(folder="log_test")
executor.update_parameters(walltime="0:0:5", queue="default")
The following table summarizes the parameters supported by AutoExecutor, OarExecutor and SlurmExecutor:
AutoExecutor | OarExecutor | SlurmExecutor | Description |
---|---|---|---|
timeout_min | walltime (hh:mm:ss) | time | timeout in minutes |
name | n | job_name | job name, 'submitit' by default |
nodes | nodes | nodes | number of nodes (int) |
(none) | oar_queue | slurm_partition | string |
gpus_per_node | gpu | gpus_per_node or --gres=gpu:xx | number of GPUs (int) |
stderr_to_stdout | not supported | stderr_to_stdout | boolean |
tasks_per_node | not supported | ntasks_per_node | int |
cpus_per_task | not supported | cpus_per_task | int |
mem_gb | not supported | mem | string |
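The first row of the table pairs a timeout in minutes with an OAR walltime in hh:mm:ss. The conversion between the two can be sketched as below; the helper `minutes_to_walltime` is hypothetical and is not part of Submitit or OAR.

```python
def minutes_to_walltime(timeout_min: int) -> str:
    """Convert a timeout in minutes to an hh:mm:ss walltime string."""
    if timeout_min < 0:
        raise ValueError("timeout_min must be non-negative")
    hours, minutes = divmod(timeout_min, 60)
    return f"{hours}:{minutes:02d}:00"


print(minutes_to_walltime(5))
print(minutes_to_walltime(90))
```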
Checkpointing with Submitit
Install scikit-learn (numpy, scipy required)