From Grid5000
Revision as of 16:56, 16 January 2023

Submitit

Submitit is a lightweight tool for submitting Python functions for computation on a Slurm cluster. It wraps job submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit lets you switch seamlessly between executing on Slurm and executing locally. An OAR plugin is under development to ease moving between OAR-based and Slurm-based resource managers. Source code, issues and pull requests can be found here.

Basic usage

Submitit installation

pip can be used to install the stable release of submitit:

Terminal.png flille:
pip install submitit

Alternatively, conda can be used to install submitit from the conda-forge channel:

Terminal.png flille:
conda install -c conda-forge submitit

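An installation from source can also be used to get the latest version on the main branch:

Terminal.png flille:
pip install git+https://github.com/facebookincubator/submitit@main#egg=submitit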

Note.png Note

Submitit should be installed in a folder accessible by both the frontend and the nodes, e.g. your homedir.

Performing an addition with Submitit

Here is an example Python script that runs an addition job on Slurm, on OAR, or locally.

import submitit

def add(a, b):
    return a + b

# logs are dumped in the folder
executor = submitit.AutoExecutor(folder="log_test")

job_addition = executor.submit(add, 5, 7)  # will compute add(5, 7)
output = job_addition.result()  # waits for completion and returns output
print('job_addition output: ', output)
assert output == 12

The example script can be launched on the frontend as follows:

Terminal.png flille:
python3 this-script.py

The addition job will be computed on the cluster. For each job, in the working folder that you defined (e.g., folder="log_test"), you will find a stdout log file jobId_log.out, a stderr log file jobId_log.err, a submission batch file jobId_submission.sh, a task file jobId_submitted.pkl and an output file jobId_result.pkl.
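As a rough illustration of the file layout described above, the sketch below groups a list of file names by job ID. The helper `group_job_files` and the sample file names are hypothetical, and the suffixes are simply the ones listed on this page; actual file names may differ between Submitit versions.

```python
from collections import defaultdict

# Suffixes as listed on this page (assumption: real names may vary by version).
KNOWN_SUFFIXES = (
    "_log.out",
    "_log.err",
    "_submission.sh",
    "_submitted.pkl",
    "_result.pkl",
)


def group_job_files(filenames):
    """Group file names by job ID prefix (hypothetical helper)."""
    jobs = defaultdict(list)
    for name in filenames:
        for suffix in KNOWN_SUFFIXES:
            if name.endswith(suffix):
                job_id = name[: -len(suffix)]
                jobs[job_id].append(suffix.lstrip("_"))
                break  # a file matches at most one suffix
    return {job_id: sorted(kinds) for job_id, kinds in jobs.items()}


# Made-up listing of a log folder; unrelated files are ignored.
files = [
    "1234_log.out",
    "1234_log.err",
    "1234_submission.sh",
    "1234_submitted.pkl",
    "1234_result.pkl",
    "notes.txt",
]
grouped = group_job_files(files)
print(grouped)
```

In practice the list of names could come from `os.listdir("log_test")` once a job has run.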

Advanced usage

Parameters and configuration

Cluster parameters can be set with update_parameters(**kwargs).

The AutoExecutor shown in the basic usage example above is the common submission interface, for OAR/Slurm clusters and for local jobs.

To use cluster-specific options with the AutoExecutor, prefix them with the cluster name, e.g., slurm_partition="cpu_devel", oar_queue="default". Options prefixed for one cluster are ignored on other clusters.

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(slurm_partition="cpu_devel", oar_queue="default")

Alternatively, cluster-specific options can be used without the cluster name prefix on a cluster-specific executor, e.g., SlurmExecutor or OarExecutor.

executor = submitit.OarExecutor(folder="log_test")
executor.update_parameters(walltime="0:0:5", queue="default")
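The naming convention used in the two snippets above can be sketched with a small helper that adds the cluster prefix to plain option names. `with_cluster_prefix` is a hypothetical name introduced only for illustration; it is not part of Submitit, and the block runs without Submitit installed.

```python
def with_cluster_prefix(cluster, **options):
    """Rename each option to '<cluster>_<option>' (hypothetical helper)."""
    return {f"{cluster}_{key}": value for key, value in options.items()}


# Option names as used with the cluster-specific executors ...
slurm_options = with_cluster_prefix("slurm", partition="cpu_devel")
oar_options = with_cluster_prefix("oar", queue="default")

# ... merged into the prefixed form expected by AutoExecutor.
auto_kwargs = {**slurm_options, **oar_options}
print(auto_kwargs)

# With Submitit installed, the merged dictionary could be passed as:
#   executor = submitit.AutoExecutor(folder="log_test")
#   executor.update_parameters(**auto_kwargs)
```

Keeping both sets of prefixed options in one dictionary matches the behaviour described above: each cluster picks up its own options and ignores the others.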

The following table summarizes the parameters supported by AutoExecutor, OarExecutor and SlurmExecutor:

AutoExecutor      OarExecutor     SlurmExecutor     Description
timeout_min       oar_walltime    Example           Example
name              Example         Example           Example
nodes             oar_nodes       Example           Example
-                 oar_queue       slurm_partition   Example
stderr_to_stdout  not supported   Example           Example
tasks_per_node    not supported   Example           Example
cpus_per_task     not supported   Example           Example
mem_gb            not supported   Example           Example
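The table pairs timeout_min, which is expressed in minutes, with oar_walltime. As a minimal sketch of how the two units relate, the hypothetical helper below converts a number of minutes into a walltime string, assuming the hours:minutes:seconds form suggested by walltime="0:0:5" in the OarExecutor example. It does not describe how the OAR plugin itself performs the conversion.

```python
def minutes_to_walltime(timeout_min):
    """Convert minutes to an 'H:M:S' string (hypothetical helper, assumed format)."""
    hours, minutes = divmod(int(timeout_min), 60)
    return f"{hours}:{minutes}:0"


short_job = minutes_to_walltime(5)
long_job = minutes_to_walltime(90)
print(short_job, long_job)
```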

Comparison with Dask-jobqueue

The key difference is that Dask-jobqueue distributes jobs to a pool of Dask workers, whereas each Submitit job is submitted directly as a job on the cluster. In that sense Submitit is a lower-level interface than Dask-jobqueue: you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption or timeout. On the other hand, you should avoid submitting many small tasks with Submitit, since each one becomes an independent job and together they may overload the cluster. Dask-jobqueue handles such workloads without any problem.
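One common way to limit the number of independent jobs is to group small tasks into batches and submit one job per batch. The sketch below shows the idea with plain Python; `chunked` and `run_batch` are hypothetical helpers, the batch size of 4 is arbitrary, and the batches are executed locally so that the block runs without a cluster.

```python
def add(a, b):
    return a + b


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batch(func, batch):
    """Apply func to every argument tuple of a batch; meant to run as one job."""
    return [func(*args) for args in batch]


tasks = [(i, i + 1) for i in range(10)]  # ten small addition tasks
batches = list(chunked(tasks, 4))        # three batches instead of ten jobs

# Local dry run of what each job would compute.
local_results = [run_batch(add, batch) for batch in batches]
print(local_results)

# With Submitit installed, each batch could be submitted as a single job:
#   executor = submitit.AutoExecutor(folder="log_test")
#   jobs = [executor.submit(run_batch, add, batch) for batch in batches]
#   results = [job.result() for job in jobs]
```

The trade-off is coarser granularity: stdout, stderr and failures are then reported per batch and no longer per task.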