Revision as of 09:48, 3 August 2022
Dask-jobqueue
Dask-jobqueue is a Python library that makes it easy to deploy Dask on the job queuing systems commonly found in high-performance supercomputers, academic research institutions, and other clusters. Dask is a Python library for parallel computing that scales Python code from multi-core local machines to large distributed clusters in the cloud. Since Dask-jobqueue provides interfaces for both OAR-based and Slurm-based clusters, it can be used to ease the transition between OAR and Slurm resource managers. Source code, issues and pull requests can be found here.
Installing
pip can be used to install dask-jobqueue and its dependencies:
pip install dask-jobqueue --upgrade
Alternatively, conda can be used to install dask-jobqueue from the conda-forge channel:
conda install dask-jobqueue -c conda-forge
Basic usage
Here is an example Python script that requests a well-defined resource (2 cores, 24GB, at least 1 GPU, a specific cluster, chifflet, for 1 hour) and runs a batch script on it:
from dask_jobqueue import OARCluster as Cluster
from dask.distributed import Client
import os
cluster = Cluster(
    queue='default',
    # Should be specified if you belong to more than one GGA
    project='<your grant access group>',
    # cores per job, required parameter
    cores=2,
    # memory per job, required parameter
    memory='24GB',
    # walltime for each worker job
    walltime='1:0:0',
    job_extra=[
        '-t besteffort',
        # reserve a node from a specific cluster
        '-p chifflet',
        # reserve a node with at least 1 GPU
        '-p "gpu_count >= 1"'
    ],
    # another way to reserve a node with a GPU
    #resource_spec='gpu=1'
)
# launch one worker job
cluster.scale(1)
# connect a client to the scheduler created by the cluster
client = Client(cluster)
# call your favorite batch script
client.submit(os.system, "./hello-world.sh").result()
client.close()
cluster.close()
The example script can be launched on the frontend as follows:
How is Dask-jobqueue interacting with OAR in practice?
In the example above, Dask-jobqueue first creates a Dask Scheduler in the Python process where the Cluster object is instantiated. The Cluster object describes the resources you would like to reserve on the nodes. To launch job(s) on those nodes for the computation, you need to tell the Dask Scheduler how many job(s) to start using the scale command. In the example above, only one job will be launched, with only one worker inside. For more options, please refer to the Advanced usage section.
The OAR job script generated for the example above is as follows:
#!/usr/bin/env bash
#OAR -n dask-worker
#OAR -q default
#OAR -l walltime=1:0:0
#OAR -t besteffort
#OAR -p chifflet
#OAR -p "gpu_count >= 1"
/usr/bin/python3 -m distributed.cli.dask_worker tcp://172.16.47.106:39655 --nthreads 1 --nprocs 2 --memory-limit 11.18GiB --name dummy-name --nanny --death-timeout 60 --protocol tcp://
You can display the generated OAR job script with:
print(cluster.job_script())
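The worker options on the last line of the generated script can be related back to the cluster parameters. The following is a minimal sketch, not the library's actual code: it assumes that the thread count per process is the job's cores divided by the number of processes, and that the job's memory is split evenly between processes. The helper name `worker_options` is hypothetical, and `processes=2` is an assumption chosen because it reproduces the `--nthreads 1 --nprocs 2 --memory-limit 11.18GiB` values shown above.

```python
def worker_options(cores, memory_bytes, processes):
    """Return (threads per process, memory limit per process as a GiB string).

    Hypothetical helper for illustration only; it is not part of dask-jobqueue.
    """
    # Threads available to each worker process
    nthreads = cores // processes
    # Even share of the job's memory, converted from bytes to GiB
    limit_gib = memory_bytes / processes / 2**30
    return nthreads, f"{limit_gib:.2f}GiB"


# 2 cores and 24GB (decimal gigabytes) split between 2 processes (assumed value)
nthreads, memory_limit = worker_options(cores=2, memory_bytes=24 * 10**9, processes=2)
print(nthreads, memory_limit)  # 1 11.18GiB
```

Under these assumptions, the difference between the requested 24GB and the 11.18GiB limit comes from both the split across processes and the conversion from decimal gigabytes to binary gibibytes.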
Advanced usage
Use a configuration file to specify resources
The resource request can also be specified in the user's configuration file, ~/.config/dask/jobqueue.yaml, as follows:
jobqueue:
  oar:
    name: dask-worker
    # Dask worker options
    cores: 2 # Total number of cores per job
    memory: '24GB' # Total amount of memory per job
    #processes: 1 # Number of Python processes per job
    #interface: null # Network interface to use: eth0 or ib0
    death-timeout: 60 # Number of seconds to wait if a worker can not find a scheduler
    #local-directory: null # Location of fast local storage like /scratch or $TMPDIR
    #extra: [] # Extra arguments to pass to Dask worker
    # OAR resource manager options
    #shebang: "#!/usr/bin/env bash"
    queue: 'default'
    #project: null
    walltime: '1:00:00'
    #env-extra: []
    #resource-spec: null
    job-extra: []
    log-directory: null
    # Scheduler options
    scheduler-options: {}
The cluster can then be instantiated with a single line, as follows:
from dask_jobqueue import OARCluster
cluster = OARCluster()
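The configuration file supplies defaults, and individual values can still be given when the cluster is instantiated. The following is a minimal sketch of that precedence, assuming that explicit keyword arguments override the values read from the file. It uses a plain dictionary in place of the YAML file, and the helper `resolve_options` is hypothetical, not a dask-jobqueue function.

```python
# Values mirroring the jobqueue.oar section of the configuration shown above
config_defaults = {
    "cores": 2,
    "memory": "24GB",
    "queue": "default",
    "walltime": "1:00:00",
    "job-extra": [],
}


def resolve_options(defaults, **overrides):
    """Merge explicit keyword arguments over configuration defaults.

    Hypothetical helper for illustration only. Keyword names use
    underscores while configuration keys use hyphens.
    """
    resolved = dict(defaults)
    for key, value in overrides.items():
        if value is not None:  # None means "fall back to the configuration"
            resolved[key.replace("_", "-")] = value
    return resolved


options = resolve_options(
    config_defaults, walltime="2:00:00", job_extra=["-t besteffort"]
)
print(options["walltime"], options["cores"])  # 2:00:00 2
```

In this sketch only walltime and job-extra change, while cores, memory and queue keep the values from the configuration.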
Cluster parameters
dask-jobqueue parameter | OAR command example | Slurm command example | Description |
---|---|---|---|
queue | #OAR -q | #SBATCH -p | Destination queue for each worker job |
project | #OAR --project | #SBATCH -A | Accounting group associated with each worker job |
cores | #OAR -l core=2 | #SBATCH --cpus-per-task=2 | Total cores per job |
memory | | #SBATCH --mem=24GB | Total memory per job |
walltime | #OAR -l walltime=hh:mm:ss | #SBATCH -t hh:mm:ss | Walltime for each worker job |
name | #OAR -n | #SBATCH -J | Name of worker, always set to the default value dask-worker |
resource_spec | #OAR -l host=1/core=2, gpu=1 | Not supported | Request resources and specify job placement |
job_extra | #OAR -O, -E | #SBATCH -o, -e | Log directory |
job_extra | #OAR -p parasilo | #SBATCH -C sirocco | Property request |
job_extra | #OAR -t besteffort | #SBATCH -t besteffort | Besteffort job |
job_extra | #OAR -r now | #SBATCH --begin=now | Advance reservation |
job_extra | #OAR --checkpoint 150 | #SBATCH --checkpoint 150 | Checkpoint |
job_extra | #OAR -a jobid | #SBATCH --dependency state:jobid | Jobs dependency |
Note: All experiments above were tested on Grid5000, an OAR-based cluster. Plafrim and Cleps were used as Slurm-based clusters to run the same experiments, in order to find the common concepts between OAR and Slurm. Since heterogeneities are still observed between Plafrim and Cleps today, the "Slurm command example" column of the table above will be updated once Slurm is fully supported by Inria's national computing infrastructure.
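To illustrate how the parameters in the table end up as OAR directives, here is a minimal sketch that rebuilds the header of the job script shown in the basic usage section. The function `oar_header` is hypothetical and is not the library's implementation. It only covers the name, queue, walltime and job_extra parameters, and the directive layout is copied from that generated script.

```python
def oar_header(name, queue, walltime, job_extra=()):
    """Build the directive lines of an OAR job script.

    Hypothetical helper for illustration only; it covers a subset of
    the parameters listed in the table.
    """
    lines = [
        "#!/usr/bin/env bash",
        f"#OAR -n {name}",
        f"#OAR -q {queue}",
        f"#OAR -l walltime={walltime}",
    ]
    # Each job_extra entry becomes its own "#OAR ..." directive line
    lines.extend(f"#OAR {extra}" for extra in job_extra)
    return "\n".join(lines)


header = oar_header(
    name="dask-worker",
    queue="default",
    walltime="1:0:0",
    job_extra=["-t besteffort", "-p chifflet", '-p "gpu_count >= 1"'],
)
print(header)
```

Each entry of job_extra is passed through unchanged, which is why any OAR option from the table can be expressed that way.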
Start multiple computations at once using 'scale' parameter
Dask-jobqueue creates a Dask Scheduler in the Python process where the cluster object is instantiated with the defined resource. Several ways exist to interact with the cluster through the Client (e.g. .submit or .map functions). Details about the Client of Dask can be found here.
In Dask, a Worker is a Python object and a node in the Dask cluster that serves data and performs computations; Jobs are resources submitted to, and managed by, the job queueing system (e.g. OAR, Slurm, etc.).
In Dask-jobqueue, a single Job may include one or more Workers.
The number of Workers per Job can be set with the processes parameter, as shown in the configuration above, if your job can be split into several processes.
To specify the number of Jobs, you can use the scale command. The number of Jobs can either be specified directly as shown in the example above, or indirectly by the cores or memory request:
# 2 jobs with 1 worker each will be launched
cluster.scale(2)
# specify total cores
cluster.scale(cores=4)
# specify total memory
cluster.scale(memory="48GB")
The cluster generates a traditional job script and submits it to the job queue the specified number of times.
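The relationship between a cores or memory request and the number of submitted jobs can be sketched as follows. This is a simplified model under stated assumptions, not the library's code: each job holds a single worker, and a request is rounded up to the number of whole jobs needed to cover it. The function `jobs_needed` is hypothetical. The per-job values (2 cores, 24GB) are those of the basic usage example.

```python
import math


def jobs_needed(cores_per_job, memory_per_job, n=None, cores=None, memory=None):
    """Estimate how many jobs cover a scale request (memory in bytes).

    Hypothetical helper for illustration only; assumes one worker per job.
    """
    jobs = n or 0
    if cores is not None:
        # Round up so the requested cores are fully covered
        jobs = max(jobs, math.ceil(cores / cores_per_job))
    if memory is not None:
        # Round up so the requested memory is fully covered
        jobs = max(jobs, math.ceil(memory / memory_per_job))
    return jobs


GB = 10**9
print(jobs_needed(2, 24 * GB, n=2))            # 2
print(jobs_needed(2, 24 * GB, cores=4))        # 2
print(jobs_needed(2, 24 * GB, memory=48 * GB)) # 2
print(jobs_needed(2, 24 * GB, cores=5))        # 3
```

With these settings, the three scale calls above would each correspond to two jobs, while a request that does not divide evenly, such as 5 cores, rounds up to the next whole job.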