Advanced OAR
Note | |
---|---|
This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team. |
This tutorial consists of independent sections describing details of OAR that are useful for advanced usage, as well as some tips and tricks. It assumes you are familiar with OAR and Grid'5000 basics. If not, please first look at the Getting Started page.
This OAR tutorial focuses on command-line usage. It assumes you are using the bash shell (but should be easy to adapt to another shell). It can be read linearly, but you may also pick individual sections. At least begin with #Useful tips.
OAR
Useful tips
- Take the time to carefully configure ssh, as described in the SSH page.
- Use screen or tmux so that your work is not lost if you lose the connection to Grid'5000. Moreover, keeping a screen session open with one or more shell sessions allows you to leave your work session when you want, then get back to it later and recover it exactly as you left it.
- Most OAR commands (oarsub, oarstat, oarnodes) can provide output in various formats:
  - text (this is the default mode)
  - Perl dumper (-D)
  - XML (-X)
  - YAML (-Y)
  - JSON (-J)
- Direct access to the OAR database: users can directly access the PostgreSQL OAR database oar2 on the server oardb.site.grid5000.fr with the read-only account oarreader. The password is read.
- Regarding the oarsub command line, you should mostly only see the word "host", but the oarsub command accepts "host" and "nodes" interchangeably in Grid'5000, as nodes is just an alias for host. Prefer using "host". Besides, the word "host" is also to be preferred to the longer "network_address" in resource filters (both properties sometimes have the same value, but not always).
- At job submission time, only important information is printed by oarsub. To get more details about what OAR does on Grid'5000 (such as the computed resource filter, exceptionally granted privileges, …), use the oarsub verbose option (-v).
- A syntax simplification mechanism was deployed on Grid'5000 to ease job submission, described at OAR Syntax simplification.
Connection to the job's nodes
Two commands can be used to connect to nodes on Grid'5000: oarsh and ssh.
ssh
ssh can only be used when a node is entirely reserved in your job (all CPU cores). In other cases, processes could not be assigned to the correct job, so connecting with ssh is not allowed.
For instance, when a node is entirely reserved as follows:
# Set walltime to default (3600 s).
OAR_JOB_ID=<JOB_ID>
# Interactive mode: waiting...
# Starting...
user@node-32:~$
If you open a new shell and try to connect to the node with ssh, it should work:
Linux node-32.site.grid5000.fr 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64
Debian11-x64-std-2022013022
(Image based on Debian Bullseye for AMD64/EM64T)
Maintained by support-staff <support-staff@lists.grid5000.fr>
Last login: Wed Feb 23 15:20:32 2022 from 172.16.31.101
user@node-32:~$
However, when reserving for instance only one CPU core of a node:
# Set walltime to default (3600 s).
OAR_JOB_ID=<JOB_ID>
# Interactive mode: waiting...
# Starting...
user@node-32:~$
When trying to connect to the node with ssh in another shell, you get:
To connect using 'ssh' directly, you must have a single job using all available cores on the node. Use 'oarsh' instead.
Connection closed by node-32 port 22
oarsh
oarsh is a frontend to ssh: the oarsh command wraps the OpenSSH ssh command to add the functions required to connect to a job. It provides mostly the same interface, although not all options are available. The oarsh internal mechanism at a glance:
- It opens an ssh connection as the oar user to the OAR dedicated SSH server running on a node (listening on port 6667).
- It detects who you are based on the job id or a job key: if you indeed have the right to connect to the node (you reserved it in an OAR job), it switches to your user for the execution of the shell or command on the node in the job's context (cgroup/cpuset).
When nodes are not fully reserved, you will normally use the oarsh command to connect to nodes instead of ssh, and oarcp instead of scp to copy files to/from the nodes. If you use taktuk for parallel executions (or a similar tool such as pdsh), or rsync to synchronize files to/from a node, you have to configure the connector so that the command uses oarsh instead of ssh underneath (see the man pages of the command to find out how to change the connector, e.g. using -c or -e).
Please note that oarsh always works, even for a node entirely reserved in one job.
oarsh and job keys
By default, OAR generates an SSH key pair for each job, and oarsh is used to connect to the job's nodes.
oarsh uses either the OAR_JOB_ID or the OAR_JOB_KEY_FILE environment variable to know which job to connect to. Outside a job shell (e.g. on the frontend), you have to set one of those environment variables. Inside a job shell, oarsh works directly because the variables are already set.
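A script that calls oarsh from a frontend can check beforehand that one of the two variables is available. The following is only a minimal sketch: oar_job_context is a hypothetical helper, not an OAR command, and the order in which the variables are tested is this sketch's choice, not necessarily the one oarsh applies.

```shell
#!/bin/bash
# Report which job context is available in the current shell.
# oar_job_context is a hypothetical helper, not part of OAR.
oar_job_context() {
    if [ -n "${OAR_JOB_ID:-}" ]; then
        echo "job id: $OAR_JOB_ID"
    elif [ -n "${OAR_JOB_KEY_FILE:-}" ]; then
        echo "job key: $OAR_JOB_KEY_FILE"
    else
        # Neither variable is set: oarsh cannot know which job to connect to.
        echo "no job context: set OAR_JOB_ID or OAR_JOB_KEY_FILE" >&2
        return 1
    fi
}
```

A script could, for instance, run oar_job_context > /dev/null || exit 1 before its first oarsh call.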
- Using OAR_JOB_ID
For instance, create a job requesting 3 hosts (3 nodes):
# Set default walltime to 3600.
OAR_JOB_ID=<JOBID>
# Interactive mode: waiting...
# Starting...
...
Then, in another terminal, assuming the 2nd host in the job is named node-2:
- Using OAR_JOB_KEY_FILE
OAR can expose the job key, using the -e option of oarsub:
Then, in another terminal, assuming the 2nd host in the job is named node-2:
- Alternative
Note that the following command also allows getting a shell in a job, but only on the first default resource (i.e. node).
sharing keys between jobs
Telling OAR to always use the same key may be very convenient.
For that, you must have a passphrase-less SSH key to serve as a job key (e.g. one dedicated to navigating inside Grid'5000; do not use your general-purpose SSH private key!). In your ~/.profile or ~/.bash_profile, set:
export OAR_JOB_KEY_FILE=path_to_your_private_key
Then, OAR will always use that key for all jobs, allowing you to connect to your nodes with oarsh seamlessly from outside the jobs (without worrying about setting the environment variable for each job).
Moreover, if that key is replicated between all Grid'5000 sites, and assuming the environment variable OAR_JOB_KEY_FILE is exported in ~/.profile or ~/.bash_profile on all sites, you will be able to connect directly from any frontend to any reserved node on any site.
oarsh vs ssh: tips and tricks
- 1st tip - hide oarsh, rename it ssh
Creating a symlink from ~/bin/ssh (assuming ~/bin is in your PATH) to /usr/bin/oarsh hides the use of the wrapper (as long as the OAR_JOB_ID or OAR_JOB_KEY_FILE environment variable is set when connecting from a frontend to a node).
- 2nd tip - using ssh directly, without oarsh
If using oarsh does not suit your needs, because you would like to use some ssh options that oarsh does not support, you can also connect to reserved nodes with the real ssh by adding the right set of options to the command. This also allows connecting to reserved nodes directly from a place where oarsh is not available (e.g. from outside Grid'5000):
Assuming you have a passphrase-less SSH key (preferably just for internal use in Grid'5000), you can tell oarsub to use that key as a job key instead of letting OAR generate a new one (see #sharing keys between jobs). Then you can use that key to connect to nodes, even from outside Grid'5000.
- Copy the key to your workstation, for instance outside of Grid5000:
- In Grid5000, submit a job using this key:
- Wait for the job to start. Then in another terminal, from outside Grid5000, try connecting to the node:
workstation $ ssh -i ~/your_internal_private_key_file -p 6667 [any other ssh options] oar@reserved-node.site.g5k
Finally, this can be hidden in an SSH ProxyCommand (see also SSH#Using_SSH_ProxyCommand_feature_to_ease_the_access_to_hosts_inside_Grid.275000):
After adding the following configuration to your OpenSSH configuration file on your workstation (~/.ssh/config):
Host *.g5koar
  ProxyCommand ssh g5k-username@access.grid5000.fr -W "$(basename %h .g5koar):%p"
  User oar
  Port 6667
  IdentityFile ~/your_internal_private_key_file
  ForwardAgent no
Warning: the ProxyCommand line works if your login shell is bash. If not, you may have to adapt it.
You can just ssh to a reserved node directly from your workstation as follows:
Passive and interactive job modes
Interactive mode
In interactive mode, a shell is opened on the first default resource (i.e. node) of the job (or on the frontend, if the job is of type deploy or cosystem). The job is terminated as soon as this shell is closed, or killed earlier if the job's walltime is reached. It can also be killed by an explicit oardel.
You can experiment with 3 shells. In the first shell, regularly run the following command to see the list of your own running jobs:
In the second shell, run an interactive job:
Wait for the job to start, run oarstat, then leave the job and run oarstat again. Submit another interactive job and, in the third shell, kill it:
Passive mode
In passive mode, the command that is given to oarsub is executed on the first default resource (i.e. node) of the job (or on the site's frontend if the job is of type deploy or cosystem). The job's duration is the shorter of the command's execution time and the job's walltime, unless the job is terminated beforehand by an explicit oardel call from the user or an administrator.
Jobs of type noop are a special case: they are always passive, and no command is executed for them. The duration of the job is the given walltime.
oardel can also be used to terminate a passive mode reservation. Note that it is only possible to remove the complete reservation, not individual nodes.
Interactive mode without shell
You may not want a job to open a shell or to run a script when the job starts, for example because you will use the reserved resources from a program whose lifecycle is longer than the job (and which will use the resources by connecting to the job).
One trick to achieve this is to run the job in passive mode with a long sleep command. One drawback of this method is that the job may terminate with an error status if the sleep is killed. This can be a problem in some situations, e.g. when using job dependencies.
Another solution is to use an advance reservation (see below) with a starting date very close in the future, or even with the current date and time.
Batch jobs vs. advance reservation jobs
- Batch jobs
If you do not specify the job's start date (oarsub -r option), then your job is a batch job. It lets OAR choose the best schedule (start date).
- With batch jobs, you're guaranteed to get the count of allocated resources you requested, because OAR chooses what resources to allocate to the job just before its start. If some resources suddenly become unavailable, OAR changes the assigned resources and/or the start date.
- Therefore, you cannot get the actual list of resources until the job starts (but a forecast is provided, such as what is shown in the Drawgantt diagrams).
- With batch jobs, you cannot know the start date of your job until it actually starts (any event can change the forecast). But OAR gives an estimation of the start date (such as shown in the Drawgantt diagram, which also changes after any event).
- Advance reservations
If you specify the job's start date, it is an advance reservation. OAR will just try to find resources for the given schedule, fixed by you.
- The Grid5000 usage policy allows no more than 2 advance reservations per site (excluding reservations that start in less than one hour)
- With advance reservation jobs, you're not guaranteed to get the count of resources you requested, because OAR planned the allocation of resources at the reservation time.
- If some resources have become unavailable when the job has to start, the job is delayed a bit in case the resources come back (e.g. return from standby).
- If, after 400 seconds, not all resources are available, the job starts with fewer resources than initially allocated. This is however quite unusual.
- The list of resources allocated to an advance reservation job is fixed and known as soon as the advance reservation is validated. But you only get the actual list of resources (that is, with unavailable resources removed from it) when the advance reservation starts.
- To coordinate the start date of OAR jobs on several sites, oargrid or funk use advance reservations.
Example: a reservation for a job in one week from now
$ oarsub -r "$(date +'%F %T' --date='+1 week')"
For advance reservations, there is no interactive mode. You can give OAR a command to execute or nothing. If you do not give a command, you'll have to connect to the jobs once the reservation starts (using oarsub -C <jobid> or oarsh).
Getting information about a job
The oarstat command gets job information. By default it lists the current jobs of all users. You can restrict it to your own jobs or someone else's jobs with option -u:
$ oarstat -u
You can get full details of a job:
$ oarstat -fj <JOBID>
If you script OAR and regularly poll job states with oarstat, you can cause a high load on the OAR server (because the default oarstat invocation triggers costly SQL requests in the OAR database). In this case, you should use option -s, which is optimized and only queries the current state of a given job:
$ oarstat -s -j <JOBID>
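A polling loop built on this lightweight call could look like the sketch below. It is only an illustration: wait_for_state is a hypothetical helper, the 30-second default delay is an arbitrary choice, and the parsing assumes that oarstat -s prints one line of the form "<JOBID>: <State>".

```shell
#!/bin/bash
# Poll the state of job $1 until it equals $2 (default: Running).
# $3 is the delay between polls in seconds, $4 the maximum number of polls.
# Returns 0 when the state is reached, 2 if the job ended first, 1 on timeout.
wait_for_state() {
    local job_id="$1" wanted="${2:-Running}" delay="${3:-30}" tries="${4:-120}" state
    while [ "$tries" -gt 0 ]; do
        # Assumed output format: "<JOBID>: <State>"; keep what follows the colon.
        state=$(oarstat -s -j "$job_id" | awk -F': *' 'NR == 1 { print $2 }')
        if [ "$state" = "$wanted" ]; then
            return 0
        fi
        # Stop early when the job can no longer reach the wanted state.
        case "$state" in
            Terminated|Error) return 2 ;;
        esac
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

For instance, wait_for_state <JOBID> Running 60 would poll once per minute, which keeps the load on the OAR server low.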
Complex resources selection
The complete selector format syntax (oarsub -l option) is:
"-l {sql1}/name1=n1/name2=n2+{sql2}/name3=n3/name4=n4/name5=n5+...,walltime=hh:mm:ss"
where
- sqlN are optional SQL predicates on the resource properties (e.g. mem, ib_rate, gpu_count, ...)
- nameN=n are the wanted number of given resources of name nameN (e.g. host, cpu, core, disk...).
- slashes (/) between resources express resource subtree selection
- + allows aggregating different resource specifications
- walltime=hh:mm:ss (separated by a comma) sets the job walltime (expected duration), which defaults to 1 hour
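To illustrate how the pieces of this format fit together, the sketch below assembles the argument of -l from several resource specifications. build_request is a hypothetical helper written for this page, not an OAR command; it only concatenates strings and does not validate them.

```shell
#!/bin/bash
# Join resource specifications with "+" and append the walltime last.
# Usage: build_request <walltime> <spec> [<spec>...]
build_request() {
    local walltime="$1" request
    shift
    # With IFS set to "+", "$*" joins the remaining arguments with "+".
    local IFS=+
    request="$*"
    printf '%s,walltime=%s\n' "$request" "$walltime"
}
```

The result can then be passed to oarsub, e.g. oarsub -I -l "$(build_request 4:0:0 /cluster=1/host=2/core=1 /switch=1/host=2/cpu=1)". Placing the walltime at the end of the string is deliberate, since it is set off from the resource specifications by a comma.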
- List resource properties
You can get the list of resource properties for SQL predicates by running the oarprint -l command on a node:
sagittaire-1 $ oarprint -l
List of properties: disktype, gpu_count, ...
You can get the property values set on resources using the oarnodes command:
flyon $ oarnodes -Y --sql="host = 'sagittaire-1.lyon.grid5000.fr'"
These OAR properties are described in the OAR Properties page.
Using the resources hierarchies
The OAR resources define implicit hierarchies to be used on the resource requests (oarsub -l). These hierarchies are specific to Grid'5000.
- For instance
- request 1 core on 15 hosts (nodes) on a same cluster (total = 15 cores)
$ oarsub -I -l /cluster=1/host=15/core=1
- request 1 core on 15 hosts (nodes) on 2 clusters (total = 30 cores)
$ oarsub -I -l /cluster=2/host=15/core=1
- request 1 core on 2 cpus on 15 hosts (nodes) on a same cluster (total = 30 cores)
$ oarsub -I -l /cluster=1/host=15/cpu=2/core=1
- request 10 cpus on 2 clusters (total = 20 cpus, the number of hosts and cores depends on the topology of the machines)
$ oarsub -I -l /cluster=2/cpu=10
- request 1 core on 3 different network switches (total = 3 cores)
$ oarsub -I -l /switch=3/core=1
- Examples for GPUs
- request 3 GPUs on 1 single host (node). Obviously, eligible nodes for the job need to have at least 3 GPUs.
$ oarsub -I -l host=1/gpu=3
- request 3 GPUs, possibly on different nodes depending on availability (other jobs, possible resources):
$ oarsub -I -l gpu=3
- request a full node (possibly featuring more than 3 GPUs) with at least 3 GPUs:
$ oarsub -p "gpu_count >= 3" -l host=1 [...]
- In the job, running oarprint as follows shows what GPUs are available in the job:
$ oarprint gpu -P host,gpudevice
(you may also look at nvidia-smi's output)
- Valid resource hierarchies are
- Compute and disk resources
- both switch > cluster, or cluster > switch can be valid (some clusters spread their hosts (nodes) on many switches, some clusters share a same switch), we note below cluster|switch to reflect that ambiguity.
- cluster|switch > chassis > host > cpu > gpu > core
- cluster|switch > chassis > host > disk
- vlan resources
- vlan only
- subnet resources
- slash16 > slash17 > slash18 > slash19 > slash20 > slash21 > slash22
Of course not all hierarchy levels have to be given in a resource request.
Selecting resources using properties
The properties of the resources are described in the OAR Properties page.
- Selecting nodes from a specific cluster
For example in Nancy:
$ oarsub -I -l {"cluster='graphene'"}/host=2
Or, alternative syntax:
$ oarsub -I -p "cluster='graphene'" -l /host=2
- Selecting nodes with a specific CPU architecture
For classical x86_64:
$ oarsub -I -p "cpuarch='x86_64'"
Other architectures are "exotic" so a specific type of job is needed:
$ oarsub -I -t exotic -p "cpuarch='ppc64le'"
- Selecting specific nodes
For example in Lyon:
$ oarsub -I -l {"host in ('sagittaire-10.lyon.grid5000.fr', 'sagittaire-11.lyon.grid5000.fr', 'sagittaire-12.lyon.grid5000.fr')"}/host=1
or, alternative syntax:
$ oarsub -I -p "host in ('sagittaire-10.lyon.grid5000.fr', 'sagittaire-11.lyon.grid5000.fr', 'sagittaire-12.lyon.grid5000.fr')" -l /nodes=1
By negating the SQL clause, you can also exclude some nodes.
- Other examples using properties
Ask for 10 cores of the cluster graphene
$ oarsub -I -l core=10 -p "cluster='graphene'"
Ask for 2 nodes with 16384 MB of memory and Infiniband 20G
$ oarsub -I -p "memnode='16384' and ib_rate='20'" -l host=2
Ask for any 4 nodes except graphene-12
$ oarsub -I -p "not host like 'graphene-12.%'" -l host=4
- Examples of joint resources requests
Ask for 2 nodes with virtualization capability, on different clusters + IP subnets:
- We want 2 nodes (hosts) and 4 /22 subnets with the following constraints:
- Nodes are on 2 different clusters of the same site (Hint: use a site with several clusters :-D)
- Nodes have virtualization capability enabled
- /22 subnets are on two different /19 subnets
- 2 subnets belonging to the same /19 subnet are consecutive
$ oarsub -I -l /slash_19=2/slash_22=2+{"virtual!='none'"}/cluster=2/host=1
Let's verify the reservation:
$ uniq $OAR_NODE_FILE
graphene-43.nancy.grid5000.fr
graphite-3.nancy.grid5000.fr
$ g5k-subnets -p
10.144.32.0/22
10.144.36.0/22
10.144.0.0/22
10.144.4.0/22
$ g5k-subnets -ps
10.144.0.0/21
10.144.32.0/21
Another example, ask for both
- 1 core on 2 hosts (nodes) on the same cluster with 16384 MB of memory and Infiniband 20G
- 1 cpu on 2 hosts (nodes) on the same switch with 8 cores processors for a walltime of 4 hours
$ oarsub -I -l "{memnode=16384 and ib_rate='20'}/cluster=1/host=2/core=1+{cpucore=8}/switch=1/host=2/cpu=1,walltime=4:0:0"
Walltime must always be the last argument of -l <...>
Handling the resources allocated to my job with oarprint
The oarprint command prints the resources of a job in a readable form.
We first submit a job:
$ oarsub -I -l host=4
...
OAR_JOB_ID=178361
- Retrieve the nodes list
We want the list of the nodes (hosts) we got, identified by unique hostnames
$ oarprint host
sagittaire-32.lyon.grid5000.fr
capricorne-34.lyon.grid5000.fr
sagittaire-63.lyon.grid5000.fr
sagittaire-28.lyon.grid5000.fr
(We get 1 line per host, not per core!)
- Retrieve the core list
$ oarprint core
63
241
64
163
243
244
164
242
Obviously, retrieving OAR internal core Id might not help much. Hence the use of a customized output format below.
- Retrieve core list with host and cpuset Id as identifier
We want to identify our cores by their associated host names and cpuset Ids:
$ oarprint core -P host,cpuset
capricorne-34.lyon.grid5000.fr 0
sagittaire-32.lyon.grid5000.fr 0
capricorne-34.lyon.grid5000.fr 1
sagittaire-28.lyon.grid5000.fr 0
sagittaire-63.lyon.grid5000.fr 0
sagittaire-63.lyon.grid5000.fr 1
sagittaire-28.lyon.grid5000.fr 1
sagittaire-32.lyon.grid5000.fr 1
- A more complex example with a customized output format
We want to identify our cores by their associated host name and cpuset Id, and get the memory information as well, with a customized output format
$ oarprint core -P host,cpuset,memnode -F "NODE=%[%] MEM=%"
NODE=capricorne-34.lyon.grid5000.fr[0] MEM=2048
NODE=sagittaire-32.lyon.grid5000.fr[0] MEM=2048
NODE=capricorne-34.lyon.grid5000.fr[1] MEM=2048
NODE=sagittaire-28.lyon.grid5000.fr[0] MEM=2048
NODE=sagittaire-63.lyon.grid5000.fr[0] MEM=2048
NODE=sagittaire-63.lyon.grid5000.fr[1] MEM=2048
NODE=sagittaire-28.lyon.grid5000.fr[1] MEM=2048
NODE=sagittaire-32.lyon.grid5000.fr[1] MEM=2048
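Because oarprint emits one line per resource, its output is easy to post-process with standard tools. As a sketch, the hypothetical cores_per_host helper below counts the reserved cores per host; it assumes its input is the output of oarprint core -P host,cpuset, i.e. a host name followed by a cpuset Id on each line.

```shell
#!/bin/bash
# Count reserved cores per host from "host cpuset" lines read on standard input.
# Prints "host count" lines, sorted by host name.
cores_per_host() {
    awk '{ count[$1]++ } END { for (h in count) print h, count[h] }' | LC_ALL=C sort
}
```

In a job shell it would be used as oarprint core -P host,cpuset | cores_per_host, for instance to decide how many processes to start on each node.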
- From the submission frontend
If you are not in a job shell ($OAR_RESOURCE_PROPERTIES_FILE is not defined), running oarprint will give:
$ oarprint
/usr/bin/oarprint: no input data available
In that case, you can however pipe the output of the oarstat command into oarprint, e.g.:
$ oarstat -j <JOB_ID> -p | oarprint core -P host,cpuset,memnode -F "%[%] (%)" -f -
capricorne-34.lyon.grid5000.fr[0] (2048)
sagittaire-32.lyon.grid5000.fr[0] (2048)
capricorne-34.lyon.grid5000.fr[1] (2048)
sagittaire-28.lyon.grid5000.fr[0] (2048)
sagittaire-63.lyon.grid5000.fr[0] (2048)
sagittaire-63.lyon.grid5000.fr[1] (2048)
sagittaire-28.lyon.grid5000.fr[1] (2048)
sagittaire-32.lyon.grid5000.fr[1] (2048)
- List the OAR properties to use with oarprint
Properties are described in the OAR Properties page, but they can also be listed using the oarprint -l command:
$ oarprint -l
List of properties: disktype, gpu_count, ...
X11 forwarding
X11 forwarding is enabled in the shell opened in an interactive job (oarsub -I).
X11 forwarding can also be enabled in a shell opened on a node of a job with oarsh, just like with a classic ssh command: the -X or -Y option must be passed to oarsh.
Note | |
---|---|
Please mind that for X11 forwarding to work in the job, X11 forwarding must already work in the shell from which the OAR commands are run. Check the |
We will use xterm to test X11.
- Enabling X11 forwarding up to the frontend
Connect to a frontend with ssh (reminder: read the Getting Started tutorial about the use of the ssh ProxyCommand), and make sure the X11 forwarding is operational so far:
Look at the DISPLAY environment variable, which ssh should have set to localhost:10.0 or the like (the 10.0 part may vary from hop to hop in the X11 forwarding chain, with numbers greater than 10).
This requires using the -X or -Y option on the ssh command line, or having ForwardX11=yes set in your SSH configuration.
In any case, check:
localhost:11.0
- Using X11 forwarding in the oarsub job shell
If the DISPLAY environment variable is set in the calling shell, oarsub will automatically enable the X11 forwarding. The verbose oarsub option (-v) is required to see the "Initialize X11 forwarding..." message.
# Set default walltime to 3600.
# Computed global resource filter: -p "maintenance = 'NO'"
# Computed resource request: -l {"type = 'default'"}/core=1
# Generate a job key...
OAR_JOB_ID=4926
# Interactive mode: waiting...
# Starting...
# Initialize X11 forwarding...
# Connect to OAR job 4926 via node idpot-8.grenoble.grid5000.fr
Then from the shell of the job, check again the display:
jdoe@idpot-8:~$ echo $DISPLAY
localhost:10.0
And run xterm:
jdoe@idpot-8:~$ xterm
Wait for the window to open: it may be pretty long!
- Using X11 forwarding in a job via oarsh
With oarsh, the -X or -Y option must be used to enable the X11 forwarding:
Then in the opened shell, you can again check that the DISPLAY is set, and run xterm.
You can also just run the xterm command directly in the oarsh call:
- Using X11 forwarding in a job with a deployed environment
When an interactive job is used to deploy an environment, the spawned shell will not contain the DISPLAY environment variable, even if it was forwarded in the user connection shell.
To use X11 forwarding in this situation, you can open a new (X11 forwarded) shell on the frontend, and then connect to the node using again X11 forwarding.
You can also connect directly to the node from your laptop either by:
- using the Grid'5000 VPN
- following the recommendations about a better usage of ssh listed in the Getting Started document.
Note | |
---|---|
X11 forwarding will suffer from the latency between your local network and the Grid'5000 network.
|
Using best effort mode jobs
Best effort job campaign
OAR 2 provides a way to specify that jobs are best effort, which means that the server can delete them if room is needed to fit other jobs. One can submit such jobs using the besteffort type of job.
For instance you can run a job campaign as follows:
for param in $(< ./paramlist); do
    oarsub -t besteffort -l core=1 "./my_script.sh $param"
done
In this example, the file ./paramlist contains a list of parameters for a parametric application.
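The loop above splits the parameter file on any whitespace, so each word becomes one job. When a parameter set spans a whole line, a line-oriented variant may be more convenient. The sketch below is one possible way to write it: submit_campaign and the DRY_RUN variable are hypothetical names introduced here, and the oarsub options are the ones used in the example above.

```shell
#!/bin/bash
# Submit one besteffort job per non-empty line of the parameter file $1.
# With DRY_RUN=1, only print the commands instead of submitting them.
submit_campaign() {
    local paramfile="$1" param
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r param || [ -n "$param" ]; do
        [ -z "$param" ] && continue   # skip empty lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf 'oarsub -t besteffort -l core=1 "./my_script.sh %s"\n' "$param"
        else
            # Redirect stdin so that oarsub cannot consume the parameter file.
            oarsub -t besteffort -l core=1 "./my_script.sh $param" < /dev/null
        fi
    done < "$paramfile"
}
```

Running DRY_RUN=1 submit_campaign ./paramlist first makes it possible to review the commands before the campaign is actually submitted.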
The following demonstrates the mechanism.
Note | |
---|---|
Please have a look at the UsagePolicy to avoid abuses. |
Best effort job mechanism
- Running a besteffort job in a first shell
frennes:~$ oarsub -I -l host=10 -t besteffort
# Set default walltime to 3600.
OAR_JOB_ID=988535
# Interactive mode: waiting...
# Starting...
parasilo-26:~$ uniq $OAR_FILE_NODES
parasilo-26.rennes.grid5000.fr
parasilo-27.rennes.grid5000.fr
parasilo-28.rennes.grid5000.fr
parasilo-3.rennes.grid5000.fr
parasilo-4.rennes.grid5000.fr
parasilo-5.rennes.grid5000.fr
parasilo-6.rennes.grid5000.fr
parasilo-7.rennes.grid5000.fr
parasilo-8.rennes.grid5000.fr
parasilo-9.rennes.grid5000.fr
- Running a non best effort job on the same set of resources in a second shell
frennes:~$ oarsub -I -l {"host in ('parasilo-9.rennes.grid5000.fr')"}/host=1
# Set default walltime to 3600.
OAR_JOB_ID=988546
# Interactive mode: waiting...
# [2022-01-10 16:00:07] Start prediction: 2022-01-10 16:00:07 (FIFO scheduling OK)
# Starting...
Connect to OAR job 988546 via the node parasilo-9.rennes.grid5000.fr
As expected, meanwhile the best effort job was stopped (watch the first shell):
parasilo-26:~$ Connection to parasilo-26.rennes.grid5000.fr closed by remote host.
Connection to parasilo-26.rennes.grid5000.fr closed.
# Error: job was terminated.
Disconnected from OAR job 988545
Using the checkpointing trigger mechanism
- Writing the test script
Here is a script which features an infinite loop and a signal handler triggered by SIGUSR2 (default signal for OAR's checkpointing mechanism).
#!/bin/bash

handler() { echo "Caught checkpoint signal at: `date`"; echo "Terminating."; exit 0; }
trap handler SIGUSR2

cat <<EOF
Hostname: `hostname`
Pid: $$
Starting job at: `date`
EOF
while : ; do sleep 10; done
- Running the job
We run the job on 1 core, and a walltime of 5 minutes, and ask the job to be checkpointed if it lasts (and it will indeed) more than walltime - 150 sec = 2 min 30.
$ oarsub -v -l "core=1,walltime=0:05:00" --checkpoint 150 ./checkpoint.sh
# Modify resource description with type constraints
OAR_JOB_ID=988555
$
- Result
Taking a look at the job output:
$ cat OAR.988555.stdout
Hostname: parasilo-9.rennes.grid5000.fr
Pid: 12013
Starting job at: Mon Jan 15 14:05:50 CET 2018
Caught checkpoint signal at: Mon Jan 15 14:08:30 CET 2018
Terminating.
The checkpointing signal was sent to the job 2 minutes 30 before the walltime as expected so that the job can finish nicely.
- Interactive checkpointing
The oardel command provides the capability to raise a checkpoint event interactively to a job.
We submit the job again:
$ oarsub -v -l "core=1,walltime=0:05:0" --checkpoint 150 ./checkpoint.sh
# Modify resource description with type constraints
OAR_JOB_ID=988560
Then run the oardel -c #jobid command...
$ oardel -c 988560
Checkpointing the job 988560 ...DONE.
The job 988560 was notified to checkpoint itself (send SIGUSR2).
And then watch the job's output:
$ cat OAR.988560.stdout
Hostname: parasilo-4.rennes.grid5000.fr
Pid: 11612
Starting job at: Mon Jan 15 14:17:25 CET 2018
Caught checkpoint signal at: Mon Jan 15 14:17:35 CET 2018
Terminating.
The job terminated as expected.
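The test script only prints a message when it is checkpointed. A real job would typically save its progress so that a later job can resume from it. The sketch below shows one possible structure, under stated assumptions: the "work" is a plain counter, STATE_FILE is a hypothetical path, and load_state, save_state and run_job are names introduced for this example. It also sleeps in the background and waits on it, because bash runs a trap handler only once the current foreground command has returned, whereas the wait builtin returns as soon as a trapped signal arrives.

```shell
#!/bin/bash
# Hypothetical state file; any path that survives the job (e.g. in the home directory) would do.
STATE_FILE="${STATE_FILE:-./checkpoint.state}"

# Print the last saved iteration, or 0 when no checkpoint exists yet.
load_state() {
    if [ -f "$STATE_FILE" ]; then cat "$STATE_FILE"; else echo 0; fi
}

# Save the iteration atomically: write to a temporary file, then rename it.
save_state() {
    printf '%s\n' "$1" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Run up to $1 iterations, resuming from the saved state; stop cleanly on SIGUSR2.
run_job() {
    local max="$1" i
    i=$(load_state)
    trap 'save_state "$i"; echo "Checkpointed at iteration $i"; exit 0' USR2
    while [ "$i" -lt "$max" ]; do
        # Stand-in for one unit of real work, interruptible by the checkpoint signal.
        sleep 1 &
        wait $!
        i=$((i + 1))
    done
    save_state "$i"
}

# A job script would end with a call such as: run_job 100000
```

Submitted with the --checkpoint option as above, such a script would record its progress when SIGUSR2 arrives, and a resubmitted job would continue from the saved iteration.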
Using jobs dependency
A job can wait for the termination of a previous job.
- First Job
We run a first interactive job in a first Shell
frennes:~$ oarsub -I
# Set default walltime to 3600.
OAR_JOB_ID=988571
# Interactive mode: waiting...
# Starting...
parasilo-28:~$
And leave that job pending.
- Second Job
Then we run a second job in another Shell, with a dependence on the first one
jdoe@idpot:~$ oarsub -I -a 988571
# Set default walltime to 3600.
OAR_JOB_ID=2071596
# Interactive mode: waiting...
# [2018-01-15 14:27:08] Start prediction: 2018-01-15 15:30:23 (FIFO scheduling OK)
- Job dependency in action
We do a logout on the first interactive job...
parasilo-28:~$ logout
Connection to parasilo-28.rennes.grid5000.fr closed.
Disconnected from OAR job 988571
... then watch the second Shell and see the second job starting
# [2018-01-15 14:27:08] Start prediction: 2018-01-15 15:30:23 (FIFO scheduling OK)
# Starting...
parasilo-3:~$
Container jobs
With the container job functionality, OAR allows for someone to execute inner jobs within the boundaries of the container job. Inner jobs are scheduled using the same algorithm as other jobs, but restricted to the container job's resources and timespan.
A typical use case is to first submit a container job, then have inner jobs submitted that refer to the container's job id.
Mind that inner jobs that do not fit within the container's boundaries stay in the waiting state in the queue, neither scheduled nor executed. They are deleted when the container job terminates.
Container jobs are especially useful when organizing tutorials or teaching labs, with the container job created by the organizer and inner jobs created by the attendees.
Mind that if, in your use case, all inner jobs are to be created by the same user as the container job, it is preferable to use a tool such as GNU Parallel.
Inner jobs are killed when the container job is terminated.
- First a job of the type container must be submitted
... OAR_JOB_ID=42 ...
- Then it is possible to use the inner type to schedule the new jobs within the previously created container job
Note | |
---|---|
A job created with: will never be scheduled because the container job "42" only reserved 10 nodes. |
cosystem and noop jobs
- cosystem
- Jobs of type cosystem, just like jobs of type deploy, do not execute on the first node assigned to the job but on the frontend.
- But unlike deploy jobs, cosystem jobs do not grant any special privileges (e.g. no kareboot right).
- noop
- Jobs of type noop do not execute anything at all. They just allocate resources for a time frame.
- noop jobs cannot be interactive (oarsub -I).
- noop jobs have the advantage over the cosystem job that they are not affected by a reboot (e.g. due to a maintenance or a failure) of the frontend.
If running a script on the frontend is not required, noop jobs should probably be preferred over cosystem jobs.
Changing the walltime of a running job (oarwalltime)
Users can request an extension of the walltime (duration of the resource reservation) of a running job. This can be achieved using the oarwalltime command or Grid'5000's API.
The change can be specified by giving either a new walltime value or an increase (beginning with +).
Please note that a request may stay partially or completely unsatisfied if another job is already scheduled to occupy the resources right after the running job.
A job must be running for its walltime to be changed. For a waiting job, delete it and resubmit.
Warning | |
---|---|
While changes of walltime are not limited a priori (by the |
Command line interface
Querying the walltime change status:
Walltime change status for job 1743185 (job is running): Current walltime: 1:0:0 Possible increase: UNLIMITED Already granted: 0:0:0 Pending/unsatisfied: 0:0:0
Requesting the walltime change:
Accepted: walltime change request updated for job 1743185, it will be handled shortly.
Querying right afterward:
Walltime change status for job 1743185 (job is running):
  Current walltime: 1:0:0
  Possible increase: UNLIMITED
  Already granted: 0:0:0
  Pending/unsatisfied: +1:30:0
The request is still to be handled by OAR's scheduler.
Querying again a bit later:
Walltime change status for job 1743185 (job is running):
  Current walltime: 2:30:0
  Possible increase: UNLIMITED
  Already granted: +1:30:0
  Pending/unsatisfied: 0:0:0
Should another job already be scheduled on the resources and partially prevent the walltime increase, the query output would be:
Walltime change status for job 1743185 (job is running):
  Current walltime: 2:30:0
  Possible increase: UNLIMITED
  Already granted: +1:10:0
  Pending/unsatisfied: +0:20:0
Walltime change events are also reported in oarstat.
See man oarwalltime for more information.
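The arithmetic behind the status output above can be checked by hand. The following is a minimal sketch, assuming walltimes are written as H:M:S; the function names are hypothetical helpers and are not part of OAR. It only illustrates how a new value or a "+" increase translates into the resulting walltime, and it does not call oarwalltime.

```shell
# Hypothetical helpers (not part of OAR) for H:M:S walltime arithmetic.

# Convert "H:M:S" (or "H:M", or "H") to seconds.
walltime_to_seconds() {
    (   # subshell: IFS and positional parameters do not leak to the caller
        IFS=:
        set -f
        set -- $1
        h=${1:-0}; m=${2:-0}; s=${3:-0}
        # Drop one leading zero so that "08" is not read as an octal number.
        h=${h#0}; m=${m#0}; s=${s#0}
        echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
    )
}

# Convert seconds back to the H:M:S form shown in the status output.
seconds_to_walltime() {
    echo "$(( $1 / 3600 )):$(( $1 % 3600 / 60 )):$(( $1 % 60 ))"
}

# Apply a change to a current walltime:
#   "+H:M:S" is an increase, anything else is taken as the new walltime value.
apply_walltime_change() {
    case $2 in
        +*) seconds_to_walltime $(( $(walltime_to_seconds "$1") + $(walltime_to_seconds "${2#+}") )) ;;
        *)  seconds_to_walltime "$(walltime_to_seconds "$2")" ;;
    esac
}

# Same figures as in the example above: 1:0:0 increased by +1:30:0.
apply_walltime_change 1:0:0 +1:30:0   # prints 2:30:0
```

In the partially satisfied case shown above, the granted part (+1:10:0) and the pending part (+0:20:0) add up to the requested +1:30:0.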
Using the REST API
Requesting the walltime change:
curl -i -X POST https://api.grid5000.fr/3.0/sites/grenoble/internal/oarapi/jobs/1743185.json -H'Content-Type: application/json' -d '{"method":"walltime-change", "walltime":"+0:30:0"}'
Querying the status of the walltime change:
curl -i -X GET https://api.grid5000.fr/3.0/sites/grenoble/internal/oarapi/jobs/1743185/details.json -H'Content-Type: application/json'
See the walltime-change and events keys of the output.
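When the same request has to be issued for several sites or jobs, the URL and JSON body can be assembled from variables. The following is a minimal sketch: the two function names are hypothetical, while the URL layout and the body are copied from the curl calls above. The request is only printed, not sent, so that it can be reviewed first.

```shell
# Hypothetical helpers: build the URL and JSON body used by the curl calls above.

# usage: oarapi_job_url SITE JOB_ID
oarapi_job_url() {
    printf 'https://api.grid5000.fr/3.0/sites/%s/internal/oarapi/jobs/%s' "$1" "$2"
}

# usage: walltime_change_body CHANGE   (e.g. +0:30:0)
walltime_change_body() {
    printf '{"method":"walltime-change", "walltime":"%s"}' "$1"
}

# Dry run: print the command line instead of executing it.
echo curl -i -X POST "$(oarapi_job_url grenoble 1743185).json" \
     -H "'Content-Type: application/json'" \
     -d "'$(walltime_change_body +0:30:0)'"
```

Removing the leading echo (and the extra inner quotes) would send the request; authentication and error handling are left out of this sketch.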
Restricting jobs to daytime or night/week-end time
To help submit batch jobs that fit inside the time frames defined in the usage policy (day vs. night and week-end), the types day and night can be used (oarsub -t <type>…).
- Submit a job to run during the current day time
As such:
- It will be forced to run between 9:00 and 19:00, or the next day if the job is submitted during the night.
- If the job could not run before 19:00, it will be deleted.
- Submit a job to run during the coming (or current) night (or week-end on Friday)
As such:
- It will be forced to run after 19:00, and before 9:00 for week nights (Monday to Thursday nights), or before 9:00 on the next Monday for a job which runs during a week-end.
- If a job cannot be scheduled during the current night (not enough resources available), it is kept in the queue and postponed in the morning for a retry the next night (its hour constraints are changed to the next night slot). This is repeated for up to 7 days.
- If the walltime of the job is more than 13h59, the job will not run before a week-end.
- Submit a job to run exclusively during the coming (or current) night (or week-end on Friday)
If the job is not scheduled and run during the coming (or current) night (or week-end on Friday), it will not be postponed to the next night for a new try, but simply set to the error state.
Note that:
- the maximum walltime for a night is 14h, but due to some overhead in the system (resources state changes, reboots...), it is strongly advised to limit the walltime to at most 13h30. Furthermore, a shorter walltime (a few hours at most) gives a job more chances to be scheduled when many jobs are already in the queue.
- jobs with a walltime greater than 14h will be required to run during week-ends. But even if submitted at the beginning of the week, they will not be scheduled before Friday morning. Thus, any advance reservation done before Friday will take precedence. Also, given that the rescheduling happens on a daily basis for the next night, advance reservations take precedence if they are submitted before the daily rescheduling. In practice, this mechanism thus provides a low priority way to submit batch jobs during nights and week-ends.
- a job will be kept 7 days before deletion (if it cannot be run because of lack of resources within a week), unless using night=noretry
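The walltime limits listed in the notes above can be turned into a quick check before submitting a night job. The following is a minimal sketch, assuming walltimes written as H:M:S; the function name is hypothetical, and the 14h and 13h30 thresholds are simply taken from the text above rather than queried from OAR.

```shell
# Hypothetical helper: classify a walltime against the night limits stated above.
# usage: night_slot_for H:M:S
night_slot_for() {
    (   # subshell: IFS and positional parameters do not leak to the caller
        IFS=:
        set -f
        set -- $1
        h=${1:-0}; m=${2:-0}; s=${3:-0}
        # Drop one leading zero so that "08" is not read as an octal number.
        h=${h#0}; m=${m#0}; s=${s#0}
        total=$(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))

        max_night=$(( 14 * 3600 ))            # maximum walltime for a night
        advised=$(( 13 * 3600 + 30 * 60 ))    # advised limit (system overhead)

        if [ "$total" -gt "$max_night" ]; then
            echo "week-end only"
        elif [ "$total" -gt "$advised" ]; then
            echo "week night (above the advised 13h30)"
        else
            echo "week night"
        fi
    )
}

night_slot_for 2:0:0
night_slot_for 13:45:0
night_slot_for 20:0:0
```

A job classified as "week-end only" will not be scheduled before Friday morning, as explained above.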
Multi-site jobs with OARGrid
oargrid allows submitting OAR jobs to several Grid'5000 sites at once.
For instance, we are going to reserve 4 nodes on 3 different sites for half an hour:
Note that in grid reservation mode, no script can be specified. Users are responsible for:
- connecting to the allocated nodes.
- launching their experiment.
OAR Grid connects to each of the specified clusters and makes a passive submission. Cluster job ids are returned by OAR. A grid job id is returned by OAR Grid to bind the cluster job ids together.
You should see an output like this:
SITE1:rdef=/nodes=2,SITE2:rdef=/nodes=1,SITE3:rdef=nodes=1
[OAR_GRIDSUB] [SITE3] Date/TZ adjustment: 0 seconds
[OAR_GRIDSUB] [SITE3] Reservation success on SITE3 : batchId = SITE_JOB_ID3
[OAR_GRIDSUB] [SITE2] Date/TZ adjustment: 1 seconds
[OAR_GRIDSUB] [SITE2] Reservation success on SITE2 : batchId = SITE_JOB_ID2
[OAR_GRIDSUB] [SITE1] Date/TZ adjustment: 0 seconds
[OAR_GRIDSUB] [SITE1] Reservation success on SITE1 : batchId = SITE_JOB_ID1
[OAR_GRIDSUB] Grid reservation id = GRID_JOB_ID
[OAR_GRIDSUB] SSH KEY : /tmp/oargrid//oargrid_ssh_key_LOGIN_GRID_JOB_ID
You can use this key to connect directly to your OAR nodes with the oar user.
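The per-site job ids and the grid job id are needed in the following steps, so it can be handy to extract them from a saved copy of this output. The following is a minimal sketch, assuming the output contains one [OAR_GRIDSUB] message per line as in a terminal; the function names are hypothetical and the sample text reuses the placeholders shown above rather than real ids.

```shell
# Sample text modelled on the output above (placeholders kept as-is).
sample_output='[OAR_GRIDSUB] [SITE3] Date/TZ adjustment: 0 seconds
[OAR_GRIDSUB] [SITE3] Reservation success on SITE3 : batchId = SITE_JOB_ID3
[OAR_GRIDSUB] [SITE1] Date/TZ adjustment: 0 seconds
[OAR_GRIDSUB] [SITE1] Reservation success on SITE1 : batchId = SITE_JOB_ID1
[OAR_GRIDSUB] Grid reservation id = GRID_JOB_ID
[OAR_GRIDSUB] SSH KEY : /tmp/oargrid//oargrid_ssh_key_LOGIN_GRID_JOB_ID'

# Hypothetical helper: print "SITE JOB_ID" for each successful site reservation.
# Reads the saved output on standard input.
site_job_ids() {
    sed -n 's/^\[OAR_GRIDSUB\] \[\([^]]*\)\] Reservation success on.*batchId *= *\(.*\)$/\1 \2/p'
}

# Hypothetical helper: print the grid job id.
grid_job_id() {
    sed -n 's/^\[OAR_GRIDSUB\] Grid reservation id *= *\(.*\)$/\1/p'
}

printf '%s\n' "$sample_output" | site_job_ids
printf '%s\n' "$sample_output" | grid_job_id
```

The patterns tolerate missing spaces around the "=" sign, since the exact spacing of the real output may differ from this sample.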
Fetch the allocated nodes list to transmit it to the script we want to run:
Note | |
---|---|
The
|
(1) Select the node on which to launch the script (i.e. the first node listed in the ~/machines file).
If (and only if) this node does not belong to the site where the ~/machines file was saved, copy the ~/machines file to this node:
frontend :
|
OAR_JOB_ID=SITE_JOB_ID oarcp -i /tmp/oargrid/oargrid_ssh_key_LOGIN_GRID_JOB_ID ~/machines `head -n 1 machines`: |
(2) Connect to this node using oarsh:
frontend :
|
OAR_JOB_ID=SITE_JOB_ID oarsh -i /tmp/oargrid/oargrid_ssh_key_LOGIN_GRID_JOB_ID `head -n 1 machines` |
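Steps (1) and (2) rely on the same key path and target node, and the copy is only needed when the node belongs to another site. The following is a minimal sketch of these pieces: the function names are hypothetical, the key path layout is copied from the commands above, and node names are assumed to have the form NODE.SITE.grid5000.fr. The node names in the demo file are made up.

```shell
# Hypothetical helpers around the oarcp/oarsh commands above.

# usage: oargrid_key_path LOGIN GRID_JOB_ID
oargrid_key_path() {
    printf '/tmp/oargrid/oargrid_ssh_key_%s_%s' "$1" "$2"
}

# usage: first_node MACHINES_FILE
first_node() {
    head -n 1 "$1"
}

# usage: node_site NODE   (assumes NODE.SITE.grid5000.fr)
node_site() {
    printf '%s\n' "$1" | cut -d. -f2
}

# Demo on a temporary machines file with made-up node names.
machines=$(mktemp)
printf '%s\n' node-1.site1.grid5000.fr node-2.site2.grid5000.fr > "$machines"

node=$(first_node "$machines")
echo "key:  $(oargrid_key_path LOGIN GRID_JOB_ID)"
echo "node: $node (site: $(node_site "$node"))"

rm -f "$machines"
```

Comparing the result of node_site with the site where ~/machines was saved tells whether the oarcp step can be skipped.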
And then run the script:
The Grid counterpart of oarstat gives information about the grid job:
Our grid submission is interactive, so its end time is unrelated to the end time of our script run. The submission ends when the submission owner requests that it ends or when the submission deadline is reached.
We are going to ask for our submission to end:
Funk
funk is a grid resource discovery tool that works at the node level and generates complex oarsub/oargridsub commands. It can help you in three cases:
- to know the number of nodes available for 2 hours at run time, on sites lille, rennes and on clusters taurus and suno
- to know when 40 nodes on sagittaire and 4 nodes on taurus will be available, with deploy job type and a subnet
- to find the time when the maximum number of nodes are available during 10 hours, before next week deadline, avoiding usage policy periods, and not using genepi
More information on its dedicated page.
OAR in the Grid'5000 API
Another way to visualize nodes/jobs status is to use the Grid'5000 API