HPC and HTC tutorial

From Grid5000
Revision as of 14:14, 7 March 2024 by Pneyron (talk | contribs)

Grid'5000 gives easy access to a wide variety of hardware technologies and is particularly suitable for carrying out HPC (high performance computing) experiments: users can investigate parallel algorithms, scalability problems or performance portability on Grid'5000. The first intent of Grid'5000 is to be a testbed for experiment-driven research in all areas of computer science with a focus on parallel and distributed computing including Cloud, HPC and Big Data.

However, Grid'5000 offers such a large amount of resources that its idle resources can be used for more production-oriented workloads (where the goal is just to obtain results faster, regardless of the method used).

Those include HTC (High-throughput computing) projects requiring the execution of a large number of loosely-coupled tasks (also called an embarrassingly parallel workload).

Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations of HPC centers.

In particular, Grid'5000 has a job scheduling policy that allows advance reservation of resources, which is useful for setting up an experiment on your own schedule and performing interactive tasks on the reserved resources.

You can also reinstall cluster nodes and gain root access for the duration of your jobs using Kadeploy. This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU) or simply ensure the reproducibility of your experiments by freezing their context.

In addition, Grid'5000 provides a set of tools for monitoring experiments.


Resources available on Grid'5000

The easiest way to get the global picture of the HPC systems available on Grid'5000 is to consult the Hardware page.

This page is built using the Grid'5000 Reference API and describes in detail the CPU models, network interfaces and accelerators of each cluster.

You can also use the API Quick Start page as it provides advanced filters for selecting nodes by hardware capability. Alternatively, you can parse the Grid'5000 Reference API yourself to discover the available resources on each site.
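As a minimal sketch of querying the Reference API yourself, the helpers below build the URL that lists the nodes of one cluster and fetch it. The URL layout (`https://api.grid5000.fr/stable/sites/<site>/clusters/<cluster>/nodes`) and the helper names are assumptions that are not stated on this page; check them against the API documentation. The `pyxis` cluster in Lyon is used only because it appears later in this tutorial.

```shell
# Assumed base URL of the Grid'5000 Reference API (not given on this page).
G5K_API="https://api.grid5000.fr/stable"

# Hypothetical helper: print the URL describing the nodes of a cluster.
# $1 = site name, $2 = cluster name
nodes_url() {
  printf '%s/sites/%s/clusters/%s/nodes\n' "$G5K_API" "$1" "$2"
}

# Hypothetical helper: fetch the JSON description of the nodes of a cluster.
# Meant to be called from a Grid'5000 frontend; requires curl.
fetch_nodes() {
  curl -s "$(nodes_url "$1" "$2")"
}

# Show the URL that would be queried for the pyxis cluster in Lyon.
nodes_url lyon pyxis
```

Calling `fetch_nodes lyon pyxis` from a frontend would then return the JSON document to filter with the tool of your choice.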

Resource reservation

Resource reservation using the OAR scheduler is covered by the Getting Started tutorial and, in more depth, by the Advanced OAR document.

You can select specific hardware by using the "-p" (properties) option of the oarsub command.

The OAR properties available on each site are listed on that site's Monika status page (links in the Status page). For instance, see the Monika job status page for Nancy.

You can combine OAR properties or even use SQL queries for advanced filtering.

Here is a non-exhaustive list of OAR properties for HPC experiments:

  • CPU: cpuarch, cpucore, cpufreq, cputype
  • Memory (RAM in MB): memnode (memory per node), memcpu (per cpu), memcore (per core)
  • Network:
    • eth_count (number of ethernet interfaces), eth_rate (rate of the fastest ethernet interface available on the node)
    • ib_count (number of InfiniBand interfaces), ib = {'NO', 'SDR', 'DDR', 'QDR', 'FDR'} (the InfiniBand technology available), ib_rate = {0, 10, 20, 40, 56} (max rate in Gbit/s)
  • Accelerator: gpu_model (model of GPU), gpu_count (number of GPUs per node), mic (YES/NO)

For example, you can make a reservation at Lyon for a GPU node using:

Terminal.png flyon:
oarsub -I -p "gpu_count > 0"

Or get a node with more than 256 GB of RAM at Nancy:

Terminal.png fnancy:
oarsub -I -p "memnode > 256000"
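The two filters above can be combined into a single request. The following is a minimal sketch, assuming the SQL-style `AND` operator that OAR itself uses in its computed resource filters; `join_props` is a hypothetical helper that only assembles the filter string, and the command is printed rather than submitted so that it can be checked first.

```shell
# Hypothetical helper: join OAR property conditions with AND.
join_props() {
  filter=""
  for cond in "$@"; do
    if [ -n "$filter" ]; then
      filter="$filter AND $cond"
    else
      filter="$cond"
    fi
  done
  printf '%s\n' "$filter"
}

# A GPU node that also has more than 256000 MB of RAM.
FILTER=$(join_props "gpu_count > 0" "memnode > 256000")

# Print the command instead of running it.
printf 'oarsub -I -p "%s"\n' "$FILTER"
```

Whether a given site has nodes matching the combined filter depends on its hardware; the Monika status page of the site lists the property values actually available.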

Some resources are available only in a specific queue (the production queue in Nancy, whose access is reserved to members of a Gold level Group Granting Access). Others are usable only if the user explicitly requests them, acknowledging that they are "special" (exotic) and that their use is intentional (e.g. ARM cores).

Example to get a reservation for a host with GPU in Nancy (available in production queue only):

Terminal.png fnancy:
oarsub -I -p "gpu_count > 0" -q production

Or to get a reservation on our ARM cluster in Lyon:

Terminal.png flyon:
oarsub -I -p "cluster='pyxis'" -t exotic

Reservations are subject to the rules described in the Usage Policy document. Computational quota on Grid'5000 differs from what is traditionally offered in HPC centers. The idea for the default queue (available on all sites) is to keep the resources usable by many different people during weekdays, to prepare experiments or run limited experiments (no more than the equivalent of 2 hours on all the cores of a cluster), and to use nights and weekends for out-of-quota experiments.

Admittedly, a weekend may seem short when it comes to HPC. For longer jobs, Grid'5000 offers the following solutions:

  • use, if granted, the production queue: out of daily quota but limited to one week of computation, and not on a full cluster (e.g. 2/3 of a 64-node cluster)
  • use the so-called best-effort mode to submit jobs, available on every site, in every queue, but which requires that users adapt their experiment to this mode of computation (more on this below)

Different ways to use Grid'5000 as an HPC production system

Oarsub

A simple way to submit a job on Grid'5000 is to use the oarsub command:

Terminal.png grenoble:
oarsub -l nodes=1,walltime=0:15 stress -c 32 -t 10
[ADMISSION RULE] Computed global resource filter: -p "maintenance = 'NO'"
[ADMISSION_RULE] Computed resource request: -l {"((type = 'default') AND exotic = 'NO') AND production = 'NO'"}/host=1
Generate a job key...
OAR_JOB_ID=2646739

(in this example, the job will run the stress -c 32 -t 10 command)

Job status

The returned job id can be used to query OAR about the status of the job:

Terminal.png grenoble:
oarstat -j 2646739
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
2646739                   jdoe           2020-07-13 14:00:56 W default   

The W status above means that the job is waiting for its required resources to be available.

Terminal.png grenoble:
oarstat -j 2646739
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
2646739                   jdoe           2020-07-13 14:00:56 R default  

The R status means that the job is currently running.

Terminal.png grenoble:
oarstat -j 2646739
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
2646739                   jdoe           2020-07-13 14:00:56 T default   

And the T status means that the job has terminated.

Two other job statuses are worth mentioning:

  • E (error) which indicates that the job has been interrupted because it has exceeded its allocated walltime, or any other unexpected issue
  • F (finishing) which means that the job is about to terminate
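
For scripts that must wait for a job to end, the status can be polled. The sketch below assumes that `oarstat -s -j <id>` prints one line of the form `<id>: <State>` with full state names such as `Running` or `Terminated`; that output format, the 30-second default interval and the helper names are assumptions to verify on your site.

```shell
# Extract the state from a line such as "2646739: Running" (assumed format).
parse_state() {
  sed -n 's/^[0-9][0-9]*: *//p'
}

# Hypothetical helper: block until the job reaches a final state, then print it.
# $1 = job id, $2 = polling interval in seconds (default: 30)
wait_for_job() {
  job_id=$1
  interval=${2:-30}
  while :; do
    state=$(oarstat -s -j "$job_id" | parse_state)
    case "$state" in
      Terminated|Error)
        echo "$state"
        return 0
        ;;
    esac
    sleep "$interval"
  done
}
```

On a frontend, `wait_for_job 2646739` would return once the job from the example above has terminated or failed. Keep the polling interval reasonably long to avoid loading the OAR server.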

Job stdout and stderr streams

Standard output and error of the job are stored in files named after the job id returned by OAR:

Terminal.png grenoble:
cat OAR.2646739.stdout

for instance contains:

stress: info: [7005] dispatching hogs: 32 cpu, 0 io, 0 vm, 0 hdd
stress: info: [7005] successful run completed in 10s

Terminal.png grenoble:
cat OAR.2646739.stderr

is empty because nothing was written by the command on the standard error stream.

GNU Parallel on top of OAR

GNU Parallel is a well-known tool that can be used to execute tasks (shell scripts, executables) in parallel, on one or multiple hosts.

To a certain extent, GNU Parallel can be used with no particular assumption about the underlying infrastructure. But when combined with the OAR properties of the reserved resources, it can distribute tasks efficiently over several hosts, thanks to OAR's knowledge of the infrastructure.

Using GNU Parallel on top of OAR is described in detail in our GNU Parallel tutorial.

Compared to the other solutions described below, GNU Parallel is well suited to parallelizing tasks on Grid'5000 because it needs only one reservation (minimal overhead) and it takes care of distributing the computing load over the reserved resources. But as always, there is no silver bullet, and in some use cases (for instance when the number of tasks of an experiment is discovered at runtime) GNU Parallel is not the best tool at hand.
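A minimal sketch of the idea, to be run inside an OAR job. It assumes that `$OAR_NODEFILE` lists the reserved hosts (possibly with repeated lines), that `oarsh` is the connector used to reach the other nodes of the job, and that `./task.sh` is a hypothetical script, located in the NFS-shared home directory, that takes one argument. The helper names are illustrative; the GNU Parallel tutorial remains the reference for the recommended invocation.

```shell
# Hypothetical helper: reduce an OAR node file to one line per host.
unique_hosts() {
  sort -u "$1"
}

# Hypothetical helper: run ./task.sh once per argument, spread over the hosts
# of the current job. Requires GNU Parallel and an active OAR job.
run_tasks() {
  hosts_file=$(mktemp)
  unique_hosts "$OAR_NODEFILE" > "$hosts_file"
  parallel --ssh oarsh --sshloginfile "$hosts_file" ./task.sh {} ::: "$@"
  rc=$?
  rm -f "$hosts_file"
  return "$rc"
}

# Only start tasks when running inside an OAR job.
if [ -n "${OAR_NODEFILE:-}" ]; then
  run_tasks 1 2 3 4
fi
```

A single reservation then covers all four tasks, instead of one OAR job per task.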

Besteffort

Best effort jobs can be used to run experiments out of usage policy quota or time restriction, on any resources, including those in the production queue.

The counterpart is that a best effort job can be interrupted (killed) at any time if the resources it uses are requested to start a non-besteffort job.

It is up to the user to decide how to manage this situation:

  • (A) regularly checkpoint the state of the experiment in order to restart it manually from the last known good state
  • add a checkpoint/restart mechanism to the experiment to be able to automatically resume the experiment, either using:
    • (B) the OAR idempotent/checkpointing functionalities (OAR will restart the experiment until the computation is declared as terminated by the experiment)
    • (C) an external custom task scheduler designed to break out the experiment into multiple best-effort jobs

The choice between these ways of using the best effort mechanism depends on how much effort the user is ready to invest and on the nature of the experiment.

(A) can be sufficient if the experiment can simply be monitored by a human. The advantage is that it may only require modifying the experiment (if necessary) so that it produces regular checkpoints of its state (e.g. the state can be saved as a transaction in a relational database hosted on a user virtual machine; some tools already have a state saving mechanism in place and can be interrupted at any time).

(B) injects more knowledge of OAR into the experiment, in the sense that it must be able 1) to catch a signal sent by OAR to trigger the checkpointing mechanism before halting, and 2) to restart from the last known good state using the same command line initially provided to oarsub (idempotent). This is quite easy to do, so (B) is close to (A) in terms of the effort needed to add best effort capability to an experiment.
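The two requirements of (B) can be sketched as a small job script. This is an illustration under stated assumptions, not the reference method: it assumes that OAR delivers signal 12 (SIGUSR2 on Linux) before halting the job and that an idempotent job is resubmitted when it exits with code 99; both points, as well as the oarsub options shown in the comment, should be checked against the Advanced OAR document. The file name `job.state` and the function names are hypothetical.

```shell
# Assumed submission command (options to verify in the Advanced OAR document):
#   oarsub -t besteffort -t idempotent --checkpoint 120 --signal 12 ./job.sh

STATE_FILE=${STATE_FILE:-./job.state}   # hypothetical checkpoint file
TOTAL_STEPS=${TOTAL_STEPS:-5}           # hypothetical amount of work

# Read the last completed step (0 when starting from scratch).
load_step() {
  if [ -f "$STATE_FILE" ]; then cat "$STATE_FILE"; else echo 0; fi
}

# Record the last completed step.
save_step() {
  echo "$1" > "$STATE_FILE"
}

# Remember that OAR asked the job to stop (signal 12 is assumed).
stop_requested=0
trap 'stop_requested=1' USR2

# Resume from the last checkpoint; return 99 to ask for a restart,
# 0 when all the work is done.
run_job() {
  step=$(load_step)
  while [ "$step" -lt "$TOTAL_STEPS" ]; do
    # ... one unit of real work would go here ...
    step=$((step + 1))
    save_step "$step"
    if [ "$stop_requested" -eq 1 ]; then
      return 99
    fi
  done
  return 0
}

# In a real job script, finish with:
#   run_job; exit $?
```

Because the state is saved after every step, the same command line can be replayed as many times as needed, which is what makes the job idempotent.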

(C) is more complex to set up, but designing a best-effort task scheduler can be a good investment for a team that manages large-scale campaigns of parallel best effort experiments. A very successful example on Grid'5000 is the work of the Caramba team for their 2020 records: the factorization of RSA-240, a 795-bit number, and a discrete logarithm computation over a 795-bit prime field (see https://ia.cr/2020/697, section 4.2).

The interested reader can refer to this section of the Advanced OAR usage document (https://www.grid5000.fr/w/Advanced_OAR#Using_best_effort_mode_jobs) to learn more about using best effort jobs on Grid'5000.

Job containers

Warning.png Warning

Job containers are currently deactivated

The job container type allows the user to reserve a set of resources of interest with a single oarsub command, and then launch multiple (inner) jobs within the job container. More information on this type of job can be found in the Advanced OAR document.

Job containers are mainly useful for reserving resources for tutorials or teaching labs (see the Tutorial or teaching labs How-To), because they allow reserving/sharing resources for/with multiple users. However, they can be handy for HPC tasks whenever the jobs of an experiment have to be submitted on-line (not all at once at the beginning), since that use case is not supported using GNU Parallel.

Gotchas and good practices

Jobs requiring root privileges

Making use of the root privileges (via sudo-g5k or after kadeploying) on a node implies costly operations such as rebooting or even redeploying the reserved node after usage. This is perfectly fine as Grid'5000 is designed for that purpose, but depending on the nature of tasks to perform, it might be overkill. Indeed, using the root privilege can be avoided by using several simple techniques especially relevant in the HPC context:

  • Grid'5000 provides Environment modules, thus users may just have to look at what modules are available to add some software components in their runtime without becoming root, just like in many classical HPC platforms. Furthermore, if the Environment modules library does not include a commonly used module, one can make a request to support-staff@lists.grid5000.fr to provide it.
  • Tooling like Conda/Bioconda (multiple languages), Virtualenv (Python), sdkman (Java ecosystem), rvm (Ruby), etc. can be used to install packages in the home directory without root privileges. Since home directories are shared via NFS, such a user installation will be available on all nodes.
  • Singularity, which is available through an Environment module on Grid'5000, is also a great way to build, deploy and run elaborate pieces of software without requiring root privileges on the node.

To sum up: classical HPC platforms do not provide root access, so a classic HPC workload should typically not require root privileges. However, a strength of Grid'5000 that differentiates it from other HPC-capable platforms is that it allows acquiring root privileges on the bare-metal machines if some tasks need them.

Short lived OAR jobs

It is bad practice on Grid'5000 to submit a large number of small OAR jobs, typically lasting less than 10 minutes each. The overhead can be significant compared to the elapsed time of each job, especially if each job causes a reboot or a redeployment of the system (jobs of type deploy, or using sudo-g5k). This can also slow down the OAR server and the services that depend on it, such as the Gantt web interface.

The best option is to use GNU Parallel on top of OAR: submit OAR jobs at a coarser grain (more than 10 minutes, preferably on full nodes rather than single cores) and run the smaller tasks (in duration and resource granularity) within those OAR jobs.

Also, if a job requires modifying the nodes in privileged mode (root), it is preferable to run longer jobs that perform the deployment or the sudo-g5k call once at the beginning and then launch the smaller tasks, rather than having every small job repeat the initial privileged operations. Note that when reproducibility is at stake, one may want to reboot or redeploy before every run of a task, but this does not require a new job: within a single deploy job, one can run kareboot and kadeploy as many times as needed.

Full node reservation vs smaller resource reservation

In order to limit resource fragmentation, and since many jobs on Grid'5000 require full nodes (a requirement for a deployment or for using sudo-g5k), it is advised to submit jobs using full nodes rather than smaller resources such as cores. Reserving at a finer granularity is wise only when a user wants just 1 core or, more relevantly, 1 GPU for a single run, so that the other resources of the node are not wasted.

Whenever an experiment is made of many tasks that each run at the GPU or core grain, it is wiser to use a coarser grain OAR job (full nodes) and run the small tasks with GNU Parallel inside that job.

Big datasets

The user quota on each site frontend ranges from 25GB to a maximum of 200GB (the User Management Service can be used to ask for more disk space). It is worth noting that users' home directories are located on an NFS file server and, as such, are accessible from every reserved node when using an NFS-capable environment.

Above that limit of 200GB, it is possible to use the Group Storage service, also based on NFS, which is very flexible, and designed to encourage team data sharing on a larger scale (multiple TBs).

User data on compute nodes

Experiment data can be transferred to and produced directly on the local disks of the reserved nodes.

In all cases, user data created on the node is deleted after the reservation, either by removing all the user's files in the user-writable directories (e.g. /tmp) when the node has been used without root rights, or by redeploying the full node at the end of the reservation if the node has been deployed by the user or if sudo-g5k has been used.

On some clusters, it is also possible to reserve dedicated local disks for up to 2 weeks, independently of the node reservation itself (of course it is possible to reserve both the node and its local reservable disks at the same time).

Note that these local disks are not accessible to users who have not reserved them. (This is not strictly true, as a skilled and malicious user can bypass this limitation; if that happens, don't hesitate to report the situation to support-staff@lists.grid5000.fr.)

Backup

Grid'5000 does not offer a user data backup service, so you have to take care of backups yourself.

File transfers between Grid'5000 and HPC centers

It is of course more efficient to transfer data directly from Grid'5000 to the HPC center you are using than to let the data transit via your lab during the transfer (be it done in two steps or in a single one, using for instance scp -3).

Note that Grid'5000 allows anybody to initiate data transfer using the SCP/SFTP protocols (e.g. with rsync) to a target in the Internet. However, you may encounter difficulties at the target entrance, due to security constraints imposed by the HPC center.

We provide an entry in our FAQ to describe how to proceed in the case of the Jean Zay supercomputer (and probably other GENCI supercomputers).

Please don't hesitate to send an e-mail to support-staff about your data transfer experience with an HPC center (successful or not); we will add it to the FAQ.

Using HPC hardware and software on Grid'5000

Usual HPC software and hardware can be used on Grid'5000; the following list of tutorials may help the interested reader to quickly use them: