HPC and HTC tutorial: Difference between revisions
(21 intermediate revisions by 6 users not shown) | |||
Line 5: | Line 5: | ||
{{TutorialHeader}} | {{TutorialHeader}} | ||
'''Grid'5000''' gives an easy access to a '''wide variety of hardware''' technologies and is particularly suitable to '''carry out HPC (high performance computing) experiments''': users can investigate '''parallel algorithms''', '''scalability problems''' or '''performance portability''' on Grid'5000. | '''Grid'5000''' gives an easy access to a '''wide variety of hardware''' technologies and is particularly suitable to '''carry out HPC (high performance computing) experiments''': users can investigate '''parallel algorithms''', '''scalability problems''' or '''performance portability''' on Grid'5000. | ||
The first intent of Grid'5000 is to be a testbed for '''experiment-driven research in all areas of computer science''' with a focus on parallel and distributed computing including Cloud, HPC and Big Data. | The first intent of Grid'5000 is to be a testbed for '''experiment-driven research in all areas of computer science''' with a focus on parallel and distributed computing including Cloud, HPC and Big Data. | ||
However, Grid'5000 offers such a '''large amount of resources''' that its idle resources can be used for workloads which are more '''production oriented''' (the goal is just to obtain results faster, regardless of the method used). | However, Grid'5000 offers such a '''large amount of resources''' that its idle resources can be used for workloads which are more '''production oriented''' (the goal is just to obtain results faster, regardless of the method used). | ||
Those include [https://en.wikipedia.org/wiki/High-throughput_computing HTC (High-throughput computing)] projects requiring the execution of a '''large number of loosely-coupled tasks''' (also called an ''embarrassingly parallel'' workload). | Those include [https://en.wikipedia.org/wiki/High-throughput_computing HTC (High-throughput computing)] projects requiring the execution of a '''large number of loosely-coupled tasks''' (also called an ''embarrassingly parallel'' workload). | ||
Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations of HPC centers. | Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations of HPC centers. | ||
In particular, Grid'5000 has a job scheduling policy that allows '''reservations in advance of resources''', which is useful for setting up an experiment on your own schedule and performing interactive tasks on the reserved resources. | In particular, Grid'5000 has a job scheduling policy that allows '''reservations in advance of resources''', which is useful for setting up an experiment on your own schedule and performing interactive tasks on the reserved resources. | ||
You can also '''reinstall''' cluster nodes and '''gain root access''' during the time of your jobs using [[Advanced_Kadeploy|Kadeploy]]. This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU) or simply ensure the reproducibility of your experiments by freezing their context. | You can also '''reinstall''' cluster nodes and '''gain root access''' during the time of your jobs using [[Advanced_Kadeploy|Kadeploy]]. This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU) or simply ensure the reproducibility of your experiments by freezing their context. | ||
In addition, Grid'5000 provides a set of tools for monitoring experiments. | In addition, Grid'5000 provides a set of tools for monitoring experiments. | ||
Line 25: | Line 25: | ||
The easiest way to get the global picture of the '''HPC systems available on Grid'5000''' is to consult the [[Hardware]] page. | The easiest way to get the global picture of the '''HPC systems available on Grid'5000''' is to consult the [[Hardware]] page. | ||
This page is built using the [[ | This page is built using the [[API|Grid'5000 Reference API]] and describes in detail the '''CPU''' models, '''network interfaces''' and '''accelerators''' of each cluster. | ||
You can also use the [https://api.grid5000.fr/sid/ui/quick-start.html API Quick Start page] as it provides advanced filters for selecting nodes by hardware capability. Alternatively, you can parse the [[ | You can also use the [https://api.grid5000.fr/sid/ui/quick-start.html API Quick Start page] as it provides advanced filters for selecting nodes by hardware capability. Alternatively, you can parse the [[API_tutorial|Grid'5000 Reference API]] yourself to discover the available resources on each site. | ||
= Resource reservation = | = Resource reservation = | ||
Resource reservation using the '''OAR scheduler''' is covered by the [[Getting_Started|Getting Started]] tutorial and more deeply in the [[Advanced_OAR|Advanced OAR]] document. | Resource reservation using the '''OAR scheduler''' is covered by the [[Getting_Started|Getting Started]] tutorial and more deeply in the [[Advanced_OAR|Advanced OAR]] document. | ||
You can select specific hardware by using the "-p" (properties) option of the <code class="command">oarsub</code> command. | You can select specific hardware by using the "-p" (properties) option of the <code class="command">oarsub</code> command. | ||
Line 40: | Line 40: | ||
Here is a non-exhaustive list of OAR properties for HPC experiments: | Here is a non-exhaustive list of OAR properties for HPC experiments: | ||
* CPU: cpuarch, cpucore, cpufreq, cputype | * CPU: <code>cpuarch</code>, <code>cpucore</code>, <code>cpufreq</code>, <code>cputype</code> | ||
* Memory (RAM in MB): memnode (memory per node), memcpu (per cpu), memcore (per core) | * Memory (RAM in MB): <code>memnode</code> (memory per node), <code>memcpu</code> (per cpu), <code>memcore</code> (per core) | ||
* Network: | * Network: | ||
** eth_count (number of ethernet interfaces), eth_rate (rate of the fastest ethernet interface available on the node) | ** <code>eth_count</code> (number of ethernet interfaces), <code>eth_rate</code> (rate of the fastest ethernet interface available on the node) | ||
** ib_count (number of InfiniBand interfaces), ib = {'NO', 'SDR', 'DDR', 'QDR', 'FDR'} (the InfiniBand technology available), ib_rate = {0, 10, 20, 40, 56} (max rate in Gbit/s) | ** <code>ib_count</code> (number of InfiniBand interfaces), <code>ib = {'NO', 'SDR', 'DDR', 'QDR', 'FDR'}</code> (the InfiniBand technology available), <code>ib_rate = {0, 10, 20, 40, 56}</code> (max rate in Gbit/s) | ||
* Accelerator: gpu_model (model of GPU), gpu_count (number of GPU per node | * Accelerator: <code>gpu_model</code> (model of GPU), <code>gpu_count</code> (number of GPU per node) | ||
For example, you can make a reservation at Lyon for a GPU node using: | For example, you can make a reservation at Lyon for a GPU node using: | ||
Line 53: | Line 53: | ||
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> -I <code class="replace">-p "memnode > 256000"</code>}} | {{Term|location=fnancy|cmd=<code class="command">oarsub</code> -I <code class="replace">-p "memnode > 256000"</code>}} | ||
Some resources are available only | Some resources are available only through specific settings: | ||
Example to get a reservation for a host with GPU in Nancy (available in production queue only): | * '''production queue''', currently only in Nancy: access to production resources is reserved only to members of a [[Grid5000:UsagePolicy#Privilege_levels_table|Gold level Group Granting Access]] | ||
* '''exotic resources''': these resources are special in some way (non-x86 architecture, exotic or rare hardware), so they are only usable by experimenters that explicitly ask resources of type "exotic". | |||
Example to get a reservation for a host with a GPU in Nancy (available in production queue only): | |||
{{Term|location=fnancy|cmd=<code class="command">oarsub</code> -I <code class="replace">-p "gpu_count > 0" -q production</code>}} | {{Term|location=fnancy|cmd=<code class="command">oarsub</code> -I <code class="replace">-p "gpu_count > 0" -q production</code>}} | ||
Or to get a reservation on our ARM cluster in Lyon | Or to get a reservation on our ARM cluster in Lyon | ||
{{Term|location=flyon|cmd=<code class="command">oarsub</code> -I <code class="replace">-p | {{Term|location=flyon|cmd=<code class="command">oarsub</code> -I <code class="replace">-p pyxis -t exotic</code>}} | ||
Reservations are subject to rules described in the [[Grid5000:UsagePolicy|Usage Policy]] document. Computational quota on Grid'5000 differs from what is offered traditionally in HPC centers; the idea for the default queue (available in all sites) is to let the resources usable by a lot of different people during the week day to prepare experiments or run limited experiment (no more than 2 core hours of a full cluster), and to use nights and week-ends for out of quota experiments. | Reservations are subject to rules described in the [[Grid5000:UsagePolicy|Usage Policy]] document. Computational quota on Grid'5000 differs from what is offered traditionally in HPC centers; the idea for the default queue (available in all sites) is to let the resources usable by a lot of different people during the week day to prepare experiments or run limited experiment (no more than 2 core hours of a full cluster), and to use nights and week-ends for out of quota experiments. | ||
A week-end may feel like a short time for HPC jobs. For longer jobs, Grid'5000 offers the following solutions: | |||
* use, if granted, the '''production queue''': | * use, if granted, the '''production queue''': there is no daily quota, but jobs are limited to one week of computation, and not on a full cluster (eg: 2/3 of a 64 nodes cluster) | ||
* use the so-called '''best-effort mode''' to submit jobs, available on every site, in every queue, but | * use the so-called '''best-effort mode''' to submit jobs, available on every site, in every queue, but this requires the user to adapt his/her experiment to this mode of computation (more on this below) | ||
= Different ways to use Grid'5000 as an HPC production system = | = Different ways to use Grid'5000 as an HPC production system = | ||
Line 72: | Line 75: | ||
== Oarsub == | == Oarsub == | ||
A simple way to submit a job on Grid'5000 is to use the oarsub command: | A simple way to submit a job on Grid'5000 is to use the oarsub command: | ||
{{Term|location=grenoble|cmd=<code class="command">oarsub</code> <code class="replace">-l nodes=1,walltime=0:15</code> <code class="replace">stress -c 32 -t 10</code>}} | {{Term|location=grenoble|cmd=<code class="command">oarsub</code> <code class="replace">-l nodes=1,walltime=0:15</code> <code class="replace">"stress -c 32 -t 10"</code>}} | ||
<pre> | <pre> | ||
# Filtering out exotic resources (drac, troll, yeti). | |||
OAR_JOB_ID=2646739 | OAR_JOB_ID=2646739 | ||
</pre> | </pre> | ||
Line 91: | Line 92: | ||
Job id Name User Submission Date S Queue | Job id Name User Submission Date S Queue | ||
---------- -------------- -------------- ------------------- - ---------- | ---------- -------------- -------------- ------------------- - ---------- | ||
2646739 jdoe 2020-07-13 14:00:56 W default | 2646739 jdoe 2020-07-13 14:00:56 W default | ||
</pre> | </pre> | ||
Line 100: | Line 101: | ||
Job id Name User Submission Date S Queue | Job id Name User Submission Date S Queue | ||
---------- -------------- -------------- ------------------- - ---------- | ---------- -------------- -------------- ------------------- - ---------- | ||
2646739 jdoe 2020-07-13 14:00:56 R default | 2646739 jdoe 2020-07-13 14:00:56 R default | ||
</pre> | </pre> | ||
Line 109: | Line 110: | ||
Job id Name User Submission Date S Queue | Job id Name User Submission Date S Queue | ||
---------- -------------- -------------- ------------------- - ---------- | ---------- -------------- -------------- ------------------- - ---------- | ||
2646739 jdoe 2020-07-13 14:00:56 T default | 2646739 jdoe 2020-07-13 14:00:56 T default | ||
</pre> | </pre> | ||
And the '''T''' status means that the job has terminated. | And the '''T''' status means that the job has terminated. | ||
Two other job statuses are worth mentioning: | Two other job statuses are worth mentioning: | ||
Line 130: | Line 131: | ||
== GNU Parallel on top of OAR == | == GNU Parallel on top of OAR == | ||
GNU Parallel is a well-known tool that can be used to execute tasks (shell scripts, executables) in parallel, on one or multiple hosts. | GNU Parallel is a well-known tool that can be used to execute tasks (shell scripts, executables) in parallel, on one or multiple hosts. | ||
To a certain extent, GNU Parallel can be used with no particular assumption about the underlying infrastructure. But when combined with OAR properties of reserved resources, it is possible to distribute tasks on several hosts with an optimal distribution and in a very efficient way, thanks to the OAR's knowledge of the infrastructure. | To a certain extent, GNU Parallel can be used with no particular assumption about the underlying infrastructure. But when combined with OAR properties of reserved resources, it is possible to distribute tasks on several hosts with an optimal distribution and in a very efficient way, thanks to the OAR's knowledge of the infrastructure. | ||
Using GNU Parallel on top of OAR is described in detail in our [[GNU_Parallel|GNU Parallel tutorial]]. | Using GNU Parallel on top of OAR is described in detail in our [[GNU_Parallel|GNU Parallel tutorial]]. | ||
Compared to other solutions described | Compared to other solutions described below, '''GNU Parallel is optimal at parallelizing task on Grid5000''' because it needs only one reservation (minimal overhead), and it takes care of the distribution of computing load on reserved resources. But as always, there is no such thing as a silver bullet, and in some use cases (for instance when the number of tasks of an experiment is discovered at runtime) GNU Parallel is not the best tool at hand. | ||
== Besteffort == | == Besteffort == | ||
Best effort jobs can be used to run experiments | Best-effort jobs can be used to run experiments outside of usage policy quota or time restriction, on any resources, including those in the production queue. | ||
The | The tradeoff is that a best-effort job can be interrupted at any time (killed) if its resources are needed by a regular job. | ||
It is up to the user to decide how to manage this situation: | It is up to the user to decide how to manage this situation: | ||
Line 150: | Line 151: | ||
** (B) the OAR idempotent/checkpointing functionalities (OAR will restart the experiment until the computation is declared as terminated by the experiment) | ** (B) the OAR idempotent/checkpointing functionalities (OAR will restart the experiment until the computation is declared as terminated by the experiment) | ||
** (C) an external custom task scheduler designed to break out the experiment into multiple best-effort jobs | ** (C) an external custom task scheduler designed to break out the experiment into multiple best-effort jobs | ||
How to choose between these different ways of using the best effort mechanism depends on what the user is ready to invest into it and the nature of the experiment. | How to choose between these different ways of using the best effort mechanism depends on what the user is ready to invest into it and the nature of the experiment. | ||
(A) can be sufficient if the experiment can just be monitored by a human. The advantage is that it may require only to modify the experiment (if necessary) to produce regular checkpoints of its state (e.g. the state can be saved as a transaction in a relational database hosted on a user virtual machine; but some tools have | Option (A) can be sufficient if the experiment can just be monitored by a human. The advantage is that it may require only to modify the experiment (if necessary) to produce regular checkpoints of its state (e.g. the state can be saved as a transaction in a relational database hosted on a user virtual machine; but some tools already have a state-saving mechanism in place and can be interrupted at any time). | ||
(B) | Option (B) requires closer cooperation between OAR and the experiment. The experiment must be able: 1) to catch a signal sent by OAR to trigger the checkpointing mechanism before halt, 2) to restart from the previous last known good state using the same command line provided initially to oarsub (idempotent). This is relatively easy to do; the effort required to implement option (B) in an experiment is similar to option (A), because the difficult part is to implement a checkpoint/restart mechanism in the first place. | ||
(C) is more complex to set up, but it may be a good investment for a team to design a best-effort task scheduler to manage large-scale campaigns of parallel best-effort experiments. A very successful example of that on Grid'5000 is what has been done by the Caramba Team for their 2020 records (the factorization of RSA-240, a 795-bit number, and a discrete logarithm computation over a 795-bit prime field; see https://ia.cr/2020/697, section 4.2). | Option (C) is more complex to set up, but it may be a good investment for a team to design a best-effort task scheduler to manage large-scale campaigns of parallel best-effort experiments. A very successful example of that on Grid'5000 is what has been done by the Caramba Team for their 2020 records (the factorization of RSA-240, a 795-bit number, and a discrete logarithm computation over a 795-bit prime field; see https://ia.cr/2020/697, section 4.2). | ||
The interested reader can refer to this section of the Advanced OAR usage document (https://www.grid5000.fr/w/Advanced_OAR#Using_best_effort_mode_jobs) to know more about how | The interested reader can refer to this section of the Advanced OAR usage document (https://www.grid5000.fr/w/Advanced_OAR#Using_best_effort_mode_jobs) to know more about how to use best-effort jobs on Grid'5000. | ||
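As a minimal sketch of option (B), the script below saves its progress when it receives a checkpoint signal and resumes from the saved state when it is started again with the same command line. The signal (SIGUSR2), the exit code 99 and the state file name are assumptions of this sketch and should be checked against the Advanced OAR document; <code>do_one_step</code> is a hypothetical placeholder for the real computation.

```shell
#!/bin/sh
# Hypothetical state file; a path in the NFS home directory survives a restart
# on another node. Both defaults below are assumptions of this sketch.
STATE_FILE="${STATE_FILE:-./experiment.state}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Placeholder for one unit of real work (replace with the actual computation).
do_one_step() {
    :
}

# Write the number of completed steps to the state file.
save_state() {
    printf '%s\n' "$step" > "$STATE_FILE"
}

# Resume from the last saved step if a state file exists, else start at 0.
step=0
if [ -f "$STATE_FILE" ]; then
    step=$(cat "$STATE_FILE")
fi

# On the checkpoint signal, save progress and exit with code 99
# (assumed here to be the code that asks for a resubmission).
on_checkpoint() {
    save_state
    exit 99
}
trap on_checkpoint USR2

while [ "$step" -lt "$TOTAL_STEPS" ]; do
    do_one_step
    step=$((step + 1))
    save_state
done

# All steps are finished: remove the state file so a later run starts afresh.
rm -f "$STATE_FILE"
echo "experiment finished after $step steps"
```

Such a script would typically be submitted with both the besteffort and idempotent job types and a checkpoint delay, for instance <code>oarsub -t besteffort -t idempotent --checkpoint 60 "./experiment.sh"</code> (the script name is hypothetical). Because the same command line both starts and resumes the computation, a restart by OAR needs no extra argument.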
== Job containers == | == Job containers == | ||
The job container type allows the user to reserve a set of resources of interest with a single oarsub command, and then launch multiple (inner) jobs within the job container. More information on this type of job can be found in the [[Advanced_OAR#Container_jobs|Advanced OAR]] document. | The job container type allows the user to reserve a set of resources of interest with a single oarsub command, and then launch multiple (inner) jobs within the job container. More information on this type of job can be found in the [[Advanced_OAR#Container_jobs|Advanced OAR]] document. | ||
Line 175: | Line 174: | ||
Making use of the root privileges (via sudo-g5k or after kadeploying) on a node implies costly operations such as rebooting or even redeploying the reserved node after usage. This is perfectly fine as Grid'5000 is designed for that purpose, but depending on the nature of tasks to perform, it might be overkill. Indeed, using the root privilege can be avoided by using several simple techniques especially relevant in the HPC context: | Making use of the root privileges (via sudo-g5k or after kadeploying) on a node implies costly operations such as rebooting or even redeploying the reserved node after usage. This is perfectly fine as Grid'5000 is designed for that purpose, but depending on the nature of tasks to perform, it might be overkill. Indeed, using the root privilege can be avoided by using several simple techniques especially relevant in the HPC context: | ||
* Grid'5000 provides | * Grid'5000 provides [[Modules]]: you may just have to look at what modules are available to add some software components in your runtime without becoming root, just like in many classical HPC platforms. Furthermore, if the Modules library does not include a commonly used module you need, don't hesitate to ask support-staff@lists.grid5000.fr to add it. | ||
* Tooling like Conda/Bioconda (Multiple languages), Virtualenv (Python), sdkman (Java ecosystem), rvm (Ruby) etc ... can be used to install packages without the root privileges in the home directory. Since the home directory are shared via NFS, that ''user'' installation will be available on all nodes. | * Tooling like [[conda|Conda/Bioconda]] (Multiple languages), Virtualenv (Python), sdkman (Java ecosystem), rvm (Ruby) etc ... can be used to install packages without the root privileges in the home directory. Since the home directory are shared via NFS, that ''user'' installation will be available on all nodes. | ||
* Singularity, which is available using a Environment Module on Grid'5000, is also a great way to build, deploy and run elaborate pieces of software without requiring the root privileges on the node. | * [[Singularity]], which is available using a Environment Module on Grid'5000, is also a great way to build, deploy and run elaborate pieces of software without requiring the root privileges on the node. | ||
To sum up: remember that classical HPC platform do not provide access to the root privileges, hence, a classic HPC usage typically should not require the root privileges. However, a strength of Grid'5000 that differentiate it | To sum up: remember that classical HPC platform do not provide access to the root privileges, hence, a classic HPC usage typically should not require the root privileges. However, a strength of Grid'5000 that differentiate it from other HPC-capable platform is that it allows acquiring the root privileges on the bare-metal hardware machines if some tasks need it. | ||
== Short lived OAR jobs == | == Short lived OAR jobs == | ||
Line 185: | Line 184: | ||
It is a bad practice on Grid'5000 to submit a high number of OAR small jobs, which typically will last less than 10 minutes. Indeed the overhead can be significant compared to the elapsed time of each jobs, especially if each job causes a reboot or a redeployment of the system (jobs of type deploy, or using sudo-g5k). Not to mention the fact that this can slow down the overall performance of the OAR server and its derivatives, such as the Gantt web interface. | It is a bad practice on Grid'5000 to submit a high number of OAR small jobs, which typically will last less than 10 minutes. Indeed the overhead can be significant compared to the elapsed time of each jobs, especially if each job causes a reboot or a redeployment of the system (jobs of type deploy, or using sudo-g5k). Not to mention the fact that this can slow down the overall performance of the OAR server and its derivatives, such as the Gantt web interface. | ||
The best option is to '''use GNU Parallel on top of OAR''' and as such submit OAR jobs at a coarser grain (more than 10 | The best option is to '''use GNU Parallel on top of OAR''' and as such submit OAR jobs at a coarser grain (more than 10 minutes, preferably on full nodes rather than a core) and possibly running the smaller tasks (in duration and resources granularity) within those OAR jobs. | ||
Also if a job requires a modification of the nodes in privileged mode (root), it is preferable to run longer jobs with a deployment or the use of sudo-g5k at the beginning, and then launch the smaller tasks, rather than having every small job do the initial | Also if a job requires a modification of the nodes in privileged mode (root), it is preferable to run longer jobs with a deployment or the use of sudo-g5k at the beginning, and then launch the smaller tasks, rather than having every small job do the initial privileged operations. Note that when reproducibility is at stake, one may want the reboot and redeploy before every run of a task, but this in turn does not require a new job: in a single deploy job, one can use kareboot and kadeploy as many times as needed. | ||
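A minimal sketch of this coarse-grain approach: inside a single OAR job, the list of reserved hosts is derived from the node file and handed to GNU Parallel, which then runs the small tasks. <code>$OAR_NODEFILE</code> and the <code>oarsh</code> connector come from OAR; <code>./task.sh</code> and the fake node names are hypothetical, and the command is only printed here rather than executed.

```shell
#!/bin/sh
# Outside an OAR job, fall back to a fake node file so the sketch still runs.
if [ -z "${OAR_NODEFILE:-}" ]; then
    OAR_NODEFILE=$(mktemp)
    printf '%s\n' node-1 node-1 node-2 node-2 > "$OAR_NODEFILE"
fi

# The node file lists a host once per reserved core: keep each host only once.
hosts=$(mktemp)
sort -u "$OAR_NODEFILE" > "$hosts"
host_count=$(wc -l < "$hosts" | tr -d ' ')

# Print the GNU Parallel command that would spread 100 short tasks over the
# reserved hosts (./task.sh is a hypothetical task script taking one argument).
printf 'parallel --ssh oarsh --sshloginfile %s ./task.sh {} ::: $(seq 1 100)\n' "$hosts"
echo "distinct hosts: $host_count"
```

The point of the design is that the 100 tasks cost a single OAR submission; the [[GNU_Parallel|GNU Parallel tutorial]] remains the reference for the options that suit Grid'5000.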
== Full node reservation vs smaller resource reservation == | == Full node reservation vs smaller resource reservation == | ||
Line 197: | Line 196: | ||
== Big datasets == | == Big datasets == | ||
Users quota on each site frontend ranges from 25GB to a maximum of | Users quota on each site frontend ranges from 25GB to a maximum of 200GB ([https://api.grid5000.fr/stable/users/#myaccount the User Management Service] can be used to ask for more disk space). It is worth noting that users's home directories are located on a NFS file server, and as such, are accessible from every reserved node when using an [[Getting_Started#Deploying_nodes_with_Kadeploy|NFS capable environment]]. | ||
Above that limit of 200GB, it is possible to use the [[Group_Storage|Group Storage]] service, also based on NFS, which is very flexible, and designed to encourage team data sharing on a larger scale (multiple TBs). | Above that limit of 200GB, it is possible to use the [[Group_Storage|Group Storage]] service, also based on NFS, which is very flexible, and designed to encourage team data sharing on a larger scale (multiple TBs). | ||
Line 207: | Line 206: | ||
In all cases, user's data created on the node are deleted after the reservation, either by the removal of all the user's files in the user writable directories (e.g. /tmp) when the node has been used without requiring root rights, or by redeploying the full node at the end of the reservation if the node has been [[Getting_Started#Deploying_nodes_with_Kadeploy|deployed by the user]] or if [[Sudo-g5k|sudo-g5k]] has been used. | In all cases, user's data created on the node are deleted after the reservation, either by the removal of all the user's files in the user writable directories (e.g. /tmp) when the node has been used without requiring root rights, or by redeploying the full node at the end of the reservation if the node has been [[Getting_Started#Deploying_nodes_with_Kadeploy|deployed by the user]] or if [[Sudo-g5k|sudo-g5k]] has been used. | ||
On some clusters, it is also possible '''[[Disk_reservation|to reserve dedicated local disks]]''' up to 2 weeks, independently of the node reservation itself (of course it is possible to reserve both the node and its local reservable disks at the same time). | On some clusters, it is also possible '''[[Disk_reservation|to reserve dedicated local disks]]''' up to 2 weeks, independently of the node reservation itself (of course it is possible to reserve both the node and its local reservable disks at the same time). | ||
It is worth noting that these local disk are not accessible to users which have not reserved them (this is not quite true as a skilled and malicious user can bypass this limitation, if it happens, don't hesitate to [[Disk_reservation#Security_issues|report the situation]] to support-staff@grid5000.fr). | It is worth noting that these local disk are not accessible to users which have not reserved them (this is not quite true as a skilled and malicious user can bypass this limitation, if it happens, don't hesitate to [[Disk_reservation#Security_issues|report the situation]] to support-staff@lists.grid5000.fr). | ||
== Backup == | == Backup == | ||
Line 221: | Line 220: | ||
Note that Grid'5000 allows anybody to initiate data transfer using the SCP/SFTP protocols (e.g. with <code class=command>rsync</code>) to a target in the Internet. However, you may encounter difficulties at the target entrance, due to security constraints imposed by the HPC center. | Note that Grid'5000 allows anybody to initiate data transfer using the SCP/SFTP protocols (e.g. with <code class=command>rsync</code>) to a target in the Internet. However, you may encounter difficulties at the target entrance, due to security constraints imposed by the HPC center. | ||
We provide an entry in our FAQ to describe how to proceed in the case of the [[FAQ#Access_to_the_Jean_Zay_supercomputer_.28and_possibly_others_GENCI_supercomputers.29|Jean Zay supercomputer]] (and probably other GENCI supercomputers). | We provide an entry in our FAQ to describe how to proceed in the case of the [[FAQ#Access_to_the_Jean_Zay_supercomputer_.28and_possibly_others_GENCI_supercomputers.29|Jean Zay supercomputer]] (and probably other GENCI supercomputers). | ||
Please, don't hesitate to drop an e-mail at [mailto:support-staff@lists.grid5000.fr support-staff] about your data transfer experience related to an HPC center (successful or not); we will add it to the FAQ. | |||
== Sensitive data == | |||
Remember that sensitive data cannot be processed by Grid'5000 nodes unless [[Armored Node for Sensitive Data|specific safeguards are used]]. | |||
= Using HPC hardware and software on Grid'5000 = | = Using HPC hardware and software on Grid'5000 = | ||
Line 232: | Line 235: | ||
* Installing [[Deep Learning Frameworks]] on Grid'5000 | * Installing [[Deep Learning Frameworks]] on Grid'5000 | ||
* Running [[Run_MPI_On_Grid'5000|MPI applications on Grid'5000]] | * Running [[Run_MPI_On_Grid'5000|MPI applications on Grid'5000]] | ||
* Using software provided as [[ | * Using software provided as [[Modules]] | ||
* Installing software with [[Conda]] | |||
* Using [[Guix|Guix package manager]] | |||
* SPMD / Many Task computing using [[GNU Parallel]] on top of OAR | * SPMD / Many Task computing using [[GNU Parallel]] on top of OAR | ||
* Using [[Singularity]] (thanks to [[ | * Using [[Singularity]] (thanks to [[Modules]]) |
Latest revision as of 14:16, 7 March 2024
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Grid'5000 gives easy access to a wide variety of hardware technologies and is particularly suitable for carrying out HPC (high performance computing) experiments: users can investigate parallel algorithms, scalability problems or performance portability on Grid'5000. The first intent of Grid'5000 is to be a testbed for experiment-driven research in all areas of computer science with a focus on parallel and distributed computing including Cloud, HPC and Big Data.
However, Grid'5000 offers such a large amount of resources that its idle resources can be used for workloads which are more production oriented (the goal is just to obtain results faster, regardless of the method used).
Those include HTC (High-throughput computing) projects requiring the execution of a large number of loosely-coupled tasks (also called an embarrassingly parallel workload).
Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations of HPC centers.
In particular, Grid'5000 has a job scheduling policy that allows reservations in advance of resources, which is useful for setting up an experiment on your own schedule and performing interactive tasks on the reserved resources.
You can also reinstall cluster nodes and gain root access during the time of your jobs using Kadeploy. This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU) or simply ensure the reproducibility of your experiments by freezing their context.
In addition, Grid'5000 provides a set of tools for monitoring experiments.
Resource available on Grid'5000
The easiest way to get the global picture of the HPC systems available on Grid'5000 is to consult the Hardware page.
This page is built using the Grid'5000 Reference API and describes in detail the CPU models, network interfaces and accelerators of each cluster.
You can also use the API Quick Start page as it provides advanced filters for selecting nodes by hardware capability. Alternatively, you can parse the Grid'5000 Reference API yourself to discover the available resources on each site.
Resource reservation
Resource reservation using the OAR scheduler is covered by the Getting Started tutorial and, in more depth, by the Advanced OAR document.
You can select specific hardware by using the "-p" (properties) option of the oarsub command.
The OAR properties available on each site are listed in that site's Monika status page (links in the Status page). For instance, see the Monika job status page for Nancy.
You can combine OAR properties or even use SQL queries for advanced filtering.
Here is a non-exhaustive list of OAR properties for HPC experiments:
- CPU: cpuarch, cpucore, cpufreq, cputype
- Memory (RAM in MB): memnode (memory per node), memcpu (per CPU), memcore (per core)
- Network:
  - eth_count (number of Ethernet interfaces), eth_rate (rate of the fastest Ethernet interface available on the node)
  - ib_count (number of InfiniBand interfaces), ib = {'NO', 'SDR', 'DDR', 'QDR', 'FDR'} (the InfiniBand technology available), ib_rate = {0, 10, 20, 40, 56} (max rate in Gbit/s)
- Accelerator: gpu_model (model of GPU), gpu_count (number of GPUs per node)
For example, by filtering on these properties you can reserve a GPU node at Lyon, or a node with at least 256 GB of RAM at Nancy.
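As a minimal sketch of these two requests, the shell functions below wrap oarsub calls that filter on the gpu_count and memnode properties listed above. The function names are hypothetical, and the resource request (-l host=1) and interactive mode (-I) are assumptions rather than commands taken from this page. 256 GB is written as 256000 because memnode is expressed in MB.

```shell
# Hypothetical helpers wrapping oarsub; call them on the frontend of the
# site where the resources live (Lyon and Nancy in the examples above).

# One full node that has at least one GPU, opened as an interactive job (-I).
reserve_gpu_node() {
    oarsub -p "gpu_count > 0" -l host=1 -I
}

# One full node with at least 256 GB of RAM (memnode is expressed in MB).
reserve_bigmem_node() {
    oarsub -p "memnode >= 256000" -l host=1 -I
}
```

Several conditions can be combined in the same -p expression with SQL operators such as AND.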
Some resources are available only through specific settings:
- production queue, currently only in Nancy: access to production resources is reserved for members of a Gold level Group Granting Access
- exotic resources: these resources are special in some way (non-x86 architecture, exotic or rare hardware), so they are only usable by experimenters who explicitly request resources of type "exotic".
For example, a host with a GPU in Nancy is available in the production queue only, and a reservation on our ARM cluster in Lyon requires the "exotic" type.
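The two cases above could be sketched as follows, using the -q (queue) and -t (job type) options of oarsub. The function names are hypothetical, and selecting the ARM nodes with cpuarch = 'aarch64' is an assumption about the property value to check on the Monika page of the site.

```shell
# Hypothetical helpers; the queue and job type are the ones named above.

# GPU node from the production queue (Nancy frontend, requires membership
# of a Gold level Group Granting Access).
reserve_production_gpu_node() {
    oarsub -q production -p "gpu_count > 0" -l host=1 -I
}

# ARM node (Lyon frontend): exotic resources must be requested explicitly.
# Assumption: ARM nodes report cpuarch = 'aarch64'.
reserve_arm_node() {
    oarsub -t exotic -p "cpuarch = 'aarch64'" -l host=1 -I
}
```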
Reservations are subject to the rules described in the Usage Policy document. Computational quotas on Grid'5000 differ from what HPC centers traditionally offer: in the default queue (available on all sites), the idea is to keep the resources usable by many different people during weekdays, to prepare experiments or run limited ones (no more than 2 core hours of a full cluster), and to use nights and weekends for out-of-quota experiments.
A weekend may feel like a short time for HPC jobs. For longer jobs, Grid'5000 offers the following solutions:
- use, if granted, the production queue: there is no daily quota, but jobs are limited to one week of computation and cannot use a full cluster (e.g. 2/3 of a 64-node cluster)
- use the so-called best-effort mode to submit jobs, available on every site and in every queue; this requires users to adapt their experiments to this mode of computation (more on this below)
Different ways to use Grid'5000 as an HPC production system
Oarsub
A simple way to submit a job on Grid'5000 is to use the oarsub command, which prints the id of the submitted job:
# Filtering out exotic resources (drac, troll, yeti).
OAR_JOB_ID=2646739
(in this example, the job runs the stress -c 32 -t 10 command)
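A submission like the one described above could be scripted as in this sketch, which also keeps the job id for later use. The function names are hypothetical, and since no -l option is given, whatever default resource request OAR applies on the site is used.

```shell
# Extract the job id from oarsub's output, which contains a line such as
# "OAR_JOB_ID=2646739"; every other line (e.g. comments) is ignored.
parse_job_id() {
    sed -n 's/^OAR_JOB_ID=\([0-9][0-9]*\)$/\1/p'
}

# Submit a batch (non-interactive) job and print only its id.
submit_stress_job() {
    oarsub "stress -c 32 -t 10" | parse_job_id
}
```

For instance, `JOB_ID=$(submit_stress_job)` stores the id in a variable that later commands can reuse.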
- Job status
The returned job id can be used to query OAR about the status of the job:
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
2646739                   jdoe           2020-07-13 14:00:56 W default
The W status above means that the job is waiting for its required resources to be available.
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
2646739                   jdoe           2020-07-13 14:00:56 R default
The R status means that the job is currently running.
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
2646739                   jdoe           2020-07-13 14:00:56 T default
And the T status means that the job has terminated.
Two other job statuses are worth mentioning:
- E (error) which indicates that the job has been interrupted because it has exceeded its allocated walltime, or any other unexpected issue
- F (finishing) which means that the job is about to terminate
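When the status has to be read by a script rather than by a human, the state letter can be extracted from a row of the table shown above. This is a minimal sketch with hypothetical function names; it only knows the five letters described on this page and reports anything else as unknown.

```shell
# Print the state letter found in one data row of the status table.
# The Name column may be empty, so fields are counted from the end of the
# row: the state is always the second-to-last field.
job_state_letter() {
    awk 'NF >= 2 { print $(NF - 1) }'
}

# Translate a state letter into the meaning given in the text above.
describe_state() {
    case "$1" in
        W) echo "waiting" ;;
        R) echo "running" ;;
        F) echo "finishing" ;;
        T) echo "terminated" ;;
        E) echo "error" ;;
        *) echo "unknown" ;;
    esac
}
```

Assuming the table is obtained with `oarstat -j <job id>`, something like `describe_state "$(oarstat -j 2646739 | tail -n 1 | job_state_letter)"` would print the state in words.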
- Job stdout and stderr streams
The standard output and standard error of the job are stored in two files named after the job id returned by OAR. The standard output file, for instance, contains:
stress: info: [7005] dispatching hogs: 32 cpu, 0 io, 0 vm, 0 hdd
stress: info: [7005] successful run completed in 10s
The standard error file is empty because the command wrote nothing to the standard error stream.
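The sketch below shows one way to inspect both files once the job has terminated. It assumes OAR's default naming scheme, OAR.<job id>.stdout and OAR.<job id>.stderr, with the files created in the directory the job was submitted from; the function names are hypothetical.

```shell
# Assumption: default OAR file names, relative to the submission directory.
job_stdout_file() { echo "OAR.$1.stdout"; }
job_stderr_file() { echo "OAR.$1.stderr"; }

# Print the job's standard output, then report whether anything was
# written to its standard error file (-s is true for a non-empty file).
show_job_output() {
    cat "$(job_stdout_file "$1")"
    if [ -s "$(job_stderr_file "$1")" ]; then
        echo "stderr is not empty, see $(job_stderr_file "$1")"
    fi
}
```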
GNU Parallel on top of OAR
GNU Parallel is a well-known tool that can be used to execute tasks (shell scripts, executables) in parallel, on one or multiple hosts.
To a certain extent, GNU Parallel can be used with no particular assumption about the underlying infrastructure. But when it is combined with the OAR properties of the reserved resources, tasks can be distributed efficiently over several hosts, thanks to OAR's knowledge of the infrastructure.
Using GNU Parallel on top of OAR is described in detail in our GNU Parallel tutorial.
Compared to the other solutions described below, GNU Parallel is well suited to parallelizing tasks on Grid'5000 because it needs only one reservation (minimal overhead), and it takes care of distributing the computing load over the reserved resources. But as always, there is no silver bullet, and in some use cases (for instance when the number of tasks of an experiment is only discovered at runtime) GNU Parallel is not the best tool at hand.
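To give an idea of the approach, here is a minimal sketch meant to run inside a single multi-node OAR job. It relies on the OAR_NODEFILE environment variable and on GNU Parallel's --sshloginfile and --ssh options. The task script path and the function names are hypothetical, and using oarsh as the remote connector is an assumption; the GNU Parallel tutorial remains the reference.

```shell
# Print each reserved host once. Inside an OAR job, $OAR_NODEFILE names a
# file with one line per reserved core, so host names are repeated.
reserved_hosts() {
    if [ -z "${OAR_NODEFILE:-}" ]; then
        echo "OAR_NODEFILE is not set: run this inside an OAR job" >&2
        return 1
    fi
    sort -u "$OAR_NODEFILE"
}

# Run "$HOME/task.sh <arg>" (hypothetical script) once per argument,
# letting GNU Parallel spread the runs over the reserved hosts.
run_campaign() {
    hosts_file=$(mktemp) || return 1
    reserved_hosts > "$hosts_file" || { rm -f "$hosts_file"; return 1; }
    # Assumption: nodes are reached with OAR's oarsh connector.
    parallel --ssh oarsh --sshloginfile "$hosts_file" "$HOME/task.sh" {} ::: "$@"
    status=$?
    rm -f "$hosts_file"
    return "$status"
}
```

Calling `run_campaign input1 input2 input3` from the job's shell would then start one task per input. The script lives in the home directory so that every node sees it.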
Besteffort
Best-effort jobs can be used to run experiments outside of usage policy quota or time restriction, on any resources, including those in the production queue.
The tradeoff is that a best-effort job can be interrupted at any time (killed) if its resources are needed by a regular job.
It is up to the user to decide how to manage this situation:
- (A) regularly checkpoint the state of the experiment, in order to restart it manually from the last known good state
- add a checkpoint/restart mechanism to the experiment so that it can be resumed automatically, using either:
  - (B) the OAR idempotent/checkpointing functionalities (OAR will restart the experiment until the experiment declares the computation terminated)
  - (C) an external custom task scheduler designed to break the experiment up into multiple best-effort jobs
The choice between these ways of using the best-effort mechanism depends on how much the user is ready to invest in it and on the nature of the experiment.
Option (A) can be sufficient if the experiment can simply be monitored by a human. Its advantage is that, at most, only the experiment has to be modified so that it produces regular checkpoints of its state (e.g. the state can be saved as a transaction in a relational database hosted on a user virtual machine). Some tools already have a state-saving mechanism in place and can be interrupted at any time.
Option (B) requires closer cooperation between OAR and the experiment. The experiment must be able: 1) to catch a signal sent by OAR, which triggers the checkpointing mechanism before the job is halted, and 2) to restart from the last known good state using the same command line as initially provided to oarsub (idempotent). This is relatively easy to do; the effort required to implement option (B) in an experiment is similar to option (A), because the difficult part is implementing a checkpoint/restart mechanism in the first place.
Option (C) is more complex to set up, but it may be a good investment for a team to design a best-effort task scheduler to manage large-scale campaigns of parallel best-effort experiments. A very successful example of this on Grid'5000 is the work done by the Caramba Team for their 2020 records: the factorization of RSA-240, a 795-bit number, and a discrete logarithm computation over a 795-bit prime field (see https://ia.cr/2020/697, section 4.2).
The interested reader can refer to this section of the Advanced OAR usage document (https://www.grid5000.fr/w/Advanced_OAR#Using_best_effort_mode_jobs) to know more about how to use best-effort jobs on Grid'5000.
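The two requirements of option (B) can be illustrated with the following minimal sketch of a checkpoint-aware experiment script. It rests on assumptions to check against the Advanced OAR document: that the checkpoint signal is SIGUSR2, and that an idempotent job is resubmitted when it exits with code 99. The state file location and the do_one_step placeholder are hypothetical.

```shell
# Assumptions: checkpoint signal = SIGUSR2, resubmission exit code = 99.
# The state file path and do_one_step are hypothetical placeholders.
STATE_FILE="${STATE_FILE:-$HOME/experiment.state}"
TOTAL_STEPS="${TOTAL_STEPS:-1000}"
RESUBMIT_EXIT_CODE=99

# Placeholder for one restartable unit of work.
do_one_step() {
    sleep 1
}

# Number of steps already completed, or 0 on the first run.
load_step() {
    if [ -f "$STATE_FILE" ]; then cat "$STATE_FILE"; else echo 0; fi
}

# Signal handler: record progress, then exit so that OAR can resubmit.
checkpoint_and_exit() {
    echo "$step" > "$STATE_FILE"
    exit "$RESUBMIT_EXIT_CODE"
}

# Same entry point for the first run and for every restart (idempotent).
run_experiment() {
    step=$(load_step)
    trap checkpoint_and_exit USR2
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        do_one_step "$step"
        step=$((step + 1))
    done
    rm -f "$STATE_FILE"   # finished: nothing left to resume
}
```

Saved as a script ending with a call to run_experiment, this could be submitted with something like `oarsub -t besteffort -t idempotent --checkpoint 60 ./experiment.sh`, where 60 is an arbitrary delay in seconds. The shell runs the handler only once the current step has returned, so steps should be short compared to that delay.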
Job containers
The job container type allows the user to reserve a set of resources of interest with a single oarsub command, and then launch multiple (inner) jobs within the job container. More information on this type of job can be found in the Advanced OAR document.
Job containers are mainly useful for reserving resources for tutorials or teaching labs (see the Tutorial or teaching labs How-To), because they allow reserving/sharing resources for/with multiple users. However, they can be handy for HPC tasks whenever the jobs of an experiment have to be submitted on-line (not all at once at the beginning), since that use case is not supported using GNU Parallel.
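As a sketch of this workflow, the functions below reserve a container and later submit inner jobs into it, using the container and inner=<job id> job types of oarsub. The function names, the resource amounts and the placeholder command that keeps the container alive are assumptions.

```shell
# Reserve the container: $1 nodes for walltime $2 (e.g. 4 and 3:00:00).
# The command only keeps the container job alive until its walltime.
submit_container() {
    oarsub -t container -l "nodes=$1,walltime=$2" "sleep infinity"
}

# Submit a job inside container job $1, running command $2 on one node.
submit_inner() {
    oarsub -t "inner=$1" -l nodes=1 "$2"
}
```

Inner jobs can be submitted at any time while the container is running, as long as their resources and walltime fit inside it.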
Gotchas and good practices
Jobs requiring root privileges
Making use of root privileges (via sudo-g5k or after kadeploying) on a node implies costly operations such as rebooting or even redeploying the reserved node after usage. This is perfectly fine as Grid'5000 is designed for that purpose, but depending on the nature of the tasks to perform, it might be overkill. Indeed, root privileges can often be avoided by using several simple techniques that are especially relevant in the HPC context:
- Grid'5000 provides Modules: you may just have to look at what modules are available to add some software components in your runtime without becoming root, just like in many classical HPC platforms. Furthermore, if the Modules library does not include a commonly used module you need, don't hesitate to ask support-staff@lists.grid5000.fr to add it.
- Tools like Conda/Bioconda (multiple languages), Virtualenv (Python), sdkman (Java ecosystem), rvm (Ruby), etc. can be used to install packages in the home directory without root privileges. Since home directories are shared via NFS, such a user installation is available on all nodes.
- Singularity, which is available through an Environment Module on Grid'5000, is also a great way to build, deploy and run elaborate pieces of software without requiring root privileges on the node.
To sum up: remember that classical HPC platforms do not provide access to root privileges; hence, a classic HPC usage typically should not require them. However, a strength of Grid'5000 that differentiates it from other HPC-capable platforms is that it allows acquiring root privileges on the bare-metal machines if some tasks need them.
Short lived OAR jobs
It is bad practice on Grid'5000 to submit a large number of small OAR jobs, typically lasting less than 10 minutes. Indeed, the overhead can be significant compared to the elapsed time of each job, especially if each job causes a reboot or a redeployment of the system (jobs of type deploy, or using sudo-g5k). In addition, this can slow down the OAR server and the services that depend on it, such as the Gantt web interface.
The best option is to use GNU Parallel on top of OAR: submit OAR jobs at a coarser grain (more than 10 minutes, preferably on full nodes rather than single cores) and run the smaller tasks (in duration and resource granularity) within those OAR jobs.
Also, if a job requires modifying the nodes in privileged mode (root), it is preferable to run longer jobs that perform the deployment or use sudo-g5k once at the beginning and then launch the smaller tasks, rather than having every small job repeat the initial privileged operations. Note that when reproducibility is at stake, one may want to reboot or redeploy before every run of a task, but this does not require a new job: in a single deploy job, one can use kareboot and kadeploy as many times as needed.
Full node reservation vs smaller resource reservation
In order to limit resource fragmentation, and since many jobs on Grid'5000 require full nodes (a requirement for a deployment or for using sudo-g5k), it is advised to submit jobs using full nodes rather than smaller resources such as cores. Reserving at a finer granularity is wise only when a user wants just 1 core or, more relevantly, 1 GPU for a single run, so that unused resources within the job are not wasted.
Whenever task execution is spread at the GPU or core grain over many of those resources, it is wise to use coarser-grain OAR jobs (full nodes) and to run the small tasks with GNU Parallel inside the job.
Big datasets
The user quota on each site frontend ranges from 25GB to a maximum of 200GB (the User Management Service can be used to ask for more disk space). It is worth noting that users' home directories are located on an NFS file server and, as such, are accessible from every reserved node when using an NFS-capable environment.
Above that limit of 200GB, it is possible to use the Group Storage service, also based on NFS, which is very flexible, and designed to encourage team data sharing on a larger scale (multiple TBs).
User data on compute nodes
Data for an experiment can be transferred to, and produced directly on, the local disks of the reserved nodes.
In all cases, user data created on the node is deleted after the reservation: either all the user's files in the user-writable directories (e.g. /tmp) are removed, when the node has been used without root rights, or the full node is redeployed at the end of the reservation, when the node has been deployed by the user or sudo-g5k has been used.
On some clusters, it is also possible to reserve dedicated local disks for up to 2 weeks, independently of the node reservation itself (of course it is possible to reserve both the node and its local reservable disks at the same time).
It is worth noting that these local disks are not meant to be accessible to users who have not reserved them. A skilled and malicious user could nevertheless bypass this limitation; if that happens, don't hesitate to report the situation to support-staff@lists.grid5000.fr.
Backup
Grid'5000 does not offer a user data backup service, so you have to take care of this by yourself.
File transfers between Grid'5000 and HPC centers
It is of course more efficient to transfer data directly from Grid'5000 to the HPC center you are using than to let the data transit via your lab (whether the transfer is done in two steps or in a single one, for instance with scp -3).
Note that Grid'5000 allows anybody to initiate data transfers using the SCP/SFTP protocols (e.g. with rsync) to a target on the Internet. However, you may encounter difficulties on the target side, due to security constraints imposed by the HPC center.
We provide an entry in our FAQ to describe how to proceed in the case of the Jean Zay supercomputer (and probably other GENCI supercomputers).
Please don't hesitate to send an e-mail to support-staff@lists.grid5000.fr about your data transfer experience with an HPC center (successful or not); we will add it to the FAQ.
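A direct transfer initiated from a Grid'5000 frontend could look like the following sketch. The function name and the remote target are hypothetical, and the HPC center may require additional steps (for example registering the source address) before it accepts the connection.

```shell
# Push a results directory from a Grid'5000 frontend to a remote HPC center.
# $1: local directory; $2: remote target such as login@host:/path.
push_results() {
    # -a keeps permissions and timestamps, -z compresses data in transit,
    # --partial keeps partially transferred files so a retry can resume.
    rsync -az --partial -e ssh "$1/" "$2"
}
```

For example, `push_results "$HOME/results" jdoe@hpc.example.org:/scratch/jdoe/results` copies the content of the results directory to the remote path.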
Sensitive data
Remember that sensitive data cannot be processed by Grid'5000 nodes unless specific safeguards are used.
Using HPC hardware and software on Grid'5000
Usual HPC software and hardware can be used on Grid'5000; the following list of tutorials may help the interested reader to quickly use them:
- Using Accelerators on Grid'5000
- Installing Deep Learning Frameworks on Grid'5000
- Running MPI applications on Grid'5000
- Using software provided as Modules
- Installing software with Conda
- Using Guix package manager
- SPMD / Many Task computing using GNU Parallel on top of OAR
- Using Singularity (thanks to Modules)