HPC and HTC tutorial: Difference between revisions

Line 1: Line 1:
Grid'5000 gives easy access to a wide variety of hardware technologies and is particularly suitable for carrying out HPC (high performance computing) experiments: users can investigate parallel algorithms, scalability problems or performance portability on Grid'5000. Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no
'''Grid'5000''' gives easy access to a '''wide variety of hardware''' technologies and is particularly suitable for '''carrying out HPC (high performance computing) experiments''': users can investigate '''parallel algorithms''', '''scalability problems''' or '''performance portability''' on Grid'5000. Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations. In particular, Grid'5000 has a job scheduling policy that allows '''advance reservation of resources''', which is useful for setting up an experiment on your own schedule. You can also reinstall cluster nodes and '''gain root access''' for the duration of your jobs using [[Advanced_Kadeploy|Kadeploy]]. This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU), or simply ensure the reproducibility of your experiments by freezing their context. In addition, Grid'5000 provides a set of tools for [[Monitoring|monitoring]] experiments that you might find especially useful for detecting problems such as network contention in distributed algorithms.
ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations. In particular, Grid'5000 has a job scheduling policy that allows advance reservation of resources, which is useful for setting up an experiment on your own schedule. You can also reinstall cluster nodes and gain root access for the duration of your jobs using [[Advanced_Kadeploy Kadeploy]].
This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU), or simply ensure the reproducibility of your experiments by freezing their context. In addition, Grid'5000 provides a set of tools for [[Monitoring|monitoring]] experiments that you might find especially useful for detecting problems such as network contention in distributed algorithms.


= Discovering HPC resources on Grid'5000 =
= Discovering HPC resources on Grid'5000 =


The easiest way to get the global picture of the HPC systems available on Grid'5000 is to consult the [[Special:G5KHardware]] page.
The easiest way to get the global picture of the '''HPC systems available on Grid'5000''' is to consult the [[Special:G5KHardware|Hardware]] page.
This page is built using the [[API_all_in_one_Tutorial|Grid'5000 Reference API]] and describes in detail the CPU models, network interfaces and accelerators of each cluster.
This page is built using the [[API_all_in_one_Tutorial|Grid'5000 Reference API]] and describes in detail the '''CPU''' models, '''network interfaces''' and '''accelerators''' of each cluster.
You can also use the [https://api.grid5000.fr/sid/ui/quick-start.html API Quick Start page], as it provides advanced filters for selecting nodes by hardware capability.
You can also use the [https://api.grid5000.fr/sid/ui/quick-start.html API Quick Start page], as it provides advanced filters for selecting nodes by hardware capability.
Alternatively, you can parse the [[API_all_in_one_Tutorial|Grid'5000 Reference API]] yourself to discover the available resources on each site.
Alternatively, you can parse the [[API_all_in_one_Tutorial|Grid'5000 Reference API]] yourself to discover the available resources on each site.
Line 12: Line 10:
= Resource reservation on Grid'5000 =
= Resource reservation on Grid'5000 =


Resource reservation using the OAR scheduler is covered by the [[Getting_Started|Getting Started]] tutorial. You can select specific hardware by using the "-p" (properties) option of the oarsub command.
Resource reservation using the '''OAR scheduler''' is covered by the [[Getting_Started|Getting Started]] tutorial. You can select specific hardware by using the "-p" (properties) option of the oarsub command.
The properties available on each site are listed on the Monika pages linked from the [[Status]] page. For instance, see the [https://intranet.grid5000.fr/oar/Nancy/monika.cgi Monika page for Nancy].
The '''properties''' available on each site are listed on the '''Monika''' pages linked from the [[Status]] page. For instance, see the [https://intranet.grid5000.fr/oar/Nancy/monika.cgi Monika page for Nancy].
You can combine OAR properties or even use SQL queries for advanced filtering.
You can combine OAR properties or even use SQL queries for advanced filtering.


Here is a non-exhaustive list of OAR properties for HPC experiments:
Here is a non-exhaustive list of OAR properties for HPC experiments:
    * CPU: cpuarch, cpucore, cpufreq, cputype
* CPU: cpuarch, cpucore, cpufreq, cputype
    * Memory (RAM in MB): memnode (memory per node), memcpu (per CPU), memcore (per core)
* Memory (RAM in MB): memnode (memory per node), memcpu (per CPU), memcore (per core)
    * Network: ethnb (number of network interfaces), eth10g, myri10g (Myrinet), ib20g (InfiniBand), ib40g
* Network: ethnb (number of network interfaces), eth10g, myri10g (Myrinet), ib20g (InfiniBand), ib40g
    * Accelerator: gpu (YES/SHARED/NO), gpu_count (number of GPUs per node), mic (YES/NO)
* Accelerator: gpu (YES/SHARED/NO), gpu_count (number of GPUs per node), mic (YES/NO)


For example, you can make a reservation at Lyon for a GPU node using:
For example, you can make a reservation at Lyon for a GPU node using:
Line 30: Line 28:
= Using Grid'5000 resources as an HPC production system =
= Using Grid'5000 resources as an HPC production system =


The primary purpose of Grid'5000 is to serve as a testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing including Cloud, HPC and Big Data. However, Grid'5000 offers such a large amount of resources that its idle resources can be used for more production-oriented projects. These include [https://en.wikipedia.org/wiki/High-throughput_computing HTC (High-throughput computing)] projects requiring the execution of a large number of loosely-coupled tasks. This usage of Grid'5000 is only allowed for projects connected with computer science research, and the jobs must be submitted using the "besteffort" mode of the scheduler.
The primary purpose of Grid'5000 is to serve as a testbed for '''experiment-driven research in all areas of computer science''', with a focus on parallel and distributed computing including Cloud, HPC and Big Data. However, Grid'5000 offers such a '''large amount of resources''' that its idle resources can be used for more '''production-oriented projects'''. These include [https://en.wikipedia.org/wiki/High-throughput_computing HTC (High-throughput computing)] projects requiring the execution of a '''large number of loosely-coupled tasks'''. This usage of Grid'5000 is only allowed for projects connected with computer science research, and the jobs must be submitted using the '''besteffort''' mode of the scheduler.


Besteffort jobs are executed on idle resources and are killed when a regular job requests the resources. Therefore, besteffort jobs need to be monitored. In particular, long-running jobs should implement a checkpointing mechanism. If your job is of the type besteffort and idempotent (oarsub -t besteffort -t idempotent) and is killed by the OAR scheduler, then another job is automatically created and put in the queue with the same configuration (note that your job is also resubmitted if the exit code of your program is 99).
Besteffort jobs are executed on idle resources and are killed when a regular job requests the resources. Therefore, besteffort jobs need to be monitored. In particular, long-running jobs should implement a checkpointing mechanism. If your job is of the type '''besteffort and idempotent''' (oarsub -t besteffort -t idempotent) and is killed by the OAR scheduler, then another job is automatically created and put in the queue with the same configuration (note that your job is also resubmitted if the exit code of your program is 99).


Applications that submit a large number of tasks (i.e. Bag-of-Tasks campaigns) should consider using CiGri, a grid middleware that dispatches the workload over the whole Grid'5000 infrastructure and distributes the idle resources equally among its users without overloading the infrastructure.
Applications that submit a large number of tasks (i.e. Bag-of-Tasks campaigns) should consider using '''CiGri''', a grid middleware that dispatches the workload over the whole Grid'5000 infrastructure and distributes the idle resources equally among its users without overloading the infrastructure.


= Using HPC hardware on Grid'5000 =
= Using HPC hardware on Grid'5000 =


The rest of this tutorial is divided into distinct parts that can be done in any order:
The rest of this tutorial is divided into distinct parts that can be done in any order:
* Using [[Accelerators_on_Grid5000]]
* [[Accelerators_on_Grid5000|Using Accelerators on Grid'5000]]
* [[Run_MPI_On_Grid'5000 Running MPI applications on Grid'5000]]
* [[Run_MPI_On_Grid'5000|Running MPI applications on Grid'5000]]
* [[CiGri Running multi-parametric experiments with the CiGri middleware]]
* [[CiGri|Running multi-parametric experiments with the CiGri middleware]]

Revision as of 14:57, 21 January 2016

Grid'5000 gives easy access to a wide variety of hardware technologies and is particularly suitable for carrying out HPC (high performance computing) experiments: users can investigate parallel algorithms, scalability problems or performance portability on Grid'5000. Whereas HPC production systems generally have rather rigid restrictions (no root access, no possibility to install system-wide software, no ssh connection to the compute nodes, no internet access...), Grid'5000 does not suffer from these common limitations. In particular, Grid'5000 has a job scheduling policy that allows advance reservation of resources, which is useful for setting up an experiment on your own schedule. You can also reinstall cluster nodes and gain root access for the duration of your jobs using Kadeploy. This can be used to control the entire software stack, experiment with runtime environments, fine-tune network parameters (e.g. MTU), or simply ensure the reproducibility of your experiments by freezing their context. In addition, Grid'5000 provides a set of tools for monitoring experiments that you might find especially useful for detecting problems such as network contention in distributed algorithms.

Discovering HPC resources on Grid'5000

The easiest way to get the global picture of the HPC systems available on Grid'5000 is to consult the Hardware page. This page is built using the Grid'5000 Reference API and describes in detail the CPU models, network interfaces and accelerators of each cluster. You can also use the API Quick Start page, as it provides advanced filters for selecting nodes by hardware capability. Alternatively, you can parse the Grid'5000 Reference API yourself to discover the available resources on each site.
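If you parse the Reference API yourself, the filtering step might look like the sketch below. It is only an illustration: the sample data is inlined rather than fetched, and the field names (`uid`, `ram_mb`, `gpu_count`, `network`) and the `select_nodes` helper are hypothetical, so they would need to be adapted to the actual JSON schema returned by the API.

```python
import json

# Hypothetical sample shaped like a list of node descriptions.
# The real Reference API schema may use different field names.
SAMPLE_NODES = json.loads("""
[
  {"uid": "node-1", "ram_mb": 131072, "gpu_count": 0, "network": ["eth10g"]},
  {"uid": "node-2", "ram_mb": 524288, "gpu_count": 2, "network": ["eth10g", "ib40g"]},
  {"uid": "node-3", "ram_mb": 262144, "gpu_count": 4, "network": ["eth10g"]}
]
""")


def select_nodes(nodes, min_ram_mb=0, min_gpus=0, interconnect=None):
    """Return the uids of nodes that meet every requested hardware constraint."""
    selected = []
    for node in nodes:
        if node["ram_mb"] < min_ram_mb:
            continue  # not enough memory
        if node["gpu_count"] < min_gpus:
            continue  # not enough GPUs
        if interconnect is not None and interconnect not in node["network"]:
            continue  # required interconnect is missing
        selected.append(node["uid"])
    return selected


if __name__ == "__main__":
    # Nodes with at least one GPU
    print(select_nodes(SAMPLE_NODES, min_gpus=1))
    # Large-memory nodes that also have the (hypothetical) "ib40g" entry
    print(select_nodes(SAMPLE_NODES, min_ram_mb=256000, interconnect="ib40g"))
```

Keeping the filter as a plain function over already-decoded JSON makes it easy to reuse, whether the descriptions come from a live API call or from a saved copy.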

Resource reservation on Grid'5000

Resource reservation using the OAR scheduler is covered by the Getting Started tutorial. You can select specific hardware by using the "-p" (properties) option of the oarsub command. The properties available on each site are listed on the Monika pages linked from the Status page. For instance, see the Monika page for Nancy. You can combine OAR properties or even use SQL queries for advanced filtering.

Here is a non-exhaustive list of OAR properties for HPC experiments:

  • CPU: cpuarch, cpucore, cpufreq, cputype
  • Memory (RAM in MB): memnode (memory per node), memcpu (per CPU), memcore (per core)
  • Network: ethnb (number of network interfaces), eth10g, myri10g (Myrinet), ib20g (InfiniBand), ib40g
  • Accelerator: gpu (YES/SHARED/NO), gpu_count (number of GPUs per node), mic (YES/NO)

For example, you can make a reservation at Lyon for a GPU node using:

flyon: oarsub -I -p "gpu='YES'"

Or get a node with at least 256 GB of RAM at Nancy:

fnancy: oarsub -I -p "memnode>256000"
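Since the "-p" option accepts SQL-style expressions, several property conditions can be combined into a single filter. The sketch below assembles such a command line from the two conditions used above. The helper names `build_property_filter` and `build_oarsub_command` are hypothetical, and joining with `AND` assumes that the property expression follows SQL WHERE-clause syntax.

```python
import shlex


def build_property_filter(conditions):
    """Join individual OAR property conditions with SQL AND."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return " AND ".join(conditions)


def build_oarsub_command(conditions, interactive=True):
    """Return the oarsub argument list for the given property conditions."""
    command = ["oarsub"]
    if interactive:
        command.append("-I")  # interactive job, as in the examples above
    command += ["-p", build_property_filter(conditions)]
    return command


if __name__ == "__main__":
    # A GPU node that also satisfies the memory condition
    cmd = build_oarsub_command(["gpu='YES'", "memnode>256000"])
    # shlex.join quotes the filter so it can be pasted into a shell
    print(shlex.join(cmd))
```

Building the command as an argument list, and quoting it only when printing, avoids shell-escaping mistakes with the nested quotes that property values such as `'YES'` require.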

Using Grid'5000 resources as an HPC production system

The primary purpose of Grid'5000 is to serve as a testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing including Cloud, HPC and Big Data. However, Grid'5000 offers such a large amount of resources that its idle resources can be used for more production-oriented projects. These include HTC (High-throughput computing) projects requiring the execution of a large number of loosely-coupled tasks. This usage of Grid'5000 is only allowed for projects connected with computer science research, and the jobs must be submitted using the "besteffort" mode of the scheduler.

Besteffort jobs are executed on idle resources and are killed when a regular job requests the resources. Therefore, besteffort jobs need to be monitored. In particular, long-running jobs should implement a checkpointing mechanism. If your job is of the type besteffort and idempotent (oarsub -t besteffort -t idempotent) and is killed by the OAR scheduler, then another job is automatically created and put in the queue with the same configuration (note that your job is also resubmitted if the exit code of your program is 99).
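A checkpointing mechanism can be as simple as recording the index of the next task after each completed task, so that a resubmitted job resumes where the killed one stopped. The following is a minimal sketch under that assumption. The checkpoint file layout, the function names and the `budget` parameter (used here only to simulate an early stop) are illustrative and not part of OAR.

```python
import json
import os
import tempfile

# Per the text above, a program exiting with code 99 is resubmitted.
RESUBMIT_EXIT_CODE = 99


def load_checkpoint(path):
    """Return the index of the next task to run, or 0 when no checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)["next_task"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_task):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"next_task": next_task}, handle)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def run_tasks(path, total_tasks, work, budget=None):
    """Run tasks from the checkpoint onward, stopping after `budget` tasks if set.

    Returns True when every task is done, False when work remains.
    """
    done_now = 0
    for task in range(load_checkpoint(path), total_tasks):
        if budget is not None and done_now >= budget:
            return False
        work(task)
        save_checkpoint(path, task + 1)
        done_now += 1
    return True


def exit_code_for(finished):
    """Map completion status to an exit code: 0 when done, 99 to be resubmitted."""
    return 0 if finished else RESUBMIT_EXIT_CODE
```

A job script would typically end with something like `sys.exit(exit_code_for(run_tasks("checkpoint.json", n, work)))`. The work function must itself be safe to repeat, because a job killed between `work(task)` and `save_checkpoint` will run that task again after resubmission.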

Applications that submit a large number of tasks (i.e. Bag-of-Tasks campaigns) should consider using CiGri, a grid middleware that dispatches the workload over the whole Grid'5000 infrastructure and distributes the idle resources equally among its users without overloading the infrastructure.

Using HPC hardware on Grid'5000

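The rest of this tutorial is divided into distinct parts that can be done in any order:

  • Using Accelerators on Grid'5000
  • Running MPI applications on Grid'5000
  • Running multi-parametric experiments with the CiGri middleware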