Production



Introduction

The Nancy and Rennes Grid'5000 sites also host clusters for production use (including clusters with GPUs). See Nancy:Hardware and Rennes:Hardware for details.

The usage rules differ from the rest of Grid'5000:

  • Advance reservations (oarsub -r) are not allowed (to avoid fragmentation). Only submissions (and reservations that start immediately) are allowed.
  • All Grid'5000 users can use those nodes (provided they meet the conditions stated in Grid5000:UsagePolicy). However, users outside of LORIA / Centre Inria Nancy -- Grand Est and IRISA / Centre Inria de l'Université de Rennes are expected to use their own local production resources first, and to use these resources mostly for tasks that require Grid'5000 features. Examples of local production clusters are Cleps (Paris), Margaret (Saclay), Plafrim (Bordeaux), etc.

Using the resources

Getting an account

Users from the Loria laboratory (LORIA/Centre Inria Nancy Grand-Est) and the Irisa laboratory (IRISA/Centre Inria de l'Université de Rennes) who want to access Grid'5000 primarily for production usage must use the request form to open an account, like regular Grid'5000 users.

  • The following fields must be filled in as follows:
    • Group Granting Access (GGA): the group named after your research team or, if your team is not in the team list below, loria (for Nancy) or igrida (for Rennes).
    • Laboratory: LORIA or IRISA
    • Team: INTUIDOC, SYNALP, LACODAM, MULTISPEECH, SERPICO, CARAMBA, CAPSID, SIROCCO, ORPAILLEUR, LARSEN, CIDRE, SEMAGRAMME, LINKMEDIA, SISR, TANGRAM...

Other users from Nancy (not belonging to the Loria laboratory) can ask to join using the nancy-misc Group Granting Access while other users from Rennes (not belonging to the Irisa laboratory) can ask to join using the rennes-misc Group Granting Access.

  • Users are automatically subscribed to the Grid'5000 users mailing list: users@lists.grid5000.fr. This list is the user-to-user and user-to-admin communication channel for help and support requests about Grid'5000. The technical team can be reached at support-staff@lists.grid5000.fr.

Learning to use Moyens de Calcul hosted by Grid'5000

Refer to the Production:Getting Started Production tutorial (derived from the Getting Started Grid'5000 tutorial). Other tutorials are listed on the Users Home page.

Using deep learning software on Grid'5000

A tutorial for using deep learning software on Grid'5000, written by Ismael Bada, is also available.

Using production resources

To access production resources, you need to submit jobs to the production queue using the oarsub option -q production. Job submissions in the production queue are prioritized based on who funded the hardware. There are four levels of priority, each with a maximum job duration:

  • p1 -- 168h (one week)
  • p2 -- 96h (four days)
  • p3 -- 48h (two days)
  • p4 -- 24h (one day)
  • You may also have access to the clusters on besteffort.
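
The limits above can be captured in a small helper, for instance to check a requested walltime before calling oarsub. This is only a sketch: the function names queue_max_hours and fits_queue are hypothetical and are not part of OAR; only the durations come from the list above.

```shell
# Maximum job duration (in hours) for each production priority level,
# as listed above. Function names are illustrative, not OAR commands.
queue_max_hours() {
    case "$1" in
        p1) echo 168 ;;
        p2) echo 96 ;;
        p3) echo 48 ;;
        p4) echo 24 ;;
        *)  echo "unknown priority level: $1" >&2; return 1 ;;
    esac
}

# Succeeds when a walltime (in whole hours) fits within the level's limit.
fits_queue() {
    max=$(queue_max_hours "$1") || return 1
    [ "$2" -le "$max" ]
}

if fits_queue p2 90; then
    echo "90h fits in p2"
else
    echo "90h is too long for p2"
fi
```

A 90-hour job fits in p2 (96h) but would be rejected by this check for p3 (48h).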


Note.png Note

Moreover, with p1 priority, users can submit advance reservations. More information is available on the Advanced OAR page. For example, to make a reservation starting one week from now:

Terminal.png fnancy:
oarsub -q p1 -r "$(date +'%F %T' --date='+1 week')"
The p1 priority level also allows extending the duration of a job. The extension is only applied 24h before the end of the job and cannot be longer than 168h. More information about this feature can also be found on the Advanced OAR page.

Warning.png Warning

These limits DO NOT replace the maximum walltime per node, which is still in effect.

You can check your priority level for any cluster using https://api.grid5000.fr/explorer.

Note.png Note

As of today, the resources explorer only shows basic information. Additional information will be added in the near future.

When submitting a job, by default, you will be placed at the highest priority level that allows you to maximize resources:

Terminal.png fnancy:
oarsub -q production -I

Using the command above will generally place your job at the lowest priority to allow usage of all clusters, even those where your priority is p4.
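
To illustrate the rule described above: when a job may run on any cluster, its effective level is the lowest one among your per-cluster levels. The sketch below is plain shell, not an OAR feature; lowest_priority is a hypothetical helper, and the levels passed to it are made-up input that you would read from the resources explorer.

```shell
# Print the lowest priority level (largest number) among the arguments,
# e.g. your levels on each production cluster. Illustrative only.
lowest_priority() {
    lowest=0
    for level in "$@"; do
        n=${level#p}            # strip the leading "p": p3 -> 3
        if [ "$n" -gt "$lowest" ]; then
            lowest=$n
        fi
    done
    echo "p$lowest"
}

lowest_priority p1 p3 p4   # prints p4
```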


When you specify a cluster, your job will be set to your highest priority level for that cluster:

Terminal.png fnancy:
oarsub -q production -p grele -I


You can also restrict a job submission to clusters at a specific priority level using -q followed by that priority level (for example -q p2):

Terminal.png fnancy:
oarsub -q p2 -l nodes=2,walltime=90 './yourScript.py'

Dashboards and status pages

Nancy

Rennes

Contact information and support

For support, see the Support page.

Contacts:

FAQ

Data storage

Research teams, members of different teams, and individuals can request Group storages to manage their data at the team level. The main benefit of Group storages is that they let group members share their data (corpora, datasets, results, ...) and easily overcome the quota restrictions of the home directories.

Please remember that NFS servers (the home directories are also served by an NFS server) are quite slow at processing a large number of small files during a computation. If this is your case, consider doing most of your I/O on the nodes and copying the results back to the NFS server at the end of the experiment.
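
The pattern described above can be sketched as follows: write the many small files to node-local scratch space, then copy a single archive back to NFS. The paths are assumptions to adapt to your setup (the scratch location under TMPDIR or /tmp, the RESULTS_DIR variable and the archive name are all hypothetical), and the loop stands in for a real computation.

```shell
# Work in node-local scratch space, then copy a single archive back.
# Paths are assumptions: adapt scratch and results_dir to your setup.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX")
results_dir="${RESULTS_DIR:-$HOME}"

# Stand-in for a computation that writes many small files.
i=1
while [ "$i" -le 100 ]; do
    echo "result $i" > "$scratch/part_$i.txt"
    i=$((i + 1))
done

# One large write to NFS instead of 100 small ones.
archive="$results_dir/results_$$.tar.gz"
tar -czf "$archive" -C "$scratch" .
rm -rf "$scratch"
echo "results saved to $archive"
```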

See here for the other kinds of storage available on the platform.

Nancy

Group storages are used to control access to the storage spaces located on the storage[1-5].nancy.grid5000.fr NFS servers (more information about the maximum capacity of each of these servers can be found here). Ask your GGA leader whether your team has access to one or more storage spaces (this is the case, for instance, for the following teams: Bird, Capsid, Caramba, Heap, Multispeech, Optimist, Orpailleur, Semagramme, Sisr, Synalp, Tangram).

Rennes

Group storages are used to control access to the storage spaces located on the storage2.rennes.grid5000.fr NFS server (more information about the maximum capacity of this server can be found here). Ask your GGA leader whether your team has access to one or more storage spaces (this is the case, for instance, for the following teams: cidre and sirocco (compactdisk storage)).

I am physically located in the LORIA/IRISA building, is there a shorter path to connect?

If you are located in the LORIA/IRISA building, you can benefit from a direct connection that does not go through the Grid'5000 national access machines (access-south and access-north). To do so, use access.nancy or access.rennes (instead of access).

Terminal.png mylaptop:
ssh jdoe@access.nancy.grid5000.fr
Terminal.png mylaptop:
ssh jdoe@access.rennes.grid5000.fr

Configure an SSH alias for the local access

To establish a connection to the Grid'5000 network from the local access, you can configure your SSH client as follows:

Terminal.png laptop:
editor ~/.ssh/config
Host g5kl
  User login
  Hostname access.site.grid5000.fr
  ForwardAgent no

Host *.g5kl
  User login
  ProxyCommand ssh g5kl -W "$(basename %h .g5kl):%p"
  ForwardAgent no

Reminder: login is your Grid'5000 username and site is either nancy or rennes.
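
As a convenience, the substitution of login and site can be scripted. The sketch below writes the snippet to a scratch file so that it can be reviewed before being appended to ~/.ssh/config; the values jdoe and nancy are placeholders, and the ProxyCommand goes through the g5kl alias defined in the same snippet.

```shell
# Hypothetical values: replace with your Grid'5000 username and site.
login="jdoe"
site="nancy"

# Write the snippet to a scratch file first so it can be reviewed
# before being appended to ~/.ssh/config.
snippet=$(mktemp)
cat > "$snippet" <<EOF
Host g5kl
  User $login
  Hostname access.$site.grid5000.fr
  ForwardAgent no

Host *.g5kl
  User $login
  ProxyCommand ssh g5kl -W "\$(basename %h .g5kl):%p"
  ForwardAgent no
EOF

cat "$snippet"
# After review: cat "$snippet" >> ~/.ssh/config
```

The backslash before $( keeps the command substitution literal in the generated file, so that it is evaluated by ssh at connection time rather than when the snippet is written.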

With such a configuration, you can:

  • connect to the frontend of your local site
Terminal.png laptop:
ssh g5kl
  • transfer files from your laptop to your local frontend (with better bandwidth than using the national Grid'5000 access)
Terminal.png laptop:
scp myFile g5kl:~/
  • access the frontend of a different site:
Terminal.png laptop:
ssh grenoble.g5kl
  • transfer files from your laptop to a different frontend
Terminal.png laptop:
scp myFile sophia.g5kl:~/
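
The site aliases above work because of the basename call in the ProxyCommand: ssh replaces %h with the host name typed on the command line, and basename strips the .g5kl suffix before the name is handed to -W. A minimal illustration, runnable on any machine (the variable names are only for the example):

```shell
# %h is replaced by ssh with the host name given on the command line.
# basename with a suffix argument strips that suffix:
alias_host="grenoble.g5kl"
target=$(basename "$alias_host" .g5kl)
echo "$target"            # prints: grenoble
echo "${target}:22"       # host:port pair passed to -W for the default SSH port
```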

How to access data hosted on Inria/Loria or Inria/Irisa servers

The Grid'5000 network is not directly connected to the Inria/Loria or Inria/Irisa internal servers. If you want to access them from a Grid'5000 frontend and/or from Grid'5000 nodes, you need to go through a local bastion host. If you need to transfer data regularly, it is highly recommended to configure the SSH client on each Grid'5000 frontend you use.

Note.png Note

Please note that you have a different home directory on each Grid'5000 site, so you may need to replicate your SSH configuration across multiple sites.

Nancy

ssh-nge.loria.fr is an access machine hosted on Loria side. That machine can be used to access all services in the Inria/Loria environment.

Terminal.png frontend:
editor ~/.ssh/config
Host accessloria
   Hostname ssh-nge.loria.fr
   # Replace jdoe with your LORIA login
   User jdoe

Host *.loria
   ProxyCommand ssh accessloria -W $(basename %h .loria):%p
   # Replace jdoe with your LORIA login
   User jdoe
Note.png Note

ssh-nge.loria.fr only accepts logins using an SSH key: you cannot connect with your LORIA password alone.

Rennes

ssh-rba.inria.fr is an access machine hosted on Irisa side. That machine can be used to access all services in the Inria/Irisa environment.

Terminal.png frontend:
editor ~/.ssh/config
Host ssh-rba
   Hostname ssh-rba.inria.fr
   # Replace jdoe with your IRISA login
   User jdoe

Data hosted on Inria's NAS server is accessible under /nfs on ssh-rba.inria.fr. Assuming you have set up this configuration in your home directory on the Grenoble site:

Terminal.png fgrenoble:
scp ssh-rba:/nfs/nas4.irisa.fr/repository ~/local_dir

Transfer files to Grid'5000 storage

With that setup, you can now use:

  • Rsync to synchronize your data in the Inria/Loria environment with your local home on a Grid'5000 frontend
  • Sshfs to mount your data directory from the Inria/Loria environment directly under your local home, i.e. mount your /user/my_team/my_username directory (origin: reached through ssh-nge.loria.fr) on fnancy (destination: a folder on fnancy).

For example:

Terminal.png fnancy:
sshfs -o idmap=user jdoe@tregastel.loria:/users/myteam/jdoe ~/local_dir

To unmount the remote filesystem:

Terminal.png fnancy:
fusermount -u ~/local_dir
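
For the Rsync option listed above, a possible invocation is sketched below. The host and paths are hypothetical and simply mirror the sshfs example; the script only prints the command so that it can be reviewed first.

```shell
# Hypothetical host and paths, matching the sshfs example above.
remote="jdoe@tregastel.loria:/users/myteam/jdoe/dataset/"
dest="$HOME/dataset/"

# -a preserves permissions and timestamps, -z compresses in transit,
# --partial keeps partially transferred files so a rerun can resume.
set -- rsync -az --partial "$remote" "$dest"

# Print the command for review; run it with "$@" once it looks right.
echo "$@"
```

The trailing slash on the source path makes rsync copy the contents of the directory rather than the directory itself.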

I submitted a job, there are free resources, but my job doesn't start as expected!

Most likely, this is because of our configuration of resource restrictions per walltime. To make sure that someone requesting only a few nodes for a small amount of time can get them soon enough, the nodes are split into categories. The split depends on each cluster and is visible in the Gantt chart. An example of split is:

  • 20% of the nodes only accept jobs with walltime lower than 1h
  • 20% -- 2h
  • 20% -- 24h (1 day)
  • 20% -- 48h (2 days)
  • 20% -- 168h (one week)

Note that best-effort jobs are excluded from those limitations.

To see the exact walltime partition of each production cluster, have a look at the Nancy Hardware page or Rennes Hardware page.
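
To see how such a split affects a given job, the sketch below computes the share of nodes that can accept a job of a given walltime under the example split above (five categories of 20% each). It assumes that a node accepting jobs up to a limit also accepts shorter jobs, and that the limits are inclusive; the function name eligible_percent is hypothetical, and real clusters use their own partition.

```shell
# Share of nodes (in percent) that can accept a job of the given
# walltime (in minutes), under the example split above. Limits are
# treated as inclusive upper bounds, which is an assumption.
eligible_percent() {
    walltime_min=$1
    percent=0
    for limit_hours in 1 2 24 48 168; do
        if [ "$walltime_min" -le $((limit_hours * 60)) ]; then
            percent=$((percent + 20))
        fi
    done
    echo "$percent"
}

eligible_percent 30      # 30 min job -> 100
eligible_percent 600     # 10 h job   -> 60
eligible_percent 6000    # 100 h job  -> 20
```

A 100-hour job can therefore wait even when 80% of the nodes of a cluster are free, because only the last category accepts it.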

Another OAR feature that can impact the scheduling of your jobs is fair-sharing, which is based on the notion of karma: it assigns a dynamic priority to each submission based on the submission history of the user. With this feature, jobs from users who rarely submit jobs are generally scheduled earlier than jobs from heavy users.

I have an important demo, can I reserve all resources in advance?

There is a special challenge queue that can be used to combine resources from the classic Grid'5000 clusters and the production clusters for special events. If you would like to use it, please ask the executive committee for special permission.

Can I use besteffort jobs in production?

Yes, you can submit a besteffort job on the production resources by using the OAR -t besteffort option. Here is an example:

Terminal.png fnancy:
oarsub -t besteffort -q production ./my_script.sh

If you do not specify the -q production option, your job can run on both production and non-production resources.

How to cite

If you use the Grid'5000 production clusters for your research and publish your work, please add this sentence in the acknowledgements section of your paper:

Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).