Nancy:Production: Difference between revisions

From Grid5000
(16 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{Portal|User}}
{{Portal|User}}
{{Author|Clément Parisot}}
{{Maintainer|Clément Parisot}}
{{Portal|User}}
{{Portal|User}}


Line 13: Line 11:
= Using the resources =
= Using the resources =
== Getting an account ==
== Getting an account ==
Users from Nancy (LORIA/Inria Nancy Grand-Est) that want to access Grid'5000 primarily for a production usage must use that '''[[Special:G5KRequestAccountUMS|request form]]''' to open an account, like regular Grid'5000 users.
Users from the '''Loria''' laboratory (LORIA/Inria Nancy Grand-Est) that want to access Grid'5000 primarily for a production usage must use that '''[[Special:G5KRequestAccountUMS|request form]]''' to open an account, like regular Grid'5000 users.


* The following fields must be filled as follows:  
* The following fields must be filled as follows:  
** Group Granting Access: either the group of the research team or if it does not exist, <code>nancy-misc</code>.
** ''Group Granting Access'' (GGA): either the group '''named after the research team''', or if it does not belong to the team list below: '''<code>loria</code>'''.
** Laboratory: LORIA
** ''Laboratory'': LORIA
** Team: SYNALP, MULTISPEECH, CARAMBA, CAPSID, ORPAILLEUR, etc.
** ''Team'': SYNALP, MULTISPEECH, CARAMBA, CAPSID, ORPAILLEUR, LARSEN, SEMAGRAMME, SISR, TANGRAM...
 
Other users from Nancy (not belonging to the Loria laboratory) can ask to join using the '''<code>nancy-misc</code>''' Group Granting Access.
 
* Users are automatically subscribed to the Grid'5000 users mailing lists: [mailto:users@lists.grid5000.fr users@lists.grid5000.fr]. This list is the user-to-user or user-to-admin communication mean to address help/support requests for Grid'5000.
* Users are automatically subscribed to the Grid'5000 users mailing lists: [mailto:users@lists.grid5000.fr users@lists.grid5000.fr]. This list is the user-to-user or user-to-admin communication mean to address help/support requests for Grid'5000.


Line 31: Line 32:
To access production resources, you need to submit jobs in the ''production'' queue or using the ''production'' job type:
To access production resources, you need to submit jobs in the ''production'' queue or using the ''production'' job type:
  oarsub -q production -I
  oarsub -q production -I
  oarsub -q production -p "cluster='grele'" -I
  oarsub -q production -p grele -I
  oarsub -q production -l nodes=2,walltime=24 -I
  oarsub -q production -l nodes=2,walltime=24 -I
  oarsub -q production -l walltime=24 -t deploy 'sleep 100d'
  oarsub -q production -l walltime=24 -t deploy 'sleep infinity'
  ...
  ...
or
or
  oarsub -t production -I
  oarsub -t production -I
  oarsub -t production -p "cluster='grele'" -I
  oarsub -t production -p grele -I
  oarsub -t production -l nodes=2,walltime=24 -I
  oarsub -t production -l nodes=2,walltime=24 -I
  oarsub -t production -l walltime=24 -t deploy 'sleep 100d'
  oarsub -t production -l walltime=24 -t deploy 'sleep infinity'


== Dashboards and status pages ==
== Dashboards and status pages ==
Line 51: Line 52:
* The Grid'5000 team can be contacted as described on the [[Support]] page.
* The Grid'5000 team can be contacted as described on the [[Support]] page.
* The Grid'5000 ''responsable de site'' for Nancy is Lucas Nussbaum ([mailto:lucas.nussbaum@loria.fr lucas.nussbaum@loria.fr])
* The Grid'5000 ''responsable de site'' for Nancy is Lucas Nussbaum ([mailto:lucas.nussbaum@loria.fr lucas.nussbaum@loria.fr])
* Ismael Bada (engineer funded by CPER LCHN) can also help local users, especially regarding requests related to deep learning on Grid'5000. ([mailto:ismael.bada@loria.fr ismael.bada@loria.fr])
To get support, you can:
To get support, you can:
* Use the [mailto:users@lists.grid5000.fr users@lists.grid5000.fr] mailing list: all Grid'5000 users (700+ people) are automatically subscribed
* Use the [mailto:users@lists.grid5000.fr users@lists.grid5000.fr] mailing list: all Grid'5000 users (700+ people) are automatically subscribed
* Use the [mailto:nancy-users@lists.grid5000.fr nancy-users@lists.grid5000.fr] mailing list: all Grid'5000 users from Nancy are automatically subscribed
* Use the [mailto:nancy-users@lists.grid5000.fr nancy-users@lists.grid5000.fr] mailing list: all Grid'5000 users from Nancy are automatically subscribed
* Contact Ismael Bada (see above)


The Grid'5000 team does not have the resources (manpower) to do user support, such as helping with writing scripts, creating system images, etc. If you need such help, please contact either Ismael Bada (see above), or the SED service.
The Grid'5000 team does not have the resources (manpower) to do user support, such as helping with writing scripts, creating system images, etc. If you need such help, please contact the SED service.


= FAQ =
= FAQ =
Line 123: Line 122:
== I submitted a job, there are free resources, but my job doesn't start as expected! ==
== I submitted a job, there are free resources, but my job doesn't start as expected! ==
Most likely, this is because of our configuration of resources restriction per walltime.
Most likely, this is because of our configuration of resources restriction per walltime.
In order to make sure that someone requesting only a few nodes, for a small amount of time will be able to get soon enough, the nodes are split into categories. This depends on each cluster and is visible in the Gantt chart. An example split is:
In order to make sure that someone requesting only a few nodes, for a small amount of time will be able to get soon enough, the nodes are split into categories. This depends on each cluster and is visible in the Gantt chart. An example of split is:
* 20% of the nodes only accept jobs with walltime lower than 1h
* 20% of the nodes only accept jobs with walltime lower than 1h
* 20% -- 2h
* 20% -- 2h
* 20% -- 24h
* 20% -- 24h (1 day)
* 20% -- 48h
* 20% -- 48h (2 days)
* 20% -- one week
* 20% -- 168h (one week)
Note that ''best-effort'' jobs are excluded from those limitations.
Note that ''best-effort'' jobs are excluded from those limitations.


Another enabled OAR feature that could impact the scheduling of your jobs is the OAR ''karma'': this feature assigns a dynamic priority to submissions based on the history of submissions by a specific user. With that feature, the jobs from users that rarely submit jobs will be generally scheduled earlier than jobs from heavy users.
'''To see the exact walltime partition of each production cluster''', have a look at the [[Nancy:Hardware#graffiti|Nancy Hardware page]].
 
Another OAR feature that could impact the scheduling of your jobs is the OAR scheduling with fair-sharing, which is based on the notion of ''karma'': this feature assigns a dynamic priority to submissions based on the history of submissions by a specific user. With that feature, the jobs from users that rarely submit jobs will be generally scheduled earlier than jobs from heavy users.


== I have an important demo, can I reserve all resources in advance? ==
== I have an important demo, can I reserve all resources in advance? ==
Line 142: Line 143:


== Is it possible to run Matlab? ==
== Is it possible to run Matlab? ==
Yes, through SSH tunneling to access UL license server (access to bastionssh.loria.fr required).
Yes, through SSH tunneling to access UL license server (access to bastionssh.loria.fr required).
More information is available in [https://members.loria.fr/FSur/articles/MatlabGrid5000.pdf this document].
More information is available in [https://members.loria.fr/FSur/articles/MatlabGrid5000.pdf this document].
'''Important note (2022-02-23)''' : what is described above do not work anymore from A to Z. We are currently working on the integration of Matlab through [[Environment_modules|environment modules]]. The SSH tunneling as it is described above will continue to be used to get access to the license server. In the meantime, if you need to use Matlab within Grid'5000 using your own installation and cannot access to the license server, please contact us at support-staff@lists.grid5000.fr.


== Energy costs ==
== Energy costs ==

Revision as of 12:30, 31 March 2022


Introduction

The Nancy Grid'5000 site also hosts clusters for production use (including clusters with GPUs). See Nancy:Hardware for details.

The usage rules differ from the rest of Grid'5000:

  • Advance reservations (oarsub -r) are not allowed (to avoid fragmentation). Only submissions (and reservations that start immediately) are allowed.
  • All Grid'5000 users can use those nodes (provided they meet the conditions stated in Grid5000:UsagePolicy), but it is expected that users outside of LORIA / Inria Nancy -- Grand Est will use their own local production resources in priority, and mostly use those resources for tasks that require Grid'5000 features. Examples of local production clusters are Tompouce (Saclay), Igrida (Rennes), Plafrim (Bordeaux), etc.

Using the resources

Getting an account

Users from the Loria laboratory (LORIA/Inria Nancy Grand-Est) who want to access Grid'5000 primarily for production usage must use the request form to open an account, like regular Grid'5000 users.

  • The following fields must be filled in as follows:
    • Group Granting Access (GGA): either the group named after your research team or, if your team is not in the team list below, loria.
    • Laboratory: LORIA
    • Team: SYNALP, MULTISPEECH, CARAMBA, CAPSID, ORPAILLEUR, LARSEN, SEMAGRAMME, SISR, TANGRAM...

Other users from Nancy (not belonging to the Loria laboratory) can ask to join using the nancy-misc Group Granting Access.

  • Users are automatically subscribed to the Grid'5000 users mailing list: users@lists.grid5000.fr. This list is the user-to-user and user-to-admin means of communication for help/support requests about Grid'5000.

Learning to use Grid'5000

Refer to the Getting Started tutorial. There are other tutorials listed on the Users Home page.

Using deep learning software on Grid'5000

A tutorial for using deep learning software on Grid'5000, written by Ismael Bada, is also available.

Using production resources

To access production resources, you need to submit jobs in the production queue or using the production job type:

oarsub -q production -I
oarsub -q production -p grele -I
oarsub -q production -l nodes=2,walltime=24 -I
oarsub -q production -l walltime=24 -t deploy 'sleep infinity'
...

or

oarsub -t production -I
oarsub -t production -p grele -I
oarsub -t production -l nodes=2,walltime=24 -I
oarsub -t production -l walltime=24 -t deploy 'sleep infinity'
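
The commands above are all interactive (-I) or deploy jobs. As a minimal sketch of the batch case, the helper below only prints the oarsub command that would submit a script to the production queue, so it can be checked before being run on a frontend. The function name production_cmd and the script name ./my_script.sh are illustrative assumptions; the oarsub options are the ones already shown on this page.

```shell
#!/bin/sh
# Print (do not run) an oarsub command for a batch job in the production queue.
# Arguments: number of nodes, walltime in hours, path to the job script.
# production_cmd is a hypothetical helper name, not part of OAR.
production_cmd() {
    nodes="$1"
    walltime="$2"
    script="$3"
    printf 'oarsub -q production -l nodes=%s,walltime=%s %s\n' \
        "$nodes" "$walltime" "$script"
}

# Example: 2 nodes for 24 hours running ./my_script.sh
production_cmd 2 24 ./my_script.sh
```

Once the printed line looks right, it can be pasted on the frontend, or the printf can be replaced by the actual oarsub call.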

Dashboards and status pages

Contact information and support

Contacts:

  • The Grid'5000 team can be contacted as described on the Support page.
  • The Grid'5000 responsable de site for Nancy is Lucas Nussbaum (lucas.nussbaum@loria.fr)

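To get support, you can:

  • Use the users@lists.grid5000.fr mailing list: all Grid'5000 users (700+ people) are automatically subscribed
  • Use the nancy-users@lists.grid5000.fr mailing list: all Grid'5000 users from Nancy are automatically subscribed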

The Grid'5000 team does not have the resources (manpower) to do user support, such as helping with writing scripts, creating system images, etc. If you need such help, please contact the SED service.

FAQ

Data storage

The data needed for experiments of the production teams is stored on the talc-data and talc-data2 NFS servers. talc-data2 is a regular Group Storage server, and talc-data is a storage server dedicated to the multispeech research team, but compatible with the Group Storage mechanisms.

talc-data2 provides 213T of storage space. talc-data contains three disk bays (baie, baie2 and baie3) hosting volumes talc, talc2 and talc3 respectively, which provide 58T + 58T + 71T = 187T of storage space.

 LV           VG        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
 talc         baie      -wi-ao----  58.21t                                                    
 talc2        baie2     -wi-ao----  58.21t                                                    
 talc3        baie3     -wi-ao----  71.28t
/dev/mapper/baie2-talc2 on /export/group/talc2 type ext4 (rw,relatime,data=ordered)
/dev/mapper/baie3-talc3 on /export/group/talc3 type ext4 (rw,relatime,data=ordered)
/dev/mapper/baie-talc on /export/group/talc type ext4 (rw,relatime,stripe=32752,data=ordered)

Those repositories have the same permissions as any Group Storage volumes on their root folder,

drwxrws--T+ 4 root sto-multispeech 4096 Oct 13 22:13 talc
drwxrws--T+ 4 root sto-multispeech 4096 Mar 14  2019 talc2
drwxrws--T+ 4 root sto-multispeech 4096 Oct 10 10:50 talc3

but unlike other Group Storage volumes, they contain old data which is not affected by those permissions. Only the newly created directories under the root folder inherit those permissions.

Please remember that this data is hosted on an NFS server, which is not recommended for compute usage.

For other, shorter-term storage options, see Storage.

I am physically located in the LORIA building, is there a shorter path to connect?

If for some reason you don't want to go through Grid'5000 national access machines (access-south and access-north), you can also connect directly using

mylaptop:
ssh jdoe@access.nancy.grid5000.fr

How to access data in Inria/Loria

bastionssh.loria.fr is an access machine hosted on Loria side. That machine can be used to access all services in the Inria/Loria environment.

You need to use the SSH ProxyCommand for that purpose.

Adjust the following lines and add them to your ~/.ssh/config:

Host accessloria
        Hostname bastionssh.loria.fr
        # Replace jdoe with your LORIA login
        User jdoe

Host *.loria
        # Replace jdoe with your LORIA login
        User jdoe
        ProxyCommand ssh accessloria -W $(basename %h .loria):%p
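
To see what the ProxyCommand line does, recall that ssh replaces %h with the host name given on the command line and %p with the port. The sketch below reproduces that substitution by hand for the example host tregastel.loria and the default SSH port 22 (both values are only illustrative): basename strips the .loria suffix, so the bastion is asked to forward the connection to the real internal host name.

```shell
#!/bin/sh
# Values ssh would substitute for %h and %p (illustrative example).
host="tregastel.loria"
port=22

# Same expression as in the ProxyCommand: strip the ".loria" suffix
# and append the port, giving the target passed to "ssh accessloria -W".
target="$(basename "$host" .loria):$port"
echo "$target"
```

With the configuration in place, ssh tregastel.loria therefore connects through accessloria (bastionssh.loria.fr) to tregastel on port 22.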

With that setup, you can now use:

  • rsync to synchronize your data in the Inria/Loria environment with your local home on the Grid'5000 frontend
  • sshfs to mount your data directory from the Inria/Loria environment directly under your local home, i.e. mount your /users/my_team/my_username directory (origin, reached through bastionssh.loria.fr) on a folder on fnancy (destination).

For example:

fnancy:
sshfs -o idmap=user jdoe@tregastel.loria:/users/myteam/jdoe ~/local_dir

To unmount the remote filesystem:

fnancy:
fusermount -u ~/local_dir
Note

Given that bastionssh.loria.fr only accepts logins using an SSH key, you cannot simply connect with your LORIA password.
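
The rsync option mentioned above has no example on this page, so here is a minimal sketch. It only prints the rsync command, so the paths can be checked before any data is copied. The helper name loria_rsync_cmd and the dataset directory are hypothetical; the host tregastel.loria and the /users/myteam/jdoe path are taken from the sshfs example, and the connection is assumed to go through a Host *.loria entry in ~/.ssh/config.

```shell
#!/bin/sh
# Print (do not run) an rsync command that copies a directory from the
# Inria/Loria environment to a directory on the Grid'5000 frontend.
# loria_rsync_cmd is a hypothetical helper; all arguments are placeholders.
loria_rsync_cmd() {
    remote_host="$1"   # e.g. tregastel.loria
    remote_dir="$2"    # e.g. /users/myteam/jdoe/dataset
    local_dir="$3"     # e.g. ~/dataset
    # -a preserves permissions and timestamps, -v lists transferred files.
    # The trailing slash on the source copies the directory contents
    # rather than the directory itself.
    printf 'rsync -av %s:%s/ %s/\n' "$remote_host" "$remote_dir" "$local_dir"
}

loria_rsync_cmd tregastel.loria /users/myteam/jdoe/dataset '~/dataset'
```

Run the printed command on fnancy. No user name is needed in front of the host if the User line of the Host *.loria entry is set to your LORIA login.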

I submitted a job, there are free resources, but my job doesn't start as expected!

Most likely, this is because of our configuration of resource restrictions per walltime. To make sure that someone requesting only a few nodes for a short time can get them soon enough, the nodes are split into categories. The split depends on each cluster and is visible in the Gantt chart. An example split is:

  • 20% of the nodes only accept jobs with walltime lower than 1h
  • 20% -- 2h
  • 20% -- 24h (1 day)
  • 20% -- 48h (2 days)
  • 20% -- 168h (one week)

Note that best-effort jobs are excluded from those limitations.
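
To make the effect of such a split concrete, the sketch below computes the share of nodes that could accept a job of a given walltime under the example split above. It is an illustration only: the function name eligible_share is made up, it assumes a node category accepts a job whose walltime is less than or equal to its limit, it takes whole hours, and real clusters use their own partitions.

```shell
#!/bin/sh
# Share of nodes (in %) that may accept a job of the given walltime (hours),
# under the example split: five 20% categories limited to 1h, 2h, 24h, 48h, 168h.
# Assumption: a category accepts a job when walltime <= its limit.
eligible_share() {
    walltime="$1"
    share=0
    for limit in 1 2 24 48 168; do
        if [ "$walltime" -le "$limit" ]; then
            share=$((share + 20))
        fi
    done
    echo "$share"
}

eligible_share 1     # short job: every category qualifies
eligible_share 3     # only the 24h, 48h and 168h categories qualify
eligible_share 72    # only the one-week category qualifies
```

This is why a long job can stay in the waiting state while nodes appear free: the free nodes may belong to categories whose walltime limit is shorter than the requested walltime.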

To see the exact walltime partition of each production cluster, have a look at the Nancy Hardware page.

Another OAR feature that could affect the scheduling of your jobs is fair-share scheduling, which is based on the notion of karma: it assigns a dynamic priority to submissions based on each user's submission history. With that feature, jobs from users who rarely submit jobs are generally scheduled earlier than jobs from heavy users.

I have an important demo, can I reserve all resources in advance?

There's a special challenge queue that can be used to combine resources from the classic Grid'5000 clusters and the production clusters for special events. If you would like to use it, please get in touch with the cluster managers.

Can I use besteffort jobs in production?

Yes, you can submit a besteffort job on the production resources by using the OAR -t besteffort option. Here is an example:

fnancy:
oarsub -t besteffort -q production ./my_script.sh

If you do not specify the -q production option, your job could run on both production and non-production resources.

Is it possible to run Matlab?

Yes, through SSH tunneling to access the UL license server (access to bastionssh.loria.fr required). More information is available in this document.

Important note (2022-02-23): the procedure described above no longer works from start to finish. We are currently working on integrating Matlab through environment modules. The SSH tunneling described above will still be used to reach the license server. In the meantime, if you need to use Matlab within Grid'5000 with your own installation and cannot access the license server, please contact us at support-staff@lists.grid5000.fr.

Energy costs

Grid'5000 nodes are automatically shut down when they are not reserved so, when possible, it is a good idea to reserve nodes during cheaper time slots.

Electricity costs are currently:

  • Periods:
    • Peak hours: December, January, February; 09:00-11:00 and 18:00-20:00
    • Winter full-rate hours: 06:00-22:00 (excluding the peak hours specified in article 18).
    • Winter off-peak hours: 22:00-06:00
    • Summer full-rate hours: 06:00-22:00
    • Summer off-peak hours: 22:00-06:00
    • Sundays consist only of off-peak hours, in both winter and summer.
  • Cost per kWh:
    • Peak hours: 10.893 c€/kWh
    • Winter full-rate hours: 6.535 c€/kWh
    • Winter off-peak hours: 4.474 c€/kWh
    • Summer full-rate hours: 4.125 c€/kWh
    • Summer off-peak hours: 2.580 c€/kWh
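
As a rough illustration of what these rates mean for a job, the sketch below multiplies an energy consumption in kWh by a rate in euro cents per kWh. The function name energy_cost and the job itself (4 nodes drawing 250 W each for 10 hours, i.e. 10 kWh) are hypothetical; only the two rates come from the list above.

```shell
#!/bin/sh
# Estimated electricity cost, in euro cents, of consuming a given energy (kWh)
# at a given rate (euro cents per kWh). energy_cost is a hypothetical helper.
energy_cost() {
    LC_ALL=C awk -v kwh="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", kwh * rate }'
}

# Hypothetical job: 4 nodes x 250 W x 10 h = 10 kWh.
energy_cost 10 10.893   # at the peak-hours rate
energy_cost 10 2.580    # at the summer off-peak rate
```

For this hypothetical job, the peak-hours cost is about four times the summer off-peak cost, which is the motivation for scheduling long jobs in cheaper time slots when possible.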

How to cite / Comment citer

If you use the Grid'5000 production clusters for your research and publish your work, please add this sentence in the acknowledgements section of your paper:

Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).