News

This page provides some news about the Grid'5000 testbed. You can also subscribe to the RSS feed and follow Grid'5000 on Twitter.

Note

Please note that only major changes are covered by news items. It may also be relevant to look at the Grid'5000 reference-repository, which provides the description of the platform. Most changes to the platform will show up there in a git commit.

NFS Support Now Available for All Grid'5000 Supported Linux Distributions

We are pleased to announce that we have released the NFS version of all our supported Linux distributions! With this new feature, you can now deploy your desired distribution with pre-configured LDAP and network storage support. This will enable you to log in to the deployed node using your own Grid'5000 account (not just root) and access your home directory and group storage directly from the deployed node.

While Debian 10 and Debian 11 already had NFS support, we have now added this feature to the following distributions:

  • Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04
  • CentOS 7, CentOS 8
  • CentOS Stream 8, CentOS Stream 9
  • Rocky Linux 8, Rocky Linux 9
  • Debian Testing

All these new environments are suffixed with "-nfs" (e.g., `centosstream9-nfs`, `ubuntu2204-nfs`). For more information on our available environments or on how to deploy environments with kadeploy, please refer to our Getting Started and Advanced Kadeploy pages.

-- Grid'5000 Team 18:00, Feb 22nd 2023 (CEST)

New documentation for the FPGA nodes and wattmeter update in Grenoble

Documentation on using the FPGA nodes (servan cluster in Grenoble) is now available on the FPGA page.

Furthermore, the Servan nodes' power supply units are now monitored by the Grenoble wattmeter and available in Kwollect.

Finally, because its measurements were unreliable, Dahu is no longer monitored by the wattmeter. Past problems with Yeti and Troll should now be fixed.

-- Grid'5000 Team 14:20, Sept 15th 2022 (CEST)

New cluster “servan” available in Grenoble, featuring FPGAs

We are pleased to announce a new cluster named “servan” in Grenoble. It is composed of 2 Dell R7525 nodes, each equipped with two AMD EPYC 7352 24-core CPUs, 128GB of DDR4 RAM, two 1.6TB NVMe SSDs, and a 25Gbps Ethernet interface.

Additionally, each node features a Xilinx Alveo U200¹ FPGA with two 100Gbps Ethernet ports connected to the Grid'5000 network.

For now, the Grid'5000 technical team provides no support for using these FPGAs. However, please let us know if you plan to use them.

This cluster has been funded by CNRS/INS2I to support the Slices project.

Warning: debian11-xen² is currently unavailable on servan due to a problem with the network cards.

¹ https://www.xilinx.com/products/boards-and-kits/alveo/u200.html

² https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=13923

-- Grid'5000 Team 14:40, June 30th 2022 (CEST)

Shorter command usage for Kadeploy tools

We changed the syntax of the Kadeploy commands. A typical command line for deployment was:

 kadeploy3 -f $OAR_NODE_FILE -e debian11-x64-base -k

The following changes apply:

  • The -e flag can now be omitted when deploying a recorded environment. The -a flag must still be set when deploying an anonymous environment (from a description file).
  • The public key is now copied by default, so -k can be omitted.
  • If no nodes are specified on the command line with -f or -m, the nodes to deploy are taken from $OAR_NODE_FILE.

As a result, the same deployment can be achieved with the following shorter command line:

 kadeploy3 debian11-x64-base

Moreover, the architectures have been removed from the names of the Grid'5000 reference environments (e.g. "debian11-x64-base" and "debian11-ppc64-base" both become "debian11-base"). Kadeploy selects the environment that matches the CPU architecture of the cluster being deployed. As a result, the previous kadeploy command line can be shortened to:

 kadeploy3 debian11-base

However, the old syntax and the environment names that include the architecture are still supported for backward compatibility.
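Since kadeploy3 now falls back to $OAR_NODE_FILE, you may want to check which nodes that file lists before deploying. The sketch below assumes that the file contains one line per reserved resource, so a node can appear several times, and that a plain deduplication is enough. The `nodes_to_deploy` function name is hypothetical and is not a Grid'5000 tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each node listed in an OAR node file once.
# With no argument it reads $OAR_NODE_FILE, which is what kadeploy3 now does
# when neither -f nor -m is given.
nodes_to_deploy() {
  local nodefile="${1:-${OAR_NODE_FILE:-}}"
  if [ -z "$nodefile" ]; then
    echo "nodes_to_deploy: no node file given and OAR_NODE_FILE is not set" >&2
    return 1
  fi
  # A node may be listed once per reserved resource (e.g. per core): deduplicate.
  sort -u "$nodefile"
}
```

Inside a job submitted with `oarsub -t deploy -I`, you can run `nodes_to_deploy` first and then `kadeploy3 debian11-base`.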

You can find more information on the Getting Started and Advanced Kadeploy pages.

-- Grid'5000 Team 10:50, Wed, May 19th 2022 (CEST)

Cluster "sirius" from Nvidia is available in the default queue

We are pleased to announce that a new cluster named "sirius" is available in Lyon.

This cluster consists of a single Nvidia DGX A100 node with 2 AMD EPYC 7742 CPUs (64 cores per CPU), 8 Nvidia A100 (40 GiB) GPUs, 1TB of DDR4 RAM, 2x1.92 TB SSD + 4x3.84 TB SSD disks, and an Infiniband network (coming soon).

Energy monitoring is available for this cluster, provided by the same wattmeter devices as used for the other clusters in Lyon.

This cluster is tagged as "exotic", so the `-t exotic` option must be provided to oarsub to select sirius.

This machine has been funded by LIP laboratory with the support of INRIA Grenoble Rhone-Alpes and ENS Lyon.

-- Grid'5000 Team 11:30, Wed, May 11th 2022 (CEST)

Ubuntu 22.04 LTS environment available

A kadeploy environment (image) for Ubuntu Jammy Jellyfish 22.04 LTS (ubuntu2204-min) is now available and registered in all Grid'5000 sites with Kadeploy.

Only the x86_64 architecture is available for now. For ARM (arm64/aarch64) and POWER 8 (ppc64/ppc64le), you can use the previous Ubuntu 20.04 LTS images (ubuntu2004-min).

Please refer to our getting started document to learn how to deploy such an image on the platform [1].

This image is built with Kameleon (just like other Grid'5000 environments). The recipe is available in the environments-recipes git repository.

If you need other system images for your work, please let us know.

-- Grid'5000 Team 13:40, Fri, May 6th 2022 (CEST)

End of support for Debian 9 "Stretch" environments

We are going to drop support for the stretch/debian9 environments on April 30th 2022.

The last version of the stretch environments, i.e. 2021100608, will remain available in /grid5000. You can still access older versions of the Debian 9 environments under the archive directory (see /grid5000/README.unmaintained-envs for more information).

-- Grid'5000 Team 14:40, Tue, 19 Apr 2022 (CEST)

Syntax simplification of the OAR job submissions and removal of allow_classic_ssh job type

Three changes to OAR's behavior on Grid'5000 have been made to simplify job submission commands.


1. Syntax simplification mechanism to ease OAR job submissions. As announced previously, a syntax simplification mechanism has been deployed on Grid'5000. It is now enabled by default (until now, it was only available through the `syntax-simplifier` job type).

Here are some examples of its possibilities:

  • `oarsub -p dahu-19` will reserve the dahu-19 node
  • `oarsub -p dahu` will reserve one node on the dahu cluster
  • `oarsub -p "ssd&256GB" -l nodes=2` will reserve two nodes with SSD storage and at least 256GB of memory
  • `oarsub -l "{host+disk&grimoire}"` will reserve one node on the grimoire cluster with all its disks
  • `oarsub -p "host IN (yeti-1,yeti-2,yeti-3)"` will reserve one of the listed nodes (here, the FQDNs and single quotes are omitted)

Syntax simplification implements the following features:

  • automatic addition of single quotes for character strings in expressions
  • automatic expansion of nodes' short names to long names (FQDN)
  • usage of "&" as an alternative to "and"; "|" as an alternative to "or"
  • aliases to translate short and simple keywords to OAR resource filter expressions, for the most common resources (GPUs, clusters, nodes, disks,…)
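The shell sketch below makes the FQDN expansion and quoting concrete by showing what a short host list turns into. It only illustrates the transformation described above and is not OAR's implementation. The cluster-to-site table is a hypothetical, hand-written subset; a real tool would read it from the reference API.

```shell
#!/usr/bin/env bash
# Illustration only (not OAR's implementation): expand a comma-separated list of
# short node names into the quoted-FQDN filter that plain OAR syntax expects.
# Hypothetical cluster -> site table; a real tool would query the reference API.
declare -A SITE_OF=( [dahu]=grenoble [yeti]=grenoble [grimoire]=nancy )

expand_hosts() {
  local IFS=,
  local node cluster
  local -a quoted=()
  for node in $1; do                 # split on "," because IFS is set to ","
    cluster="${node%-*}"             # e.g. yeti-1 -> yeti
    if [ -z "${SITE_OF[$cluster]:-}" ]; then
      echo "expand_hosts: unknown cluster '$cluster'" >&2
      return 1
    fi
    quoted+=("'$node.${SITE_OF[$cluster]}.grid5000.fr'")
  done
  echo "host IN (${quoted[*]})"      # "${arr[*]}" joins with the first IFS char: ","
}
```

For example, `expand_hosts yeti-1,yeti-2` prints `host IN ('yeti-1.grenoble.grid5000.fr','yeti-2.grenoble.grid5000.fr')`, which is the long form you would otherwise type yourself.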

The available aliases and a more complete overview are available on the OAR Syntax simplification page. Please note that the standard OAR syntax is still supported: the simplifications described here are additions, not replacements. If you encounter any issues, you can disable the syntax simplification by adding the `no-syntax-simplifier` job type at job submission (typically, by adding `-t no-syntax-simplifier` to your oarsub command line).


2. OAR's `allow_classic_ssh` job type is removed. It is now possible to connect with `ssh` to entirely reserved nodes without the `allow_classic_ssh` job type. `oarsh` is still supported, and it is still required to connect to jobs that do not reserve entire nodes (all CPU cores). A warning message will be displayed whenever a job submission still uses the `allow_classic_ssh` type, since it is no longer needed.


3. oarsub's -r option enhanced to ease advance reservation submission. This option places a reservation in advance by passing the desired start date of the job. The date format is `YYYY-MM-DD hh:mm:ss`, but you can now omit some parts of this format, depending on your job start requirements. You can now also set the end date of the job using the `<START DATE>, <END DATE>` format; if an end date is provided, the job walltime is inferred. When `YYYY-MM-DD` is omitted, it defaults to the current day. In `hh:mm:ss`, minutes and seconds can also be omitted. Finally, the special keyword `now` can be used in place of the current date.

For instance, the following strings are valid with `-r`:

  • "2022-03-30 19", to start the job at 19:00:00 on the 30/03/2022
  • "19", to start the job today at 19:00:00
  • "19:30", to start the job today at 19:30:00
  • "2022-03-30 19,2022-03-31 04", to start the job on the 30/03/2022 at 19:00:00 with a walltime of 09:00:00
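The shell sketch below illustrates the rules above. It expands a simplified -r value into the full `YYYY-MM-DD hh:mm:ss` form and derives the walltime from a start/end pair. It is our own illustration and assumes bash and GNU date. The `expand_r_date` and `r_walltime` names are hypothetical. The sketch does not handle the `now` keyword, and it is not how OAR itself parses the option.

```shell
#!/usr/bin/env bash
# Illustrative sketch (not OAR's parser) of the simplified `oarsub -r` rules.
# Requires bash and GNU date.

# Expand "[YYYY-MM-DD ]hh[:mm[:ss]]" into "YYYY-MM-DD hh:mm:ss".
expand_r_date() {
  local input="$1" day time h m s
  local re='^([0-9]{4}-[0-9]{2}-[0-9]{2}) (.+)$'
  if [[ $input =~ $re ]]; then
    day="${BASH_REMATCH[1]}"
    time="${BASH_REMATCH[2]}"
  else
    day="$(date '+%Y-%m-%d')"        # date omitted: default to the current day
    time="$input"
  fi
  IFS=: read -r h m s <<< "$time"
  # Missing minutes/seconds default to 0; 10# avoids octal parsing of "08", "09".
  printf '%s %02d:%02d:%02d\n' "$day" "$((10#$h))" "$((10#${m:-0}))" "$((10#${s:-0}))"
}

# Walltime (hh:mm:ss) implied by a "<START DATE>,<END DATE>" string.
r_walltime() {
  local start_part="${1%%,*}" end_part="${1#*,}" start end secs
  end_part="${end_part#"${end_part%%[![:space:]]*}"}"   # drop spaces after the comma
  start="$(expand_r_date "$start_part")"
  end="$(expand_r_date "$end_part")"
  secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
  printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}
```

For instance, `r_walltime '2022-03-30 19,2022-03-31 04'` prints `09:00:00`, assuming no daylight-saving change between the two dates in the local time zone.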

-- Grid'5000 Team 10:27, Thu, 31 Mar 2022 (CEST)

New documentation for the Grid'5000 environment creation

We are pleased to announce the new documentation/tutorial for the Grid'5000 environment creation. See the Environment creation page.

(If needed, the former documentation is still available on the old page).

-- Grid'5000 Team 11:09, Wed, 23 Feb 2022 (CEST)

Testing phase for the syntax simplification of the OAR job submissions

Grid'5000 uses the OAR resources and jobs manager. The oarsub tool and the REST API are the entry points for selecting the desired resources, but their syntax is not easy to master.

To facilitate job submission, a new syntax simplification mechanism is now available on Grid'5000, with the following features:

  • automatic addition of single quotes for strings in expressions
  • usage of short node names instead of FQDNs
  • usage of "&" as an alternative to "and"; "|" as an alternative to "or"
  • aliases to translate short and simple keywords to OAR resource filter expressions, for the most common resources (GPUs, clusters, nodes, …)

The available aliases and a more complete overview are available on the Grid'5000 wiki [1].

Please note that the standard OAR syntax is still supported, the simplifications described here are not replacements but additions.

For now, syntax simplification is only available by adding `-t syntax-simplifier` to oarsub command line, as we are in a testing phase before enabling it by default. Grid'5000 documentation and tutorials will be updated to use the aliases at the end of the testing phase.

If you encounter any problems or have any suggestions, feel free to contact us (support-staff@lists.grid5000.fr).

1: https://www.grid5000.fr/w/OAR_Syntax_simplification

-- Grid'5000 Team 13:39, Tue, 15 Feb 2022 (CEST)

Debian 11 "Bullseye" std environment is now the default environment on nodes

We are pleased to announce that Debian 11 (Bullseye) std environment (debian11-std) is now the default environment on nodes. See `kaenv3 -l debian11%` for an exhaustive list of available variants.

New features and changes in Debian 11 are described in: https://www.debian.org/releases/bullseye/amd64/release-notes/ch-whats-new.en.html .

In particular, it includes many software updates:

  • Cuda 11.2.2 / Nvidia Drivers 460.91.03
  • OpenJDK 17
  • Python 3.9.2
    • python3-numpy 1.19.5
    • python3-scipy 1.6.0
    • python3-pandas 1.1.5
  • Perl 5.32.1
  • GCC 9.3 and 10.2
  • G++ 10.2
  • Libboost 1.74.0
  • Ruby 2.7.4
  • CMake 3.18.4
  • GFortran 10.2.1
  • Liblapack 3.9.0
  • libatlas 3.10.3
  • RDMA 33.2
  • OpenMPI 4.1.0

Some additional important changes have to be noted:

  • Python 2 is not included, as it has been deprecated since Jan. 2020, and `/usr/bin/python` is symlinked to `/usr/bin/python3`. Note that Python 2 packages are still available in the repositories, and that you can install the 'python-is-python2' package to change the symlink.
  • OpenMPI
    • The OpenMPI backend used for Infiniband and Omnipath networks is UCX (this is similar to mainstream OpenMPI, but differs from the default behaviour in Debian 11).
    • The libfabric package has been recompiled to disable a buggy EFA provider (AWS)¹.
  • The Apache MXNet deep learning framework and its documentation are no longer officially supported by the team.
  • As Ganglia monitoring has been replaced by Kwollect², Ganglia services have been removed from environments.
  • The BeeGFS client is not operational³, which impacts the grcinq and grvingt clusters. As a workaround, users can deploy a debian10 environment to get BeeGFS.
  • Nvidia GPUs: the CUDA/Nvidia drivers shipped in the debian11 environments do not support some older GPUs, either out of the box or at all.
    • The CUDA compiler needs a compatibility mode to work with the grimani cluster (Tesla K40m): `nvcc --gpu-architecture=sm_35`.
    • Node graphique-1 has been retired because its GPU (Nvidia Titan Black) is no longer supported. The other nodes, graphique-[2-6] (GeForce GTX 980), are still supported. Please note that after this retirement, the graphique cluster is homogeneous (all nodes have the same GPUs).
  • Systemd defaults to control groups v2 (cgroupv2) in bullseye, but the std environments use control groups v1 (cgroupv1) instead.
  • As a reminder, legacy block device naming (e.g. sda) is no longer reliable with recent kernels, so you may need to modify your deployment or experiment scripts⁴.
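The std environments differ from the other bullseye environments here, so a script may need to know which cgroup version is active on a node. A common check, sketched below, is the filesystem type mounted at /sys/fs/cgroup: it is `cgroup2fs` for a pure v2 setup, while a v1 (or hybrid) setup mounts a tmpfs there. This is a heuristic that assumes GNU stat. The `cgroup_version` name is our own, and it is not a Grid'5000 tool.

```shell
#!/usr/bin/env bash
# Heuristic check of the cgroup version in use on the current node.
# Relies on GNU stat; prints v1, v2 or unknown.
cgroup_version() {
  local fstype
  fstype="$(stat -f -c %T /sys/fs/cgroup 2>/dev/null)"
  case "$fstype" in
    cgroup2fs) echo v2 ;;        # unified hierarchy only
    tmpfs)     echo v1 ;;        # per-controller v1 hierarchies (possibly hybrid with v2)
    *)         echo unknown ;;
  esac
}
```

On a node deployed with debian11-std you would expect `v1`, and on the other debian11 environments `v2`, provided the heuristic holds for that system.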

As with previous changes of the default environment, the frontends of all sites have also been reinstalled with Debian 11.

The wiki documentation has been updated to take into account Debian 11 usage.

¹: https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=13260

²: https://www.grid5000.fr/w/Monitoring_Using_Kwollect

³: https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=13077

⁴: https://www.grid5000.fr/w/News#Kadeploy:_use_of_partition_label_and_arbitrary_disk_identifier

-- Grid'5000 Team 17:15, Dec 1st 2021 (CEST)

Kadeploy: Machine's architecture added to environments description

Kadeploy environments now require the architecture information in a new field named “arch” in the environment descriptions.

Currently valid architectures on Grid’5000 are: x86_64, ppc64le (drac cluster in Grenoble), aarch64 (pyxis cluster in Lyon).

We have already added the architecture information in every recorded environment (kaenv database). If you deploy from an environment description (-a option of kadeploy), you will need to add this new field yourself. Please contact support-staff@lists.grid5000.fr in case of issues.

If multiple environments share the same name but have different architectures, Kadeploy will automatically select the appropriate architecture for a given cluster.

Taking advantage of this new feature, the architecture code names will soon be removed from the Grid’5000 reference environments names. For instance, debian11-x64-big will become debian11-big: “x64” disappears but “x86_64” is set in the environment architecture tag.

-- Grid'5000 Team 11:38, November 18th 2021 (CET)

Cluster "neowise" from AMD is available in the default queue

We are pleased to announce that the "neowise" cluster is now available in Lyon in the default queue. It features 10 nodes, each including an AMD EPYC 7642 CPU with 48 cores, 8 AMD Radeon Instinct MI50 GPUs (32 GB), 512GB of RAM and an Infiniband network.

Remember that this cluster is tagged as "exotic", so the `-t exotic` option must be provided to oarsub to select neowise.

The cluster has been donated by AMD to Genci and Inria, in particular to support research against COVID-19. We would like to thank AMD for this donation and Genci for their successful collaboration in making this machine available in Grid'5000.

-- Grid'5000 Team 13:20, November 9th 2021 (CET)

New cluster “gruss” available in Nancy

We are pleased to announce that a new cluster named “gruss” is available in Nancy in the production queue. It features 4 Dell R7525 nodes, each with two AMD EPYC 7352 24-core CPUs, two NVidia A40-48GB GPUs, 256GB of DDR4 RAM, and one 1.92TB SAS SSD.

This cluster has been funded mainly by the MULTISPEECH research team (Grant from the Ministère des Armées on the application of deep learning techniques for domain adaptation in speech processing - ANR projects JCJC DiSCogs, Deep Privacy, LEAUDS, Flash Open Science HARPOCRATES), helped by the TANGRAM (ANR project Prespin) and CAPSID (H2020 ITN project RNAct) research teams.

About the name: Gruss is a well-known French circus family, some of whose roots¹ are in Lorraine (Nancy’s region). Strange coincidence: the Arlette Gruss Circus² is currently on tour in Nancy…

¹: https://fr.wikipedia.org/wiki/Alexis_Grüss_Sénior

²: https://actu.fr/grand-est/nancy_54395/fin-des-animaux-sauvages-diner-spectacle-le-cirque-arlette-gruss-de-retour-a-nancy_46094985.html

-- Grid'5000 Team 15:20, November 5th 2021 (CET)

Debian 11 "Bullseye" is now the default environment in Sophia (uvb cluster)

As part of the migration to the latest Debian 11 version, the Sophia site has been updated:

  • debian11-x64-std is now the default environment on uvb cluster.
  • The frontend (fsophia) has been migrated to Debian 11.

The other Grid'5000 sites and clusters will be migrated once everything is validated in Sophia. This should occur in the next few weeks and we will communicate the exact date as soon as possible (it might impact your Grid'5000 usage, e.g. scripts needing to be updated...).

Feel free to test, and to report any problems you encounter in Sophia with the debian11 environments or the frontend to support-staff@lists.grid5000.fr.

-- Grid'5000 Team 15:53, Oct 28th 2021 (CEST)

Debian 11 "Bullseye" environments are ready for deployments

We are pleased to announce that Debian 11 (Bullseye) environments are now the default environments supported for deployments in Grid'5000. See `kaenv3 -l debian11%` for an exhaustive list of available variants.

The default environment available on nodes will remain the same (debian10-std) for some time to come (see below).

New features and changes in Debian 11 are described in: https://www.debian.org/releases/bullseye/amd64/release-notes/ch-whats-new.en.html .

In particular, it includes many software updates:

  • Cuda 11.2.2 / Nvidia Drivers 460.91.03
  • OpenJDK 17
  • Python 3.9.2
    • python3-numpy 1.19.5
    • python3-scipy 1.6.0
    • python3-pandas 1.1.5
  • Perl 5.32.1
  • GCC 9.3 and 10.2
  • G++ 10.2
  • Libboost 1.74.0
  • Ruby 2.7.4
  • CMake 3.18.4
  • GFortran 10.2.1
  • Liblapack 3.9.0
  • libatlas 3.10.3
  • RDMA 33.2
  • OpenMPI 4.1.0

Some additional important changes have to be noted:

  • Python 2 is not included, as it has been deprecated since Jan. 2020, and `/usr/bin/python` is symlinked to `/usr/bin/python3`. Note that Python 2 packages are still available in the repositories, and that you can install the 'python-is-python2' package to change the symlink.
  • OpenMPI
    • The OpenMPI backend used for Infiniband and Omnipath networks is UCX (this is similar to mainstream OpenMPI, but differs from the default behaviour in Debian 11).
    • The libfabric package has been recompiled to disable a buggy EFA provider (AWS)¹.
  • As Ganglia monitoring has been replaced by Kwollect², Ganglia services have been removed from environments.
  • The BeeGFS client is not operational³, which impacts the grcinq and grvingt clusters. As a workaround, users can deploy a debian10 environment to get BeeGFS.
  • Nvidia GPUs: the CUDA/Nvidia drivers shipped in the debian11 environments do not support some older GPUs, either out of the box or at all.
    • The CUDA compiler needs a compatibility mode to work with the grimani cluster (Tesla K40m): `nvcc --gpu-architecture=sm_35`.
    • Node graphique-1 has been retired because its GPU (Nvidia Titan Black) is no longer supported. The other nodes, graphique-[2-6] (GeForce GTX 980), are still supported. Please note that after this retirement, the graphique cluster is homogeneous (all nodes have the same GPUs).
  • The GPUs of the neowise cluster do not work on Debian 11 for now⁴ (no AMD GPU driver is available at the moment).
  • Systemd defaults to control groups v2 (cgroupv2) in bullseye.
  • As a reminder, legacy block device naming (e.g. sda) is no longer reliable with recent kernels, so you may need to modify your deployment or experiment scripts⁵.

The wiki documentation has been updated to take into account Debian 11 usage.

The migration of the default node environment to Debian 11 ("debian11-std") is almost ready and should happen in a few weeks.

¹: https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=13260

²: https://www.grid5000.fr/w/Monitoring_Using_Kwollect

³: https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=13077

⁴: https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=13159

⁵: https://www.grid5000.fr/w/News#Kadeploy:_use_of_partition_label_and_arbitrary_disk_identifier

-- Grid'5000 Team 17:08, Oct 18th 2021 (CEST)

Reorganisation of Taurus and Orion clusters' hard disks

Dear Grid'5000 users,

Due to an increasing number of malfunctioning disks, the hardware of the Taurus and Orion clusters in Lyon is going to change.

Currently, nodes in these clusters use 2 physical disks to provide a virtual disk space of 558GB (using RAID-0). From Monday, October 18th, only a single disk will be kept: the nodes' disk space will be reduced to 279GB, and I/O performance will also be degraded. However, this operation will make more Taurus and Orion nodes available.

The operation is planned for Monday, October 18th; nodes will be inaccessible for the whole day.

Best regards,

-- Grid'5000 Team 16:45, October 13th 2021 (CEST)

Behavior change about disks identification

Since Linux 5.3[1], SCSI device probing is no longer deterministic, which means that it is no longer possible to rely on a block device name (sda, sdb, and so on) to correctly identify a disk on Grid'5000 nodes. In practice, a given disk that is called sda at one point might be renamed to sdb after a reboot.

As Debian 11 uses Linux 5.10, some changes are required to handle this issue. Grid'5000's internal tooling and the reference repository have been updated so that they no longer rely on block device names. These modifications are transparent for Grid'5000 users.

An arbitrary identifier was introduced for each disk in the reference repository, in the form diskN (disk0 being the primary disk of a node). These identifiers can be found on each site's hardware page and in the description of each node in the reference API. The device key (containing the block device name) was removed from each node description.

To ease disk manipulation on nodes, symbolic links have been added in the /dev/ directory: each disk identifier (e.g. /dev/disk2) points to the correct block device. Similarly, links have been added for partitions: for instance, /dev/disk2p1 points to the first partition of disk2. We recommend using these paths from now on. These symbolic links are managed by udev rules that g5k-postinstall deploys with the new '--disk-aliases' option. Therefore, the links can also be added inside custom environments by passing this option in the kaenv description (at the postinstall level).
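You can inspect the aliases described above with a short loop. The sketch below is our own, and the `list_disk_aliases` name is hypothetical. It resolves each /dev/diskN link with `readlink -f` (GNU coreutils) and skips partition aliases such as /dev/disk2p1. The directory is a parameter only so that the function can be tried outside a Grid'5000 node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the diskN aliases found in a directory (default: /dev)
# together with the block device each one resolves to.
list_disk_aliases() {
  local dir="${1:-/dev}" link name
  for link in "$dir"/disk[0-9]*; do
    [ -L "$link" ] || continue       # no match, or not a symbolic link
    name="${link##*/}"
    case "$name" in
      disk[0-9]*p[0-9]*) continue ;; # partition alias such as disk2p1
    esac
    printf '%s -> %s\n' "$name" "$(readlink -f "$link")"
  done
}
```

On a node where the aliases are installed, running `list_disk_aliases` with no argument should print one `diskN -> /dev/sdX` line per disk. Scripts can then refer to `/dev/disk1` or `/dev/disk1p1` instead of a block device name.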

The documentation[2] about disk reservation has also been updated to explain how to identify and use the reserved disk(s).

1: https://lore.kernel.org/lkml/59eedd28-25d4-7899-7c3c-89fe7fdd4b43@acm.org/t/

2: https://www.grid5000.fr/w/Disk_reservation

-- Grid'5000 Team 11:44, October 13th 2021 (CEST)


Major update of BIOS and other firmware

Since this summer, we have carried out a campaign of firmware updates (BIOS, network interface cards, RAID adapters…) on the nodes of most Grid'5000 clusters.

These updates may provide better hardware reliability or mitigations for security issues.

However, these changes may have an impact on your experiments (particularly in terms of performance). There is no good solution to this issue, as it is often hard or impossible to downgrade BIOS versions.

Remember that firmware versions are published in the reference API. We recommend that you use this information to track down changes that could affect your experiments.

For instance, in https://api.grid5000.fr/stable/sites/nancy/clusters/gros/nodes/gros-1.json?pretty=1 , see bios.version and firmware_version. You can also browse previous versions using the API ¹, or using the GitHub commit history ².
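To automate such checks, you can build the URL of a node description from its site and name, following the URL pattern shown above. The sketch below assumes that node names have the form `<cluster>-<number>`. The `node_api_url` helper is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the reference API URL of a node description,
# following the URL pattern used on this page. Assumes node names of the form
# <cluster>-<number> (e.g. gros-1 belongs to cluster gros).
node_api_url() {
  local site="$1" node="$2"
  local cluster="${node%-*}"
  printf 'https://api.grid5000.fr/stable/sites/%s/clusters/%s/nodes/%s.json\n' \
    "$site" "$cluster" "$node"
}
```

Assuming the command is run from inside Grid'5000 (for example on a frontend) and that `curl` and `jq` are available, `curl -s "$(node_api_url nancy gros-1)" | jq -r '.bios.version'` should print the node's BIOS version. Saving this output before and after an experiment campaign makes firmware changes easy to spot.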

We will continue to update such firmware in the future (always synchronized for the same cluster and documented in the reference API).

¹: https://api.grid5000.fr/doc/3.0/reference/spec.html#get-3-0-item-uri-versions

²: https://github.com/grid5000/reference-repository/commits/master

-- Grid'5000 Team 16:55, October 12th 2021 (CEST)


Kadeploy: use of partition label and arbitrary disk identifier

Kadeploy now uses GPT labels to identify each partition. In most cases, this change should be transparent. It will only impact you if you deploy on a specific partition or use a custom partitioning scheme. Please refer to the “Advanced Kadeploy” page: https://www.grid5000.fr/w/Advanced_Kadeploy .

Moreover, disks are no longer identified by their block device names. Instead, Kadeploy now uses arbitrary disk identifiers (disk0, disk1...). You can find these identifiers by querying the API¹ or on each site's hardware page. This should be transparent if you only use kadeploy on the default disk. If not, you must use "kadeploy3 -b disk1 ..." instead of "kadeploy3 -b sdb ...".

This change is part of a larger effort to stop using block device names (e.g. sda, sdb...), since recent Linux kernels no longer guarantee persistent (deterministic) naming of disk block devices across reboots². We are also planning to remove them from the API in the near future.

¹ Check the id key in the storage_devices section of the node description, e.g. https://api.grid5000.fr/stable/sites/nancy/clusters/graoully/nodes/graoully-1.json?pretty

² https://lore.kernel.org/lkml/59eedd28-25d4-7899-7c3c-89fe7fdd4b43@acm.org/t/

-- Grid'5000 Team 16:09, September 30th 2021 (CEST)

New service available: reconfigurable Firewall for IPv6

Dear users,

A new service is available, which allows opening the Grid'5000 firewall for IPv6 connections from the Internet.

This service should make it easier to interconnect experiments in Grid'5000 with outside resources, without the burden and performance impact of setting up tunnels (VPNs, ssh tunnels, etc.).

To benefit from this new possibility, you need to use IPv6. Grid'5000 nodes have globally routable IPv6 addresses, which allows them to be reached from the Internet, whereas in IPv4, network address translation prevents connections from being initiated from the outside.

Grid'5000 firewall openings are requested using an API (which allows complete automation of these requests). These openings are allowed for Grid'5000 nodes in the context of OAR jobs and are closed at the end of the jobs.

A detailed documentation and usage example is available in the wiki: https://www.grid5000.fr/w/Reconfigurable_Firewall

IPv6 related documentation is also available in the wiki: https://www.grid5000.fr/w/IPv6

-- Grid'5000 Team 10:41, September 10th 2021 (CEST)

Usage policy update: changing the BIOS or BMC setting or flashing firmware is not allowed

Dear users,

The platform's usage policy has been extended to prevent uncontrolled changes of BIOS or BMC settings or of any device firmware.

Please read https://www.grid5000.fr/w/Grid5000:UsagePolicy#Changing_the_BIOS_or_BMC_setting_or_flashing_firmware_is_not_allowed.

-- Grid'5000 Team 10:30, September 8th 2021 (CEST)

Grid'5000 metadata bundler is available in alpha version

Dear users,

When running experiments on Grid'5000, users generate metadata across multiple services. Extracting and storing this metadata is useful for scientific data management and reproducibility. We are currently developing a tool that collects the metadata of an experiment from the different Grid'5000 services and bundles it in a single compressed archive, making it easier to find, study, and store information about your experiments.

The g5k-metadata-bundler is now available in alpha version on all Grid'5000 frontends. At this point, the bundles generated by g5k-metadata-bundler contain:

  • the OAR information of a single job,
  • copies of the specifications of the nodes involved,
  • all monitoring information collected by kwollect during the job.

As the software is still in alpha version, the contents and structure of the bundle are subject to change. A list of planned features is available on the dedicated wiki page.

We are also interested in user feedback on the support mailing list.

-- Grid'5000 Team 14:30, July 22nd 2021 (CEST)

New AMD cluster named "neowise" available in Lyon for testing

Dear users,

A new cluster, named "neowise" is available in Lyon for testing.

This machine is a donation from AMD to Genci and Inria to support the French research community (in particular for work related to COVID-19).

The cluster has 10 nodes, each including an AMD EPYC 7642 48-Core processor, 512GB of RAM, 8 Radeon MI50 GPUs, and an HDR Infiniband network (200Gbps). Its full characteristics are described at: https://www.grid5000.fr/w/Lyon:Hardware#neowise

The cluster is still in its testing phase, and a few issues are known:

  • A few nodes currently have problems and will be unavailable until fixed.
  • The software stack for AMD GPUs is incomplete. The HIP compiler is included in the default environment, but many libraries and software packages (such as deep learning frameworks) are missing. They will be added soon (mostly as "environment modules").
  • Some Grid'5000 "advanced" features are missing. An overview of what is working correctly is available at: https://intranet.grid5000.fr/jenkins-status?config=neowise

The neowise cluster is tagged as "exotic" and is currently available in "testing" queue. To submit a job, don't forget to add the appropriate options to oarsub. For instance:

$ oarsub -q testing -t exotic -p "cluster = 'neowise'" -I

We would like to thank AMD for this donation and Genci for their successful collaboration in making this machine available in Grid'5000.

-- Grid'5000 Team 14:30, June 24th 2021 (CEST)

Debian 11 "Bullseye" preview environments are now available for deployments

Debian 11 stable (Bullseye) will be released in a few weeks. We are pleased to offer a "preview" of kadeploy environments for Debian 11 (currently still Debian testing), that you can deploy already. See the debian11 environments in kaenv3.

New features and changes in Debian 11 are described in: https://www.debian.org/releases/bullseye/amd64/release-notes/ch-whats-new.en.html .

In particular, it includes many software updates:

  • Cuda 11.2 / Nvidia drivers 460.73.01
  • OpenJDK 17
  • Python 3.9.2
    • python3-numpy 1.19.5
    • python3-scipy 1.6.0
    • python3-pandas 1.1.5
  • Perl 5.32.1
  • GCC 9.3 and 10.2
  • G++ 10.2
  • Libboost 1.74.0
  • Ruby 2.7.3
  • CMake 3.18.4
  • GFortran 10.2.1
  • Liblapack 3.9.0
  • libatlas 3.10.3
  • RDMA 33.1
  • OpenMPI 4.1.0

(Note that Python 2 is not included, as it has been deprecated since Jan. 2020, and that /usr/bin/python is symlinked to /usr/bin/python3.)

Known regressions and problems are :

  • The std environment is not ready yet, and will not be the default Grid'5000 environment until official Debian 11 Bullseye release.
  • Cuda/Nvidia drivers do not support - out-of-the-box or at all - some quite old GPUs.
  • BeeGFS is not operational at the moment.

Let us know if you would like us to support tools or software that are not available in the big images.

As a reminder, you can use the following commands to deploy an environment on nodes (https://www.grid5000.fr/w/Getting_Started#Deploying_nodes_with_Kadeploy):

 $ oarsub -t deploy -I
 $ kadeploy3 -e debian11-x64-big -f $OAR_NODE_FILE -k

Please report any problem you may encounter with the debian11 environments to support-staff@lists.grid5000.fr.

-- Grid'5000 Team 16:30, June 14th 2021 (CEST)

Kadeploy: use of UUID partition identifiers and faster deployments

Up to now, kadeploy was identifying disk partitions with their block device names (e.g. /dev/sda3) when deploying a system. This no longer works reliably because of disk inversion issues with recent kernels. As a result, we have changed kadeploy to use filesystem UUIDs instead.

This change affects the root partition passed to the kernel command line as well as the generated /etc/fstab file on the system.

If you want to keep identifying partitions by their block device names, you can use the "--bootloader no-uuid" and "--fstab no-uuid" options of g5k-postinstall in the postinstalls/script field of your environment description. Please refer to the "Customizing the postinstalls" section of the "Advanced Kadeploy" page: Advanced_Kadeploy#Using_g5k-postinstall

As an additional change, Kadeploy now tries to use kexec more often, which should make the first deployment of a job noticeably faster.

-- Grid'5000 Team 17:30, June 9th 2021 (CEST)

New monitoring service with Kwollect is now stable

The new Grid'5000 monitoring service based on Kwollect is now stable. Kwollect now serves requests addressed to the Grid'5000 "Metrology API", i.e., at this URL:

https://api.grid5000.fr/stable/sites/SITE/metrics
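As a rough illustration, the following Python sketch builds a query URL for this endpoint. The query parameter names (nodes, metrics, start_time, end_time) are assumptions based on the Kwollect documentation and should be checked against Monitoring_Using_Kwollect; the helper name kwollect_url is hypothetical, and the node names are only examples.

```python
from urllib.parse import urlencode

# Metrology API endpoint served by Kwollect ({site} is e.g. "lyon").
METRICS_URL = "https://api.grid5000.fr/stable/sites/{site}/metrics"


def kwollect_url(site, nodes, metrics=None, start_time=None, end_time=None):
    """Build a Kwollect query URL (query parameter names are assumptions)."""
    params = {"nodes": ",".join(nodes)}
    if metrics:
        params["metrics"] = ",".join(metrics)
    if start_time:
        params["start_time"] = start_time
    if end_time:
        params["end_time"] = end_time
    # Keep commas and colons unescaped so lists and timestamps stay readable.
    return METRICS_URL.format(site=site) + "?" + urlencode(params, safe=",:")


print(kwollect_url("lyon", ["nova-1", "nova-2"], start_time="2021-05-21T12:00"))
```

The resulting URL could then be fetched, for example with curl, from a Grid'5000 frontend.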

The former API based on Ganglia is no longer available, and Ganglia will be removed from Grid'5000 environments starting from the next Debian version (debian11).

Main features of Kwollect are:

  • Focus on "environmental" monitoring, i.e. metrics not available from inside the nodes: electrical consumption, temperature, metrics from network equipment or nodes' BMC… but Kwollect also monitors nodes' metrics from Prometheus exporters
  • Support for Grid'5000 wattmeters at high frequency
  • On-demand activation of optional metrics
  • Custom metrics can be pushed by users
  • Grafana-based visualization

Usage of this new service is described at: Monitoring_Using_Kwollect

Here are the other main changes since the last announcement in December:

  • Monitoring-related entries in the Reference API have been cleaned up (nodes' "sensors" and "wattmetre" keys removed, "pdu" moved to the top level), and the OAR properties wattmeter=SHARED and wattmeter=MULTIPLE have been removed
  • Visualization now uses Grafana
  • Metrics naming and polling period have been updated
  • Grid'5000 documentation has been updated
  • Bug and performance fixes

-- Grid'5000 Team 12:00, May 21st 2021 (CEST)

Upgrade of the graffiti-13 node in Nancy

We are happy to announce that the graffiti-13 node of the Nancy site, in the production queue, has been upgraded: its 4 GeForce RTX 2080 Ti 11GB GPUs have been replaced by 4 RTX 6000 24GB GDDR6[1].

Reservation example : oarsub -q production -p "host='graffiti-13.nancy.grid5000.fr'" -I

[1] https://www.nvidia.com/fr-fr/design-visualization/quadro/rtx-6000/

-- Grid'5000 Team 17:00, May 6th 2021 (CEST)

Yeti nodes now equipped with 2 NVMe of 1.6 TB and made exotic resources

The yeti nodes of the Grenoble site now have 2 NVMe drives of 1.6TB again (back to their original 2018 configuration), and the troll nodes have new NVMe drives (also 1.6TB, but a newer model).

The 4 new NVMe drives were financed by the Software Heritage project. Many thanks.

Since the 4 yeti nodes are the only quad-CPU resources of Grid'5000 and have 2 NVMe drives, they are now defined as exotic resources, and must be reserved with `-t exotic` in the oarsub arguments.

-- Grid'5000 Team 16:00, April 21st 2021 (CEST)

Docker cache available inside Grid'5000

With the growing use of Docker, you might have experienced rate limiting when pulling images, due to Docker Hub's recent policy change.

As a result, we now provide a Docker cache, accessible from inside Grid'5000 at the following address: http://docker-cache.grid5000.fr

For configuration instructions and more information, see: Using docker-cache.grid5000.fr

-- Grid'5000 Team 09:00, April 13th 2021 (CEST)

Grid'5000 environments now use zstd compression

We are moving the default image archive format used for Grid'5000 environments from tar.gz to tar.zst. Switching to zstd allows for significantly faster decompression.

This change is transparent for most usages. Kadeploy will automatically use zstd for environments that use it (e.g. new environments); the main impact is that deployments will be faster. If kameleon is used to generate an environment (thus creating an environment description file and an image archive), it will now use zstd by default.

However, if you manually modify the description of an environment and/or regenerate an image archive with tgz-g5k (using the -z switch for zstd), you may have to make sure that both use the same compression.
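To check that a regenerated archive matches its description, one option is to look at the archive's leading magic bytes (gzip streams start with 1f 8b, zstd frames with 28 b5 2f fd). The sketch below assumes the description declares the compression as a plain string such as "zstd" or "gzip"; check the actual field in your environment description. The function names are hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"
# Zstandard frames start with 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def archive_compression(path):
    """Guess an archive's compression from its first bytes ("gzip", "zstd" or None)."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return None


def check_compression(path, declared):
    """Raise ValueError if the declared compression does not match the archive."""
    actual = archive_compression(path)
    if actual != declared:
        raise ValueError(f"{path}: description says {declared!r}, archive is {actual!r}")
```

For example, check_compression("myenv.tar.zst", "zstd") (hypothetical file name) raises an error if the archive was in fact produced with gzip.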

This change is part of a larger effort to improve the performance of kadeploy.

-- Grid'5000 Team 16:00, March 31st 2021 (CEST)

Guix package manager available on Grid'5000

We now offer the possibility to use Guix on Grid'5000 as a package manager to install new software or build container images without the need to be root.

It is available on frontends and nodes (in the standard environment, or in deployed *-nfs or *-big environments when using your own account).

This work has been done in collaboration with the Guix HPC team.

For more information see: Guix

-- Grid'5000 Team 14:20, March 18th 2021 (CET)

New cluster "grouille" available in Nancy

We have the pleasure to announce that a new cluster named "grouille" is available in Nancy in the default queue. It features 2 Dell R7525 nodes with two AMD EPYC 7452 32-Core CPUs, two NVidia A100-PCIE-40GB GPUs, 128GB of DDR-4 RAM, one 1.8TB SAS SSD and one 900GB SATA SSD (reservable). The reservation of the Grouille nodes requires using the "exotic" OAR job type.

This cluster has been funded by CPER Cyber Entreprises.

-- Grid'5000 Team 17:00, March 15th 2021 (CET)

New Jupyter Notebooks interface

We are pleased to announce that Grid'5000 now offers a web interface to run and edit notebooks on the testbed. Available at https://intranet.grid5000.fr/notebooks/ , this interface allows users to start notebook applications on sites' frontends or in OAR jobs.

Computational notebooks combine text, code cells, and the recorded outputs of the code's execution in a single file. They can be used to track the evolution of an experiment during the exploratory phase, by recording one's thought process in the text cells and the experiments in the code cells. They can also be used to create self-generating experiment reports, where the general explanation is in the text and the precise data points are generated by the code in the output cells.

More info is available at Notebooks .

-- Grid'5000 Team 15:00, March 8th 2021 (CET)

Change in the Omnipath network of Grenoble's site: all nodes now connected to a single switch

The Intel Omnipath network of the Grenoble site was recently changed due to technical constraints: it is now composed of a single Omnipath switch to which all nodes are connected (32 dahu + 4 yeti + 4 troll). In the past, the network was shared with the compute nodes of the HPC center of Université Grenoble Alpes, and the yeti and troll nodes were connected to a different switch than the dahu nodes.

Please take this change into account in your experiments.

More info: Grenoble:Network

-- Grid'5000 Team 15:00, March 3rd 2021 (CET)

Kwapi service is going to be stopped

Kwapi, the legacy service to provide energy consumption monitoring under Grid'5000, is going to be retired. It will be stopped in two weeks.

We encourage you to adopt its successor: Kwollect https://www.grid5000.fr/w/Monitoring_Using_Kwollect

-- Grid'5000 Team 08:30, February 10th 2021 (CET)

IBM POWER8 cluster "drac" with P100 GPUs fully available in Grenoble

After a few months of testing, we are happy to announce that the new cluster "drac" is now available in the default queue in Grenoble.

The cluster has 12 "Minsky" nodes from IBM, more precisely "Power Systems S822LC for HPC". Each node has 2x10 POWER8 CPU cores, 4 Tesla P100 GPUs, and 128 GB of RAM. The GPUs are interconnected pair-wise with a NVLINK fabric that is also connected to the CPUs (unlike the DGX-1 gemini nodes for instance). The nodes are interconnected with a high-speed Infiniband network with 2x100 Gbit/s on each node.

Be aware that this cluster is using an exotic architecture, and as such it may be more difficult than usual to install software. In particular, if you need to install deep learning frameworks on these nodes, make sure to read our documentation:

Deep_Learning_Frameworks#Deep_learning_on_ppc64_nodes

In addition, when reserving the nodes, you need to explicitly ask for an "exotic" job type. For instance, to obtain a single node:

grenoble> oarsub -I -t exotic -p "cluster='drac'"

Acknowledgment: This cluster was donated by GENCI (https://www.genci.fr/en), many thanks to them. It was formerly known as the Ouessant platform of Genci's Cellule de Veille Technologique.

More information on the hardware is available at Grenoble:Hardware#drac

-- Grid'5000 Team 17:30, February 9th 2021 (CET)

Troll and Gemini clusters are now exotic resources (change in the way to reserve them)

The Troll cluster in Grenoble and the Gemini cluster in Lyon are now considered exotic resources, and must be reserved using the exotic OAR job type. When a cluster on Grid'5000 has a hardware specificity that makes it too different from a "standard" configuration, it can only be reserved using the exotic OAR job type. There are 2 reasons for this:

  • It ensures that your experiment won't run on potentially incompatible hardware unless you explicitly allow it (for example, you don't want to get an aarch64 cluster if your experiment is built for x86).
  • By not allocating these resources to jobs by default, it makes them more easily available to users who are looking specifically for this kind of hardware.

There is an example of usage of the exotic job type in the Getting Started page: https://www.grid5000.fr/w/Getting_Started#Selecting_specific_resources

You can see whether a cluster is exotic in the reference API or on the Hardware page of the wiki (https://www.grid5000.fr/w/Hardware#Clusters). There are currently 4 clusters which need the exotic job type to be reserved:

  • pyxis because it has a non-x86 CPU architecture (aarch64)
  • drac because it has a non-x86 CPU architecture (ppc64)
  • troll because it has PMEM ( https://www.grid5000.fr/w/PMEM )
  • gemini because it has 8 V100 GPUs per node, and only 2 nodes
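For scripts that build oarsub command lines, a minimal sketch (a hypothetical helper, not an official Grid'5000 tool) could add the exotic job type automatically. The cluster set below only reflects the exotic clusters mentioned in these news items (the four above, plus yeti, grouille and neowise announced later); the reference API remains the authoritative source.

```python
# Clusters that these news items describe as requiring the "exotic" job type.
# This is a snapshot: the reference API or the Hardware page is authoritative.
EXOTIC_CLUSTERS = {"pyxis", "drac", "troll", "gemini", "yeti", "grouille", "neowise"}


def oarsub_args(cluster, queue=None, interactive=True):
    """Return an oarsub argument list, adding "-t exotic" when needed."""
    args = ["oarsub"]
    if queue:
        args += ["-q", queue]
    if cluster in EXOTIC_CLUSTERS:
        args += ["-t", "exotic"]
    args += ["-p", f"cluster='{cluster}'"]
    if interactive:
        args.append("-I")
    return args
```

The returned list can be passed directly to subprocess.run() on a frontend, which avoids shell quoting issues with the -p expression.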

-- Grid'5000 Team 09:40, January 26th 2021 (CET)

New Grid'5000 API's documentation and specification

Grid'5000 API's documentation has been updated. Before this update, the documentation contained both the specification and tutorials of the API (with some parts also present in the wiki).

To be more consistent, https://api.grid5000.fr/doc/ now provides only the specification (HTTP paths, parameters, payload, …). All tutorials have been updated and moved to the Grid'5000 wiki.

The new API specification can be viewed with two tools: the first one lets you read the specification and find information; the second one lets you explore the API through a playground.

Please note that the specification may contain errors. Please report any such errors to the Support Staff.

-- Grid'5000 Team 14:30, January 11th 2021 (CET)

Important changes in the privileges levels of users

Each Grid'5000 user is a member of at least one granting access group, which depends on their situation (location, laboratory, ...). Each group is given a privilege level (bronze, silver, gold), depending on how the related organization is involved in Grid'5000's development and support.

Until now, however, these levels had no impact on how Grid'5000 could be used.

Starting from December 10th, 2020, each user will be granted different usages of the testbed depending on their privilege level. In particular:

  • Every level continues to give access to the Grid'5000 default queue (most of the Grid'5000 resources);
  • Access to the production and besteffort queues will only be granted to the silver and gold levels.

The complete description of each level of privileges is available here.

The privilege level of the groups a user is a member of is shown in the "group" tab of the management interface.

Note that if a user is a member of several groups, one is set as the default and is implicitly used when submitting jobs. However, the "--project" OAR option can be used to explicitly set which group the job should use. For instance:

oarsub -I -q production --project=myothergroup

Do not hesitate to contact the Support Staff for any questions related to the privilege levels.

-- Grid'5000 Team 15:30, December 8th 2020 (CET)

Reminder: Testing phase of the new monitoring service named Kwollect

As a reminder, the testing phase of Kwollect, the new monitoring solution for Grid'5000, is still ongoing.

Some new features have been made available since the last announcement:

  • Support for Prometheus metrics
  • Basic visualization dashboard
  • Fine-tuning of on-demand metrics
  • Ability to push your own metrics

See: Monitoring Using Kwollect

Do not hesitate to give us some feedback!

Kwollect is intended to replace the legacy monitoring systems, Kwapi and Ganglia, in the (hopefully) near future.

-- Grid'5000 Team 09:00, December 1st 2020 (CET)

New IBM POWER8 cluster "drac" available for beta testing in Grenoble

We are happy to announce that a new cluster "drac" is available in the testing queue in Grenoble.

The cluster has 12 "Minsky" nodes from IBM, more precisely "Power Systems S822LC for HPC". Each node has 2x10 POWER8 cores, 4 Tesla P100 GPUs, and 128 GB of RAM. The GPUs are directly interconnected with a NVLINK fabric. Support for Infiniband 100G and kavlan is planned to be added soon.

This is the first cluster in Grid'5000 using the POWER architecture, and we are also aware of some stability issues related to GPUs, hence the "beta testing" status for now. Feedback is highly welcome on performance and stability, as well as on our software environment for this new architecture.

Note that, since the CPU architecture is not x86, you need to explicitly ask for an "exotic" job type when reserving the nodes. For instance, to get a single GPU:

grenoble$ oarsub -I -q testing -t exotic -l gpu=1

More information on the hardware is available at Grenoble:Hardware

Acknowledgment: This cluster was donated by GENCI, many thanks to them. It was formerly known as the Ouessant platform of Genci's Cellule de Veille Technologique.

-- Grid'5000 Team 15:50, November 26th 2020 (CET)

OAR job container feature re-activated

We have the pleasure to announce that the OAR job container feature has been re-activated. It allows inner jobs to be executed within the boundaries of a container job.

It can be used, for example, by a professor to reserve resources with a container job for a teaching lab, and let students run their own jobs inside that container.

More information on job containers is available here.

Please note that if a single user needs to run multiple tasks (e.g. SPMD) within a bigger resource reservation, it is preferable to use a tool such as GNU Parallel: GNU Parallel

-- Grid'5000 Team 15:40, November 19th 2020 (CET)

Grid'5000 global vlans and stitching now available through the Fed4FIRE federation

We announced in June that Grid'5000 nodes were now available through the Grid'5000 Aggregate Manager to users of the Fed4FIRE testbed Federation. Fed4FIRE is the largest federation worldwide of Next Generation Internet (NGI) testbeds, which provide open, accessible and reliable facilities supporting a wide variety of different research and innovation communities and initiatives in Europe.

We now have the pleasure to announce that the Aggregate Manager also allows the reservation of Grid'5000 global VLANs and inter-testbed stitching through the federation's jFed-Experiment application and tools designed to access GENI testbeds.

Inter-testbed stitching allows users to link Grid'5000 global VLANs to external VLANs connected to other testbeds in the federation. Using this technology, users can set up experiments around wide-area L2 networks.

Grid’5000 users wanting to use the aggregate manager should link their Fed4FIRE account to their Grid’5000 one using the process described in the wiki.

-- Grid'5000 Team 10:00, November 16th 2020 (CET)

New cluster grappe available in the production queue

We have the pleasure to announce that a new cluster named "grappe" is available at Nancy¹.

We chose that name to celebrate the famous Alsatian wines, as the network part of grappe was mistakenly delivered to Strasbourg. We interpreted that funny event as a kind of "birth origin".

The cluster features 16 Dell R640 nodes with 2 Intel® Xeon® Gold 5218R (20 cores/CPU), 96 GB of RAM, a 480 GB SSD, an 8.0 TB reservable HDD, and 25 Gbps Ethernet.

Energy monitoring² is available for the cluster.

The cluster has been funded by the CPER IT2MP (Contrat Plan État Région, Innovations Technologiques, Modélisation & Médecine Personnalisée) and FEDER (Fonds européen de développement régional)³.

¹: https://www.grid5000.fr/w/Nancy:Hardware

²: https://www.grid5000.fr/w/Energy_consumption_monitoring_tutorial

³: https://www.loria.fr/fr/la-recherche/axes-scientifiques-transverses/projets-sante-numerique/

-- Grid'5000 Team 14:30, October 16th 2020 (CET)

ARM64 cluster Pyxis available in the default queue

We have the pleasure to announce that the "pyxis" cluster in Lyon is now available in the default queue! It is composed of 4 nodes with ARM64 CPUs (ThunderX2 9980 with 2x32 cores), 256 GB RAM, 2 x 250 GB HDD, 10 Gbps Ethernet and a 100 Gbps Infiniband interface. Each node's power consumption is monitored with the Lyon wattmetre.

Pyxis nodes must be reserved using the "exotic" oar job type (add "-t exotic -p cluster='pyxis' " to your OAR submission). Several arm64 environments are available to be deployed on this cluster: https://www.grid5000.fr/w/Advanced_Kadeploy#Search_and_deploy_an_existing_environment

This cluster has been funded by the CPER LECO++ Project (FEDER, Région Auvergne-Rhone-Alpes, DRRT, Inria).

-- Grid'5000 Team 10:50, September 30th 2020 (CEST)

IPv6 on Grid'5000

Grid'5000 now offers IPv6 connectivity to the Internet from nodes!
To get a global IPv6 address on an interface, you can run dhclient -6 <interface> on your node.

Inside Grid'5000, new DNS entries have been added. For example :

  • dahu-1-ipv6.grenoble.grid5000.fr for the IPv6 address on the main interface of dahu-1
  • chetemi-2-eth1-ipv6.lille.grid5000.fr for the IPv6 address on the secondary interface of chetemi-2
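The two examples above suggest a naming pattern: <node>-ipv6.<site>.grid5000.fr for the main interface and <node>-<interface>-ipv6.<site>.grid5000.fr for other interfaces. The sketch below assumes this pattern holds for all nodes (an inference from the examples, not a documented rule); the helper names are hypothetical, and name resolution uses the standard socket module.

```python
import socket


def ipv6_hostname(node, site, interface=None):
    """Build an IPv6 DNS name following the pattern of the examples above.

    interface=None stands for the node's main interface.
    """
    label = node if interface is None else f"{node}-{interface}"
    return f"{label}-ipv6.{site}.grid5000.fr"


def resolve_ipv6(hostname):
    """Return the sorted IPv6 addresses of hostname.

    Meant to be run inside Grid'5000, where these DNS entries are defined.
    """
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    return sorted({info[4][0] for info in infos})
```

For instance, resolve_ipv6(ipv6_hostname("dahu-1", "grenoble")) would query the entry for the main interface of dahu-1.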

Please note that IPv6 connectivity to the Internet is now available, but we are still working on some improvements and additional features, such as:

  • IPv6 addresses in KaVLAN. For now, IPv6 is only available on the production network.
  • A tool to open access to your nodes from Internet. Incoming traffic from Internet to nodes is currently filtered.

-- Grid'5000 Team 16:00, September 17th 2020 (CEST)

Refactoring of the users' portal

We refactored the users' portal page of the Grid'5000 website, in order to give a better structure to the set of tutorials and guides. Some titles were slightly modified, as well as some content, but you should be able to easily find all the information you have been using so far. For newcomers, we hope it will be easier to find out how to use the testbed and its many features.

Also note that we removed the list of pages automatically provided by the wiki engine at the bottom, as we think this was not so relevant. All pages, even the outdated ones, can however still be found in the wiki.

-- Grid'5000 Team 16:00, July 9th 2020 (CEST)

Grid'5000 General Conditions of Use now available

To clarify the legal frame and the application of law surrounding Grid'5000 usage, we added a "General Conditions of Use" document to the Usage Policy.

From now on, usage of Grid'5000 must be in accordance with those two documents:

-- Grid'5000 Team 17:00, June 30th 2020 (CEST)

New monitoring service for Grid'5000 available for beta testing

We are happy to announce a new service to provide environmental and performance metrics monitoring for Grid'5000 experiments.

The service uses the Kwollect framework and currently makes the following metrics available:

  • Energy consumption from dedicated “wattmetre” devices (currently available for some clusters in Lyon, Grenoble, Nancy)
  • Metrics collected from nodes’ Baseboard Management Controller (out-of-band management hardware, e.g. Dell iDRAC), such as ambient temperature, hardware component temperature, energy consumption from PSU, fan speed, etc.
  • Traffic collected from network devices
  • Energy consumption from PDU, when available

It is also planned to add more metrics in the future.

The service documentation is available at: https://www.grid5000.fr/w/Monitoring_Using_Kwollect

The service is currently in beta. Many things may change in the future. Feel free to give us your feedback at support-staff@lists.grid5000.fr

-- Grid'5000 Team 16:00, June 30th 2020 (CEST)

Grid’5000 resources now available through the Fed4FIRE federation.

We have the pleasure to announce that Grid’5000 is now accessible through the Fed4FIRE European testbed federation. https://www.grid5000.fr/w/Fed4FIRE

Fed4FIRE is the largest federation worldwide of Next Generation Internet (NGI) testbeds, which provide open, accessible and reliable facilities supporting a wide variety of different research and innovation communities and initiatives in Europe.

The Grid’5000 Aggregate Manager will allow the provisioning of Grid’5000 nodes to members of the federation using the federation’s jFed-Experiment application, or tools designed to access GENI testbeds. Please note that functions related to VLANs and inter-testbed stitching are still in development, and cannot yet be performed through the Aggregate Manager.

Grid’5000 users wanting to use the aggregate manager should link their Fed4FIRE account to their Grid’5000 one using the process described in the wiki.

-- Grid'5000 Team 11:30, June 29th 2020 (CEST)

A short documentation on installing Deep Learning frameworks under Grid'5000

We wrote a short documentation to explain how to install popular Deep Learning frameworks in Grid'5000. It is available at: https://www.grid5000.fr/w/Deep_Learning_Frameworks

Note that the installation steps in the documentation are regularly executed under our testing infrastructure to ensure these tools work as expected at any time.

-- Grid'5000 Team 09:49, June 29th 2020 (CEST)

Integration of groups granting access with OAR

We are pleased to announce a new feature to make OAR work with Grid'5000 user groups (groups granting access).

When reserving resources, OAR will automatically add the group (project) of the user to the job metadata, for accounting purposes.

For users belonging to several groups, the default one will be used. If no default is set, an error will be displayed asking to use the --project option of oarsub, or to select the default group in the User Management System (https://api.grid5000.fr/stable/users/).

A short summary about the addition of groups granting access on Grid'5000 can be found in this previous news.

-- Grid'5000 Team 11:43, June 23rd 2020 (CEST)

New OAR jobs types to restrict jobs to daytime or night/week-end time

We are pleased to announce a new feature to help submit batch jobs (as opposed to advance reservations) that respect the day vs. night or week-end time frames defined in the usage policy.

It can be activated using the oarsub -t <job type> option as follows:

oarsub -t day ... to submit a job to run during the current daytime. As such:

  • It will be forced to run between 9:00 and 19:00, or the next day if the job is submitted during the night.
  • If the job has not been able to run before 19:00, it will be deleted.

oarsub -t night ... to submit a job to run during the coming (or current) night (or week-end on Friday). As such:

  • It will be forced to run after 19:00, and before 9:00 for week nights (Monday to Thursday nights), or before 9:00 on the next Monday for a job which runs during a week-end.
  • If a job cannot be scheduled during the current night because not enough resources are available, it will be kept in the queue for an automatic retry the next night (the hour constraints will be moved to the next night slot), for up to 7 days.
  • If the walltime of the job is more than 14h, the job will obviously not run before a week-end, since a week night only lasts 14h.

Note that:

  • the maximum walltime for a night is 14h, but due to some overhead in the system (resource state changes, reboots...), it is strongly advised to limit the walltime to at most 13h30. Furthermore, a shorter walltime (at most a few hours) gives a better chance of getting a job scheduled when many jobs are already in the queue.
  • jobs with a walltime greater than 14h will be required to run during the week-ends. Even if submitted at the beginning of the week, they will not be scheduled before Friday morning. Thus, any advance reservation made before Friday will take precedence. Also, given that the rescheduling happens on a daily basis for the next night, advance reservations take precedence if they are submitted before the daily rescheduling. In practice, this mechanism thus provides a low-priority way to submit batch jobs during nights and week-ends.
  • a waiting job will be kept max 7 days in queue before deletion if it could not run because of lack of resources.
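The time frames above can be modelled with a short Python sketch. It only illustrates the rules as described here (week nights from 19:00 to 9:00, opening Monday to Thursday evenings; the week-end from Friday 19:00 to Monday 9:00); it is not OAR's implementation and ignores public holidays, scheduling overhead and the 7-day retry. night_slot and fits_in_slot are hypothetical names.

```python
from datetime import datetime, time, timedelta

DAY_START = time(9, 0)   # end of a night slot
DAY_END = time(19, 0)    # start of a night slot
FRIDAY = 4               # datetime.weekday(): Monday is 0


def night_slot(t):
    """Return (start, end) of the night or week-end slot containing t, or the next one."""
    # The slot opens at 19:00 on some evening: today if we are past 9:00,
    # otherwise yesterday (we are still in last night's slot).
    opening = t.date() if t.time() >= DAY_START else t.date() - timedelta(days=1)
    # Saturday and Sunday belong to the week-end that opened on Friday.
    while opening.weekday() > FRIDAY:
        opening -= timedelta(days=1)
    start = datetime.combine(opening, DAY_END)
    days = 3 if opening.weekday() == FRIDAY else 1  # the Friday slot ends on Monday
    end = datetime.combine(opening + timedelta(days=days), DAY_START)
    return start, end


def fits_in_slot(t, walltime):
    """True if a job submitted at t with this walltime can fit in that slot."""
    start, end = night_slot(t)
    return max(t, start) + walltime <= end
```

For example, for a submission on Tuesday June 23rd 2020 at 11:41, a 13h30 walltime fits in the following night while a 15h walltime does not; the same 15h job submitted on Friday June 26th fits in the week-end slot, consistent with the 14h limit described above.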

-- Grid'5000 Team 11:41, June 23rd 2020 (CEST)

New ARM cluster "Pyxis" available on Lyon (testing queue)

We have the pleasure to announce that a new cluster named "pyxis" is available in the testing queue in Lyon. It is the first cluster with ARM CPUs (ThunderX2 9980) in Grid'5000! This cluster is composed of 4 nodes, each of them with 2 ThunderX2 9980 CPUs (32 cores/CPU, 4 threads/core), 256 GB RAM, 2 x 250 GB HDD, 10 Gbps Ethernet. Each node's power consumption is monitored with the wattmetre of Lyon (although it is currently broken¹). It is planned to add an Infiniband network to Pyxis.

Pyxis nodes can be reserved in the testing queue with the exotic job type (add "-q testing -t exotic -p cluster='pyxis'" to your OAR submission) and arm64 environments are available to be deployed on this cluster. Beware that as it is a different CPU architecture, compiled programs targeting x86 (such as those provided by module) won't execute. Any feedback is welcome.

This cluster has been funded by the CPER LECO++ Project (FEDER, Région Auvergne-Rhone-Alpes, DRRT, Inria).

¹: https://intranet.grid5000.fr/bugzilla/show_bug.cgi?id=11784

-- Grid'5000 Team 11:20, May 14th 2020 (CEST)

Ubuntu 20.04 image available

A kadeploy image (environment) for Ubuntu 20.04 (ubuntu2004-x64-min) is now available and registered in all sites with Kadeploy, along with other supported environments (Centos 7, Centos 8, Ubuntu 18.04, Debian testing, and various Debian 9 and Debian 10 variants).

This image is built with Kameleon (just like other Grid'5000 environments). The recipe is available in the environments-recipes git repository.

If you need other system images for your work, please let us know.

-- Grid'5000 Team 14:00, April 27th 2020 (CEST)

Singularity containers in Grid'5000

We now offer better support for Singularity containers on Grid'5000. It is available in the standard environment and does not require root privileges.

Just run the "singularity" command to use it. It can also be run in an OAR submission (non-interactive batch job), for instance:

oarsub -l core=1 "/grid5000/code/bin/singularity run library://sylabsed/examples/lolcow"

More information about the Singularity usage in Grid'5000 is available in the Singularity page.

Singularity is a popular container solution for HPC systems. It natively supports GPU and high performance network in containers and is compatible with docker images. More info at: https://sylabs.io/docs/

-- Grid'5000 Team 10:07, April 23rd 2020 (CET)

Important change in the User Management System - Groups which Grant Access to the testbed

An important update to the Grid'5000 User Management System just happened. This update brings a new concept: users now get granted access to the testbed through a group membership.

These Groups which Grant Access (GGAs) allow a dispatch of the management and reporting tasks for the usage of the platform to closer managers than the Grid’5000 site managers.

Every user has to be a member of a GGA to be allowed access to the platform. The memberships are currently being worked on by the staff, the site managers and the new GGA managers.

You may receive emails about moves regarding your account: Don't worry. The transition to GGA should not impact your use of the platform and experimentations.

As a reminder, if you encounter problems or have questions, please report them either on the users mailing list or to the support staff, as described in the Support page. More information about this change is available in the User Management Service documentation page.

-- Grid'5000 Team 11:07, April 22nd 2020 (CET)

Using GNU Parallel in Grid'5000

The GNU Parallel tool, which is very relevant to complement OAR for many-tasks workloads following for instance the SPMD / embarrassingly parallel scheme, now benefits from a documentation page for its usage in Grid'5000.

GNU Parallel can be used within an OAR job, as the launcher for the many tasks. Thanks to oarsh, it can also dispatch many single-GPU tasks over many GPUs.
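As an illustration, the sketch below turns an OAR node file into GNU Parallel sshlogin entries of the form <cores>/<host>. It assumes that the node file (pointed to by $OAR_NODE_FILE, as in the kadeploy example earlier on this page) lists one hostname per reserved core; the helper names are hypothetical, and the exact GNU Parallel invocation with oarsh is described on the GNU Parallel page.

```python
import os
from collections import Counter


def sshlogins(nodefile_lines):
    """Map node file lines (one hostname per reserved core) to "<cores>/<host>" entries."""
    counts = Counter(line.strip() for line in nodefile_lines if line.strip())
    # Counter keeps first-seen order, so hosts appear as in the node file.
    return [f"{cores}/{host}" for host, cores in counts.items()]


def sshlogins_from_env(var="OAR_NODE_FILE"):
    """Read the node file named by an OAR environment variable inside a job."""
    with open(os.environ[var]) as f:
        return sshlogins(f)
```

The entries could be written to a file, one per line, and passed to GNU Parallel's --sshloginfile option so that each host runs as many tasks at once as it has reserved cores.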

See the documentation page: GNU Parallel.

-- Grid'5000 Team 16:25, April 17th 2020 (CET)

Major update of BIOS and other firmware and future strategy

In recent months, we have performed a campaign of firmware updates (BIOS, Network Interface Cards, RAID adapters…) on the nodes of most Grid'5000 clusters.

Those updates improved the overall reliability of our deployment process, but they also included mitigations for security issues such as Spectre/Meltdown.

It was also an opportunity to align clusters with similar hardware on the same firmware versions.

Unfortunately, we understand that those changes may have an impact on your experiments (particularly in terms of performance). This is a difficult issue where there is no good solution, as it is often hard or impossible to downgrade BIOS versions.

However, those firmware versions are included in the reference API. We recommend that you use this information to track down changes that could affect your experiment.

For instance, in https://api.grid5000.fr/stable/sites/nancy/clusters/gros/nodes/gros-1.json?pretty=1 , see bios.version and firmware_version. You can also browse previous versions using the API ¹, or using the Github commit history ²
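A minimal Python sketch for tracking such changes, assuming the node description has already been downloaded as JSON (for example from the URL above): it collects bios.version and every firmware_version key, wherever it appears in the document, and compares two snapshots. The helper names are hypothetical, and no assumption is made about where firmware_version keys appear beyond their name.

```python
import json


def load_node(path):
    """Load a node description saved from the reference API (e.g. gros-1.json)."""
    with open(path) as f:
        return json.load(f)


def firmware_versions(node):
    """Collect bios.version and every firmware_version found in a node description."""
    found = {}
    if "version" in node.get("bios", {}):
        found["bios.version"] = node["bios"]["version"]

    def walk(obj, path):
        # Recurse through nested dicts and lists, recording dotted key paths.
        if isinstance(obj, dict):
            for key, value in obj.items():
                sub = f"{path}.{key}" if path else key
                if key == "firmware_version":
                    found[sub] = value
                else:
                    walk(value, sub)
        elif isinstance(obj, list):
            for i, item in enumerate(obj):
                walk(item, f"{path}[{i}]")

    walk(node, "")
    return found


def diff_versions(old, new):
    """Return {key: (old, new)} for each version that differs between two snapshots."""
    return {k: (old.get(k), new.get(k))
            for k in sorted(old.keys() | new.keys())
            if old.get(k) != new.get(k)}
```

Running diff_versions on the results for two saved versions of the same node gives a quick list of the firmware changes that may explain a difference in measurements.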

We will continue to update such firmware in the future, about twice a year, keeping similar hardware in sync, and documenting the versions in the reference API.

¹: https://api.grid5000.fr/doc/3.0/reference/spec.html#get-3-0-item-uri-versions

²: https://github.com/grid5000/reference-repository/commits/master

-- Grid'5000 Team 16:15, March 27th 2020 (CET)

Support for persistent memory (PMEM)

Grid'5000 now features, among the different technologies it provides, some nodes with persistent memory.

Please find an introduction and some documentation on how to experiment on the persistent memory technology in the PMEM page.

-- Grid'5000 Team 17:35, February 19th 2020 (CET)

New cluster "troll" available in Grenoble

We have the pleasure to announce that a new cluster called "troll" is available in Grenoble¹.

It features 4 Dell R640 nodes with 2 Intel® Xeon® Gold 5218, 16 cores/CPU, 384GB DDR4, 1.5 TB PMEM (Intel® Optane™ DC Persistent Memory)²³, 1.6 TB NVME SSD, 10Gbps Ethernet, and 100Gb Omni-Path.

Energy monitoring⁴ is available for this cluster, provided by the same devices used for the other clusters in Grenoble.

This cluster has been funded by the PERM@RAM project from Laboratoire d'Informatique de Grenoble (CNRS/INS2I grant).

¹: https://www.grid5000.fr/w/Grenoble:Hardware

²: https://software.intel.com/en-us/articles/quick-start-guide-configure-intel-optane-dc-persistent-memory-on-linux

³: https://docs.pmem.io/persistent-memory/

⁴: https://www.grid5000.fr/w/Energy_consumption_monitoring_tutorial

-- Grid'5000 Team 17:00, February 3rd 2020 (CET)

New cluster available in Nancy: grue (20 GPUs)

We have the pleasure to announce that the Grue cluster in Nancy¹ (production queue) is now available:

It features 5 Dell R7425 server nodes with four Tesla T4² GPUs, 128 GB DDR4, 1x480 GB SSD, 2 x AMD EPYC 7351, 16 cores/CPU. As this cluster features 4 GPUs per node, we remind you that you can monitor GPU (and node) usage with the Ganglia tool (std environment only) by looking at the grue nodes.

If your experiments do not require all the GPUs of a single node, it is possible to reserve resources at the GPU level³ (also see this previous news item for some examples). You can also use the nvidia-smi and htop commands on your reserved nodes to get more information about your GPU and CPU usage.

This cluster has been funded by the CPER LCHN project (Langues, Connaissances & Humanités Numériques, Contrat de plan État / Région Lorraine 2015-2020), and the LARSEN and MULTISPEECH teams at LORIA / Inria Nancy Grand Est.

As a reminder, since this cluster is part of the "production" queue, specific usage rules apply.

¹: https://www.grid5000.fr/w/Hardware

²: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf

³: https://www.grid5000.fr/w/Accelerators_on_Grid5000#Reserving_GPU_units_on_nodes_with_many_GPUs

-- Grid'5000 Team 13:10, January 29th 2020 (CET)

Grid'5000 users survey

We are conducting a survey to help us better understand your needs and make Grid'5000 a better research infrastructure.

We thank you in advance for taking a few minutes to complete it (you can answer in French if you prefer).

The survey is available at: https://sondages.inria.fr/index.php/672895

It will be open until December 13th.

-- Grid'5000 Team 15:00, November 26th 2019 (CET)

New cluster "gemini" available at Lyon

We have the pleasure to announce the availability of the new cluster "gemini" in Lyon.

Gemini includes two "Nvidia DGX-1" nodes, each with 8 Nvidia V100 GPUs, 2 Intel Xeon E5-2698 v4 @ 2.20GHz CPUs, 512GB DDR4, Infiniband EDR and 10Gbps Ethernet interfaces and 4 reservable¹ SSD disks. Energy monitoring is also available for this cluster, provided by the same devices used for the other clusters in Lyon².

Remember that if you don't need all 8 GPUs, individual GPUs may be reserved³. A script to install nvidia-docker is also available if you want to use Nvidia's images built for Docker⁴.

This cluster has been funded by the CPER LECO++ Project (FEDER, Région Auvergne-Rhone-Alpes, DRRT, Inria).

¹: https://www.grid5000.fr/w/Disk_reservation

²: https://www.grid5000.fr/w/Energy_consumption_monitoring_tutorial

³: https://www.grid5000.fr/w/Accelerators_on_Grid5000#Reserving_GPU_units_on_nodes_with_many_GPUs

⁴: https://www.grid5000.fr/w/Docker#Nvidia-docker

-- Grid'5000 Team 15:00, November 12th 2019 (CET)

Group Storage service officially available

We are glad to announce that the Group Storage service is no longer in beta. In addition to the two servers in Lille, 5 storage servers in Lyon, Rennes, Nancy, Luxembourg and Sophia now provide Group Storage volumes on demand.

See https://www.grid5000.fr/w/Group_Storage for details and https://www.grid5000.fr/w/Storage for an overview of our storage options.

-- Grid'5000 Team 11:00, November 12th 2019 (CET)

New cluster available in Nancy: gros (124 nodes with reservable disks)

We have the pleasure to announce that the Gros cluster in Nancy (default queue) is now available:

It features 124 Dell R640 server nodes with one Intel Gold 5220 (18 cores / 36 threads), 96 GB DDR4, 1x480 GB SSD + 1x960 GB reservable SSD [1] and 2x25Gbps Ethernet interfaces.

However, the installation of this cluster is not yet fully completed:

  • Two nodes (gros-3 and gros-4) are artificially kept in the Dead state because of a KaVLAN issue with the new model of network switches used in the Gros network, which we still have to fix.
  • This cluster will be equipped with a high precision monitoring system for electrical power consumption (Multilab-R, from Omegawatt [2]) that is already used in Grid'5000 at Lyon [3] or Grenoble [4]. It will be available for general use by the beginning of December.
  • At present, the uplink between the Gros cluster and the Grid'5000 network operates at 10 Gbps. It will be upgraded to 2x40 Gbps next month.

This cluster has been funded by the CPER Cyber Entreprises project (Contrat de plan État / Région Lorraine 2015-2020).

Again, many thanks go to the former SIC and SG services at LORIA / Inria Nancy Grand Est for their help and support during this operation.

About the name: Gros is a French adjective which usually means large, big or fat, but when used as a noun in Lorraine (the region Nancy belongs to), it means something else. It is generally used to talk about people one cares about (friends, children, companions). Some examples:

  • Comment que c'est gros ? (how are you doing, buddy?). This is the most famous one, and we are pleased to shout it out loud when we enter gros's server room. It seems to be spreading outside Lorraine as well [5].
  • Les gros sont à l'école (children are at school)
  • Wha gros c'est schteuff ! (Listen my friend, it's amazing !)


[1] https://www.grid5000.fr/w/Disk_reservation

[2] http://www.omegawatt.fr/

[3] https://www.grid5000.fr/w/Lyon:Wattmetre

[4] https://www.grid5000.fr/w/Grenoble:Wattmetre

[5] https://france3-regions.francetvinfo.fr/grand-est/avis-aux-lorrains-non-gros-n-est-pas-expression-francilienne-vient-bien-du-parler-lorrain-1688758.html

-- Grid'5000 Team 12:00, November 8th 2019 (CET)

CentOS 8 now available for kadeployment

We are pleased to announce that CentOS 8 (centos8-x64-min) is now available and registered in all sites with Kadeploy, along with CentOS 7 (centos7-x64-min), Ubuntu 18.04 (ubuntu1804-x64-min), Debian testing (debiantesting-x64-min) and the Debian 9 and 10 variants.

Feedback from users is very welcome (issues, missing features, etc.).

The CentOS 8 environment is built with Kameleon (just like other Grid'5000 environments).

The recipe is available in the master branch of the environments-recipes git repository.

Please note that Red Hat dropped support for some old SAS adapters (RAID cards) in CentOS 8. As a result, the following Grid'5000 clusters cannot boot CentOS 8:

  • Graphite (Nancy)
  • Paranoia (Rennes)
  • Suno (Sophia)
  • Sagittaire (Lyon)
  • Granduc (Luxembourg)

A workaround may be provided in the future.

If you need other system images for your work, please let us know.

-- Grid'5000 Team 11:00, October 28th 2019 (CET)

End of support for Debian 8 "jessie" environments

Since Debian 10 ("buster") environments were released recently, we will drop support for the Debian 8 ("jessie") environments on October 31st 2019.

The last version of the jessie environments, i.e. 2019081315, will remain available in /grid5000. You can still access older versions of jessie environments under the archive directory (see /grid5000/README.unmaintained-envs for more information).

-- Grid'5000 Team 10:30, October 16th 2019 (CET)

Default environments under Debian 10

As announced, since September 26th 2019, the default environment and the frontend servers are now under Debian 10.

-- Grid'5000 Team 10:20, October 16th 2019 (CET)

Default environments and frontends to be updated to Debian 10 "Buster"

We are planning to upgrade the nodes' default environment (the "-std" variant, provided when deployment is not used) to Debian 10 "Buster". The switchover is scheduled for September 26th.

In addition, frontends will also be upgraded on October 3rd (the Luxembourg frontend has already been upgraded).

Remember that this update may require you to modify your scripts (of course, you will still be able to deploy Debian 9 environments).

-- Grid'5000 Team 12:00, 20 September 2019 (CET)

Debian 10 "buster" environments are now available for deployments

Kadeploy images (from min to big) are now available for Debian 10.

New features and changes in Debian 10 are described in: https://www.debian.org/releases/buster/amd64/release-notes/ch-whats-new.en.html

In particular, it includes many software updates:

  • openjdk 11.0.4
  • Python 3.7.3
    • python-numpy 1.16.2
    • python-scipy 1.2.2
    • python-pandas 0.23.3
  • OpenMPI 3.1.3
  • GCC 7.4 and 8.3
  • G++ 8.3.0
  • Libboost 1.67.0.2
  • Ruby 2.5.1
  • Cuda 10.1.168 (418.67)
  • CMake 3.13.4
  • GFortran 8.3.0
  • Liblapack 3.8.0
  • libatlas 0.6.4
  • RDMA 24.0

Known regressions and problems are :

Let us know if you would like us to support tools or software that are not available in the big and std environments.

As a reminder, this is how you can use an alternate environment (https://www.grid5000.fr/w/Getting_Started#Deploying_nodes_with_Kadeploy):

 $ oarsub -t deploy -I
 $ kadeploy -e debian10-x64-big -f $OAR_NODE_FILE -k

Please report any problem you encounter with the above environments to support-staff@lists.grid5000.fr.

-- Grid'5000 Team 17:00, 30 August 2019 (CET)

New cluster available in Nancy: graffiti (52 GPUs)

We have the pleasure to announce that the Graffiti cluster in Nancy (production queue) is now fully operational:

It features 13 Dell T640 server nodes with 2 Intel Silver 4110 (8 cores / 16 threads), 128 GB DDR4, 1x480 GB SSD, 10Gbps Ethernet, and 4 Nvidia RTX 2080 Ti GPUs per node.

As this cluster features 4 GPUs per node, we remind you that you can monitor GPU (and node) usage with the Ganglia tool (std environment only).

You can also use the nvidia-smi and htop commands on your reserved nodes to get more information about your GPU/CPU usage.

If your experiments do not require all the GPUs of a single node, it is possible to reserve resources at the GPU level (see https://grid5000.fr/w/News#Enabling_GPU_level_resource_reservation_in_OAR for some examples).

Finally, if you know how to get the most out of GPUs with widely used software (Tensorflow, NAMD, ...) and would like to share your knowledge, we will be happy to turn it into Grid'5000 tutorials.

This cluster has been funded by the CPER LCHN project (Langues, Connaissances & Humanités Numériques, Contrat de plan État / Région Lorraine 2015-2020), and the LARSEN and MULTISPEECH teams at LORIA / Inria Nancy Grand Est.

Special thanks go to:

  • Marc Vesin (Inria Sophia) and Marianne Lombard (Inria Saclay) for their contributions and sharing of experiences on the café cluster mailing list which have been decisive for the development of this cluster
  • The SIC and SG services at LORIA / Inria Nancy Grand Est for their help and support during this operation

As a reminder, since this cluster is part of the "production" queue, specific usage rules apply.

-- Grid'5000 Team 17:40, 27 August 2019 (CET)

A new version of tgz-g5k has been released

We have released a new version of tgz-g5k. tgz-g5k is a tool that allows you to extract a Grid'5000 environment tarball from a running node. The tarball can then be used by kadeploy to re-deploy the image on different nodes/reservations (see Advanced Kadeploy for more details).

The new version has two major improvements:

  1. tgz-g5k is now compatible with Ubuntu and CentOS
  2. tgz-g5k is directly usable on frontends (you do not need to use it through ssh anymore).

To use tgz-g5k from a frontend, you can execute the following command:

  frontend$ tgz-g5k -m MY_NODE -f ~/MY_TARBALL.tgz

In case of specific or non-deployed environments:

  • tgz-g5k can use a specific user id to access nodes, with the -u parameter (by default, tgz-g5k accesses nodes as root).
  • tgz-g5k can access nodes with oarsh/oarcp instead of ssh/scp, with the -o parameter (by default, tgz-g5k uses ssh/scp).

Note that tgz-g5k is still compatible with the previous command line. For the record, the previous command was:

  frontend$ ssh root@MY_NODE tgz-g5k > ~/MY_TARBALL.tgz
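If you script environment extraction, a thin Python wrapper can assemble the tgz-g5k command line from the options described above. This is a sketch under stated assumptions. The function names and the node names in the examples are hypothetical. We also assume that -o is a plain switch that takes no value, while -u takes the user id.

```python
import os
import subprocess
from typing import Optional


def tgz_g5k_command(node: str, tarball: str, user: Optional[str] = None,
                    use_oarsh: bool = False) -> list[str]:
    """Build a tgz-g5k command line (to be run from a frontend)."""
    # -m: node to extract the environment from; -f: output tarball.
    # "~" is expanded here because no shell is involved.
    cmd = ["tgz-g5k", "-m", node, "-f", os.path.expanduser(tarball)]
    if user is not None:
        cmd += ["-u", user]  # access the node as this user instead of root
    if use_oarsh:
        cmd.append("-o")     # assumed switch: oarsh/oarcp instead of ssh/scp
    return cmd


def extract_environment(node: str, tarball: str, **options) -> None:
    """Run tgz-g5k; raises subprocess.CalledProcessError if it fails."""
    subprocess.run(tgz_g5k_command(node, tarball, **options), check=True)
```

For example, extract_environment("MY_NODE", "~/MY_TARBALL.tgz") runs the same extraction as the frontend command shown earlier.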

-- Grid'5000 Team 15:00, 07 August 2019 (CET)

Enabling GPU level resource reservation in OAR

GPU-level resource reservation is now available in OAR on Grid'5000. OAR allows one to reserve one (or some) of the GPUs of a server hosting many GPUs, leaving the other GPUs of the server available for other jobs.

Only the reserved GPUs will be available for computing in the job, as reported by the nvidia-smi command for instance.

  • To reserve one GPU in a site, one can now run: $ oarsub -l gpu=1 ...
  • To reserve 2 GPUs on a host which possibly has more than 2 GPUs, one can run: $ oarsub -l host=1/gpu=2 ...
  • To reserve whole nodes (servers) with all GPUs, one can still run: $ oarsub -l host=3 -p "gpu_count > 0"
  • To reserve specific GPU models, one can use the "gpu_model" property in a filter: $ oarsub -l gpu=1 -p "gpu_model = 'Tesla V100'"
  • One can also filter on the cluster name after looking at the hardware pages for the description of the clusters: $ oarsub -l gpu=1 -p "cluster = 'chifflet'"
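You can also generate the reservation patterns above programmatically, for instance when sweeping over GPU counts or models. The Python sketch below only builds the argument list, and its function name is hypothetical. Append any other oarsub options (walltime, interactive mode, a script to run, ...) before passing the list to subprocess.run.

```python
from typing import Optional


def oarsub_gpu_args(gpus: int, hosts: Optional[int] = None,
                    gpu_model: Optional[str] = None,
                    cluster: Optional[str] = None) -> list[str]:
    """Build oarsub arguments for a GPU-level reservation."""
    # gpu=N: N GPUs, possibly spread over several hosts;
    # host=H/gpu=N: N GPUs on each of H hosts.
    resources = f"gpu={gpus}" if hosts is None else f"host={hosts}/gpu={gpus}"
    args = ["oarsub", "-l", resources]
    filters = []
    if gpu_model is not None:
        filters.append(f"gpu_model = '{gpu_model}'")
    if cluster is not None:
        filters.append(f"cluster = '{cluster}'")
    if filters:
        # -p takes an SQL-like expression; combine the filters with "and".
        args += ["-p", " and ".join(filters)]
    return args
```

For example, oarsub_gpu_args(2, hosts=1, gpu_model="Tesla V100") yields the arguments for two V100 GPUs on a single host.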

Finally, please note that the Drawgantt charts offer options to display GPUs.

-- Grid'5000 Team 17:00, 10 July 2019 (CET)

Network-level federation available between Grid'5000 and Fed4FIRE (and beyond)

It is now possible to connect Grid'5000 resources to other testbeds from Fed4FIRE. This is implemented by on-demand "stitching" between KaVLAN VLANs and VLANs provided by RENATER and GEANT that connect us to a Software Defined Exchange (SDX) hosted by IMEC in Belgium. This SDX is also connected to other SDXes in the US, which should make it possible to combine resources from Grid'5000 and US testbeds such as GENI, Chameleon or CloudLab.

For more information, see https://www.grid5000.fr/w/Fed4FIRE_VLAN_Stitching

-- Grid'5000 Team 15:00, 9 July 2019 (CET)

Grid'5000 links to the Internet upgraded to 10Gbps

Thanks to Renater which provides Grid'5000 network connectivity, we just upgraded our connection to the Internet to 10Gbps.

You should experience faster downloads and an increased speed when loading your data on the platform!

-- Grid'5000 Team 08:00, 9 July 2019 (CET)

TILECS Workshop -- all presentations available

The TILECS workshop (Towards an Infrastructure for Large-Scale Experimental Computer Science) was held last week in Grenoble. About 90 participants gathered to discuss the future of experimental infrastructures for Computer Science.

The slides for all presentations are now available on the workshop website: https://www.silecs.net/tilecs-2019/programme/

-- Grid'5000 Team 10:00, 8 July 2019 (CET)

Several engineer and post-doc positions to work on Grid'5000 and Fed4FIRE

We are hiring! Several positions are open.

Two engineer positions:

These two positions can be located in Grenoble, Lille, Lyon, Nancy, Nantes, Rennes or Sophia Antipolis. Both freshly graduated and experienced engineers are welcome (salary will depend on profile).

We are also hiring a post-doc in Nancy, in the context of Grid'5000 and the Fed4FIRE testbeds federation, on experimentation and reproducible research (see job offer)

To apply to any of those positions, use the Inria website or contact lucas.nussbaum@loria.fr directly with motivation letter and curriculum vitae.

-- Grid'5000 Team 10:00, 28 June 2019 (CET)

OAR properties upgrade for jobs using GPUs

In order to allow GPU level resource reservations in OAR, we have to adapt some OAR properties.

Currently, there are two GPU-related properties that one can use in OAR: gpu and gpu_count (see: https://www.grid5000.fr/w/OAR_Properties#gpu ).

They can for instance allow one to reserve 3 hosts equipped with 2 GPUs GTX 1080 Ti: $ oarsub -p "gpu='GTX 1080 Ti' and gpu_count='2'" -l host=3 ...

The first required change is to rename the 'gpu' property to 'gpu_model'. The new oarsub command will hence become: $ oarsub -p "gpu_model='GTX 1080 Ti' and gpu_count='2'" -l host=3 ...

(the gpu property will later be repurposed to give the identifier of the GPU resource instead of the model)

Please make sure to update your scripts according to this change.
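For scripts containing many filters, you can automate the rename. The Python sketch below (function name hypothetical) replaces the standalone property name gpu with gpu_model, but only outside quoted SQL literals. As a result, gpu_count and values such as 'gpu' are left untouched. Because the gpu property will later be repurposed, apply this only to filters written before this change.

```python
import re

# SQL single-quoted literals, including doubled quotes ('') inside them.
_QUOTED = re.compile(r"('(?:[^']|'')*')")
# "gpu" as a whole word: \b does not match between "gpu" and "_",
# so gpu_count and gpu_model are not affected.
_OLD_NAME = re.compile(r"\bgpu\b")


def migrate_gpu_filter(expr: str) -> str:
    """Rename the old 'gpu' property to 'gpu_model' in an oarsub -p filter."""
    parts = _QUOTED.split(expr)  # quoted literals land at odd indices
    for i in range(0, len(parts), 2):  # even indices: text outside quotes
        parts[i] = _OLD_NAME.sub("gpu_model", parts[i])
    return "".join(parts)
```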

A maintenance operation will take place on 2019-06-13 between 10:00 and 12:00 to make this change. During this time, all of Grid'5000 will be unavailable for reservation.

In a second stage, during a future maintenance, we will add new OAR properties allowing GPU-level resource reservations.

-- Grid'5000 Team 10:00, 12 June 2019 (CET)

TILECS workshop -- Towards an Infrastructure for Large-Scale Experimental Computer Science

Grid'5000 has joined forces with the FIT community to build SILECS, an unprecedented tool to address challenges from today's digital sciences.

The TILECS workshop is an event aiming at getting a better understanding of (1) experimental needs in the targeted fields; (2) existing testbeds. The end goal is to define a roadmap and priorities for the SILECS project.

See the full announcement on the workshop website.

-- Grid'5000 Team 10:00, 27 May 2019 (CET)

More than 2000 documents in Grid'5000's HAL collection!

Grid'5000 uses a collection in the HAL Open Archive to track publications that benefited from the testbed. This collection now includes more than 2000 documents, including 1396 publications (journal and conference articles, book chapters, etc.), 268 PhD theses, 44 HDRs, and 302 unpublished documents such as research reports!

-- Grid'5000 Team 10:00, 13 May 2019 (CET)

Availability of scientific and HPC software using "module" command

Additional scientific and HPC-related software is now available on Grid'5000 nodes and frontends through the "module" command. For instance, the newest versions of GCC, Cuda and OpenMPI are available this way.

To get the list of available software, use:

 module av

To load a specific software into your environment, use:

 module load <software_name>

Please note that the module command modifies some of your environment variables, such as $PATH (internally, we use Spack to build those modules).

Please also note that some software requires access to a licence server.

More information at : https://www.grid5000.fr/w/Software_using_modules

-- Grid'5000 Team 17:00, 6 May 2019 (CET)

Pierre Neyron awarded Médaille de Cristal du CNRS

Pierre Neyron, who has been a member of the Grid'5000 technical staff since the early years of the project, has been awarded the "Médaille de Cristal du CNRS".

Quoting http://www.cnrs.fr/fr/talents/cnrs?medal=42 :

The Crystal Medal honors engineers, technicians and administrative staff who, through their creativity, technical skill and sense of innovation, contribute alongside researchers to the advancement of knowledge and to the excellence of French research.

Congratulations Pierre!

-- Grid'5000 Team 13:00, 16 April 2019 (CET)

Disk reservation feature update: more clusters and usable from the standard environment (sudo-g5k)

The disk reservation feature, which allows one to reserve local disks on nodes, is now available on more clusters:

See the Disk reservation tutorial to learn how to benefit from this feature.

Also please note that reserved disks are now also exploitable using sudo-g5k in the standard environment (does not require a deploy job).

-- Grid'5000 Team 13:00, 20 March 2019 (CET)

New group storage service in beta testing, to replace storage5k

To provide large, persistent storage that can be shared among a group of users, we are introducing a new storage service. This service is now available for beta testing in Lille: if you need to store large amounts of data in Grid'5000, we recommend that you try it!

See https://www.grid5000.fr/w/Group_Storage for details.

The service will be extended to some other sites after the beta testing phase.

This service aims at replacing storage5k, and existing storage5k services will be shut down. On Nancy and Luxembourg, the storage5k service will be retired on 2019-03-19, to free space in the server room for new machines. Other storage5k servers (on other sites) will be shut down after the end of the beta phase of the new service.

If you are currently using Storage5k, your best options are:

  • move data to your home directory (after requesting a disk quota extension if needed)
  • move data to OSIRIM
  • move data to the group storage service in Lille

See https://www.grid5000.fr/w/Storage for an overview of our storage options. (Note that we currently have two open issues with quota extensions and OSIRIM. If you submit a quota extension but do not receive a reply, or if you cannot access OSIRIM, please contact support-staff@lists.grid5000.fr)

-- Grid'5000 Team 15:00, 14 March 2019 (CET)

Support of Jumbo frames now available

We are pleased to announce that all network equipment of Grid'5000 is now configured to support Jumbo frames (that is, large Ethernet frames). We support an MTU (Maximum Transmission Unit) of 9000 bytes everywhere (including between Grid'5000 sites).

By default the reference and standard environments are still configured with a default MTU of 1500, but you can change the configuration (ip link set dev <device> mtu 9000) if needed. The same MTU value works from inside KaVLAN networks.
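The Python snippet below gives a back-of-the-envelope estimate of what a 9000-byte MTU buys for bulk transfers. It computes the share of bytes on the wire that carry TCP payload. The estimate assumes IPv4 and TCP headers without options (40 bytes). It counts the Ethernet header, FCS, preamble and inter-frame gap (38 bytes) as per-frame overhead. Actual gains also depend on the NICs, the CPU and the application.

```python
import math

# Per-frame costs in bytes (assumptions stated above).
IP_TCP_HEADERS = 20 + 20                 # IPv4 + TCP, no options
ETHERNET_WIRE_OVERHEAD = 14 + 4 + 8 + 12  # header + FCS + preamble/SFD + gap


def tcp_payload(mtu: int) -> int:
    """Application bytes carried by one full-sized frame."""
    return mtu - IP_TCP_HEADERS


def wire_efficiency(mtu: int) -> float:
    """Fraction of the bytes sent on the wire that are application payload."""
    return tcp_payload(mtu) / (mtu + ETHERNET_WIRE_OVERHEAD)


def frames_needed(nbytes: int, mtu: int) -> int:
    """Number of full-sized frames needed to carry nbytes of payload."""
    return math.ceil(nbytes / tcp_payload(mtu))


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {wire_efficiency(mtu):.1%} efficiency, "
          f"{frames_needed(2**30, mtu)} frames per GiB")
```

With these assumptions, the larger MTU mostly reduces the number of frames, and thus the per-packet processing, by a factor of about six. The raw bandwidth gain is only a few percent.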

-- Grid'5000 Team 16:10, 06 March 2019 (CET)

Update about Kwapi status

For some time now, there have been several issues with Kwapi monitoring of energy consumption and network traffic on Grid'5000.

After investigating, we took the following actions to fix the problems:

  • Network monitoring has been disabled in Kwapi (servers were overloaded)
  • Code has been optimized and many bugs have been fixed.
  • Reliability problems with measurements made on some Power Distribution Units have been identified, and Kwapi has been disabled on clusters where these problems are too significant.
  • The time resolution of some PDUs has been updated in the reference API

Some details about PDU measurements issues can be found here: https://www.grid5000.fr/mediawiki/index.php/Power_Monitoring_Devices#measurement_artifacts_and_pitfalls

The current status of Kwapi energy monitoring can be checked here: https://intranet.grid5000.fr/jenkins-status/?job=test_kwapi . Kwapi can be considered functional on every cluster marked "green".

Clusters where dedicated monitoring devices are available (Lyon and Grenoble) are still fully functional.

-- Grid'5000 Team 10:30, 29 January 2019 (CET)

Grid'5000 is now part of the Fed4FIRE testbeds federation

In the context of the Fed4FIRE+ H2020 project, Grid'5000 joined the Fed4FIRE federation and is now listed as one of its testbeds.

There is still ongoing work in order to allow the use of Grid'5000 resources using Fed4FIRE API and tools.

For more information, see:

-- Grid'5000 Team 10:30, 10 January 2019 (CET)

New clusters available in Grenoble

We have the pleasure to announce that the 2 new clusters in Grenoble are now fully operational:

  • dahu: 32x Dell PE C6420, 2 x Intel Xeon Gold 6130 (Skylake, 2.10GHz, 16 cores), 192 GiB RAM, 240+480 GB SSD + 4.0 TB HDD, 10 Gbps Ethernet + 100 Gbps Omni-Path
  • yeti: 4x Dell PE R940, 4 x Intel Xeon Gold 6130 (Skylake, 2.10GHz, 16 cores), 768 GiB RAM, 480 GB SSD + 2x 1.6 TB NVME SSD + 3x 2.0 TB HDD, 10 Gbps Ethernet + 100 Gbps Omni-Path

These nodes share an Omni-Path network with 40 additional "dahu" nodes operated by GRICAD (Mésocentre HPC of Univ. Grenoble-Alpes), and are equipped with high-frequency power monitoring devices (the same wattmeters as in Lyon). This equipment was mainly funded by the LECO CPER (FEDER, Région Auvergne-Rhone-Alpes, DRRT, Inria), and the COMUE Univ. Grenoble-Alpes.

-- Grid'5000 Team 15:00, 20 December 2018 (CET)

New environments available for testing: CentOS 7, Ubuntu 18.04, Debian testing

Three new environments are now available, and registered on all sites with Kadeploy: CentOS 7 (centos7-x64-min), Ubuntu 18.04 (ubuntu1804-x64-min), Debian testing (debiantesting-x64-min).

They are in a beta state: at this point, we welcome feedback from users about those environments (issues, missing features, etc.). We also welcome feedback on other environments that would be useful for your experiments, and that are currently missing from the set of environments provided on Grid'5000.

Those environments are built with Kameleon <http://kameleon.imag.fr/>. Recipes are available in the 'newrecipes' branch of the environments-recipes git repository: https://github.com/grid5000/environments-recipes/tree/newrecipes

Our testing indicates that those environments work fine on all clusters, except in those cases:

  • Debian testing on granduc (luxembourg) and sagittaire (lyon): deployment fails due to lack of entropy after boot (related to the fix for CVE-2018-1108)
  • Debian testing on chifflot (lille): deployment fails because the predictable naming of network interfaces changed

-- Grid'5000 Team 15:28, 19 December 2018 (CET)

New clusters available in Lille

We have the pleasure to announce that 2 new clusters are available in Lille:

  • chifflot : 8 Dell PE R740 nodes with 2 x Intel Xeon Gold 6126 12C/24T, 192GB DDR4, 2 x 447 GB SSD + 4 x 3.639 TB SAS including
    • chifflot-[1-6] nodes with 2 Nvidia P100
    • chifflot-[7-8] nodes with 2 Nvidia V100
  • chiclet : 8 Dell PE R7425 nodes with 2 x AMD EPYC 7301 16C/32T, 128GB DDR4

These nodes are connected with 25Gb Ethernet to a Cisco Nexus9000 switch (reference: 93180YC-EX)

The extension of the hardware equipment at the Lille site of Grid'5000 is part of the Data (Advanced data science and technologies) CPER project carried by Inria, with the support of the regional council of Hauts-de-France, FEDER and the State.

-- Grid'5000 Team 16:40, 12 November 2018 (CET)

100 Gbps Omni-Path network now available

Omni-Path networking is now available on Nancy's grele and grimani clusters.

On the software side, support is provided in Grid'5000 environments using packages from the Scibian distribution[0], a Debian-based distribution for high-performance computing started by EDF.

OpenMPI automatically detects and uses Omni-Path when available. To learn more about how to use it, refer to the Run MPI on Grid'5000 tutorial[1].

More new clusters with Omni-Path networks are in the final stages of installation. Stay tuned for updates!

[0] http://www.scibian.org

[1] https://www.grid5000.fr/mediawiki/index.php/Run_MPI_On_Grid%275000

-- Grid'5000 Team 15:10, 25 September 2018 (CET)

Change in user's home access

During this week, we will change the access policy of home directories in order to improve the security and the privacy of data stored in your home directory.

This change should be transparent to most users, as:

  • You will still have access to your local user home from reserved nodes when using the standard environment or when deploying custom environments such as 'nfs' or 'big'.
  • You can still access every user's home from frontends (provided the access permissions allow it).

However, in other situations (giving other users access to your home, mounting your home directory from another site, inside a virtual machine or from a VLAN, ...), you will need to explicitly allow access using the new Storage Manager API. See https://www.grid5000.fr/w/Storage_Manager

Note that this change requires a switch to autofs to mount /home in user environments. This should be transparent in most cases because g5k-postinstall (used by all reference environments) has been modified. However, if you use an old environment, it might require a change either to switch to g5k-postinstall, or to switch to autofs. Contact us if needed.

-- Grid'5000 Team 14:40, 4 September 2018 (CET)

New cluster in Nantes: ecotype

We have the pleasure to announce a new cluster called "ecotype" hosted at the IMT Atlantique campus located in Nantes. It features 48 Dell PowerEdge R630 nodes with 2 Intel Xeon E5-2630L v4, 10C/20T, 128GB DDR4, 372GB SSD and 10Gbps Ethernet.

This cluster has been funded by the CPER SeDuCe (Regional council of the Pays de la Loire, Nantes Metropole, Inria Rennes - Bretagne Atlantique, IMT Atlantique and the French government).

-- Grid'5000 Team 16:30, 29 August 2018 (CET)

New production cluster in Nancy: grvingt

We are happy to announce that a new cluster with 64 compute nodes and 2048 cores is ready in the *production queue* in Nancy !

This is the first Grid'5000 cluster based on the latest generation of Intel CPUs (Skylake).

Each node has:

  • two Intel Xeon Gold 6130 (Skylake, 2.10GHz, 2 CPUs/node, 16 cores/CPU)
  • 192GB of RAM
  • 1TB HDD
  • one 10Gbps Ethernet interface

All nodes are also connected with an Intel Omni-Path 100Gbps network (non-blocking).

This new cluster, named "grvingt"[0], is funded by CPER CyberEntreprises (FEDER, Région Grand Est, DRRT, INRIA, CNRS).

As a reminder the specific rules for the "production" queue are listed on https://www.grid5000.fr/w/Grid5000:UsagePolicy#Rules_for_the_production_queue

[0] Important note regarding pronunciation: despite the Corsican origin, you should pronounce the trailing T as this is how "vingt" is pronounced in Lorraine.

-- Grid'5000 Team 17:30, 25 July 2018 (CET)

Second interface on Lille's clusters

A second 10Gbps interface is now connected on the chetemi and chifflet clusters. Those interfaces are connected to the same switch as the first. KaVLAN is also available.

-- Grid'5000 Team 16:00, 23 July 2018 (CET)

Updated Hardware and Network description

A detailed description of all Grid'5000 resources is now available on the Grid'5000 wiki, and generated on a regular basis from the Reference API, so that it stays up-to-date. (As you probably know, the Grid'5000 reference API provides a full description of Grid'5000 as JSON documents). Check out those pages:

-- Grid'5000 Team 09:00, 2 July 2018 (CET)

Debian 9 environments now use predictable network interfaces names

All our Debian 9 environments have now been modified to use predictable network interfaces names (eno1, ens1, enp2s0, etc. instead of the traditional eth0, eth1, etc.), which is now the default on Debian and most other distributions. You can read more about this standard on https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames .

Because Grid'5000 nodes use different hardware in each cluster, the default network interface name varies, depending on whether it is included on the motherboard or an external NIC plugged in a PCI Express slot.

The mapping between old and new names is available on each site's Hardware page, such as https://www.grid5000.fr/mediawiki/index.php/Nancy:Hardware .

Some useful commands:

  • ip link show up : show network interfaces that are up
  • ip -o link show up : show network interfaces that are up (one-line format)
  • ip addr show up : show network addresses for interfaces that are up
  • ip -4 addr show up : show network addresses for interfaces that are up (IPv4 only)
  • ip -o -4 addr show up : show network addresses for interfaces that are up (IPv4 only, one-line format)
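The one-line format is convenient to parse from scripts, for example to find which interface name carries a node's IPv4 address. The Python sketch below (function names hypothetical) assumes that each line starts with "<index>: <ifname> inet <address/prefix>", as printed by ip -o -4 addr show up.

```python
import subprocess


def parse_ip_o_addr(output: str) -> dict[str, list[str]]:
    """Map interface name -> list of 'address/prefix' from one-line ip output."""
    addrs: dict[str, list[str]] = {}
    for line in output.splitlines():
        fields = line.split()
        # Expected layout: "<index>: <ifname> <inet|inet6> <addr/prefix> ..."
        if len(fields) >= 4 and fields[2] in ("inet", "inet6"):
            addrs.setdefault(fields[1], []).append(fields[3])
    return addrs


def ipv4_addresses() -> dict[str, list[str]]:
    """Run 'ip -o -4 addr show up' on the local node and parse the result."""
    out = subprocess.run(["ip", "-o", "-4", "addr", "show", "up"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ip_o_addr(out)
```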

If you want to stick with the old behaviour, you can either:

  • use older versions of the environments, available in /grid5000
  • change the environment description (kadeploy .dsc file) and add "--net traditional-names" to the call to g5k-postinstall.

Typically, that would mean:

  1. Get the current environment description:
     kaenv3 -p debian9-x64-min > myenv.dsc
  2. Edit myenv.dsc, find the "script:" line, and add "--net traditional-names" to it, for example:
     script: g5k-postinstall --net debian --net traditional-names
  3. Deploy using: kadeploy3 -a myenv.dsc -m mynode
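The editing step above can be scripted. The sketch below assumes that the "script:" line of the description calls g5k-postinstall and ends with its arguments (as in the example above); `add_traditional_names` is a hypothetical helper, and `sed -i` assumes GNU sed, as found on Debian frontends.

```shell
# add_traditional_names (hypothetical helper): append
# "--net traditional-names" to the g5k-postinstall call on the "script:"
# line of a kadeploy environment description, unless it is already there.
# Edits the file in place (GNU sed -i).
add_traditional_names() {
    dsc="$1"
    if grep -q -- '--net traditional-names' "$dsc"; then
        return 0  # already patched, nothing to do
    fi
    # Only a line starting with "script:" (after optional indentation) that
    # calls g5k-postinstall is modified; trailing spaces are replaced by the
    # new option, every other line is left untouched.
    sed -i '/^[[:space:]]*script:.*g5k-postinstall/ s/[[:space:]]*$/ --net traditional-names/' "$dsc"
}

# Usage:
#   kaenv3 -p debian9-x64-min > myenv.dsc
#   add_traditional_names myenv.dsc
#   kadeploy3 -a myenv.dsc -m mynode
```

Checking for the option first makes the helper safe to run twice on the same file.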

-- Grid'5000 Team 10:00, 17 May 2018 (CET)

Spring cleanup: removal of wheezy environments, and of HTTP proxies

We recently removed some old cruft.

First, the Debian 7 "wheezy" environments were removed from the Kadeploy database. They remain available on each frontend under /grid5000 for the time being. As a reminder, older environments are archived and still available: see /grid5000/README.unmaintained-envs for details.

Second, we finally removed the HTTP proxies, which were used in the past to access the outside world from Grid'5000. This removal might cause problems if you use old environments that still rely on HTTP proxies; in that case, simply stop using the proxies in those environments.

An upcoming change is the switch to predictable names for network interfaces in our Debian 9 environments. More information on this will follow when it is deployed.

-- Grid'5000 Team 11:00, 26 April 2018 (CET)

Looking back at the first Grid'5000-FIT school

The first Grid’5000-FIT school was held in Sophia Antipolis, France from April 3rd to April 6th, 2018.

Over three full days, more than 90 researchers (among them 30 PhD students and 8 master's students) jointly studied the two experimental research infrastructures Grid’5000 and FIT.

In addition to high-profile invited speakers and more focused presentations for testbed users, the program included 30 hours of parallel hands-on sessions: introductory and advanced lab sessions on Grid’5000, FIT IoT-Lab, FIT CorteXlab and FIT R2lab were organized by the testbed developers, each attracting dozens of attendees. See http://www.silecs.net/1st-grid5000-fit-school/program/ for more details on the presentations and on the hands-on sessions.

On the last day, a hackathon involving several ad-hoc teams was organized. It aimed at using heterogeneous testbed resources as a substrate for building an IoT application, from the collection of IoT-originating data to a cloud-based dashboard.

This event is the first public milestone towards the SILECS project, which aims at bringing together both infrastructures. See http://www.silecs.net/ for more details on SILECS.

-- Grid'5000 Team 11:00, 12 April 2018 (CET)

1st Grid'5000-FIT school: call for participation

As previously announced, the first Grid'5000-FIT school will be held in Sophia Antipolis, France from April 3rd to April 6th, 2018. This event is a first public milestone in the SILECS project, which aims at bringing together both infrastructures.

The program is now available, and includes a great set of invited talks, of talks from Grid'5000 and FIT users, and tutorials on experimenting using those infrastructures.

Registration is free but mandatory. Please register!

-- Grid'5000 Team 14:00, 9 March 2018 (CET)

Debian 9 environments updated, with Meltdown patches

As you probably know, a series of vulnerabilities has recently been discovered in x86 processors[1,2]. This update of our environments is the first to include a mitigation patch for the Meltdown vulnerability. However, this patch causes a significant performance penalty for syscall-intensive workloads (e.g. workloads performing a lot of I/O). We decided to follow the Linux kernel defaults and enable this patch by default, as it seems that we will all have to live with it for the foreseeable future.

[1] https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)
[2] https://meltdownattack.com/

If you want to avoid this performance penalty for your experiments, you have two options:

  1. Use version 2017122808 of our environments (which does not include the updated kernel).
  2. Disable Kernel Page Table Isolation by adding pti=off to the kernel command line. To do that, after deployment, edit /etc/default/grub, add pti=off to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub, and reboot.
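The grub edit described in the second option can be done non-interactively, which is convenient when deploying many nodes. This is a sketch assuming the value of GRUB_CMDLINE_LINUX_DEFAULT is enclosed in double quotes, as in Debian's default file; `add_pti_off` is a hypothetical helper name, and `sed -i` assumes GNU sed.

```shell
# add_pti_off (hypothetical helper): add "pti=off" to
# GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults file (normally
# /etc/default/grub), unless it is already there. Edits the file in place.
add_pti_off() {
    grub_file="${1:-/etc/default/grub}"
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*pti=off' "$grub_file"; then
        return 0  # already disabled, nothing to do
    fi
    # Insert " pti=off" just before the closing double quote of the value;
    # GRUB_CMDLINE_LINUX (without _DEFAULT) is not touched.
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 pti=off"/' "$grub_file"
}

# As root on the deployed node:
#   add_pti_off /etc/default/grub && update-grub && reboot
```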

Many evaluations of the performance penalty are available online[3,4].

[3] https://www.phoronix.com/scan.php?page=article&item=5distros-post-spectre
[4] https://arxiv.org/abs/1801.04329

Here is an example measured on the grele cluster with 'time dd if=/dev/zero of=/dev/zero bs=1 count=20000000', which is very close to a worst-case workload: with bs=1, every byte copied costs one read and one write system call. The results are for a single run, but they are fairly stable over time.

With environment version 2017122808:

# uname -srv
Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23)
# dpkg -l |grep linux-image
ii  linux-image-4.9.0-4-amd64     4.9.65-3+deb9u1              amd64        Linux 4.9 for 64-bit PCs
ii  linux-image-amd64             4.9+80+deb9u2                amd64        Linux for 64-bit PCs (meta-package)
# time dd if=/dev/zero of=/dev/zero bs=1 count=20000000
20000000 bytes (20 MB, 19 MiB) copied, 4.20149 s, 4.8 MB/s
real        0m4.203s
user        0m1.168s
sys         0m3.032s

With environment 2018020812, PTI enabled (default):

# uname -srv
Linux 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
# dpkg -l |grep linux-image
ii  linux-image-4.9.0-5-amd64     4.9.65-3+deb9u2              amd64        Linux 4.9 for 64-bit PCs
ii  linux-image-amd64             4.9+80+deb9u3                amd64        Linux for 64-bit PCs (meta-package)
# time dd if=/dev/zero of=/dev/zero bs=1 count=20000000
20000000 bytes (20 MB, 19 MiB) copied, 10.8027 s, 1.9 MB/s
real        0m10.804s
user        0m3.948s
sys         0m6.852s

With environment 2018020812, PTI disabled in grub configuration:

# uname -srv
Linux 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.9.0-5-amd64 root=UUID=e803852f-b7a8-454b-a7e8-2b9d9a4da06c ro debian-installer=en_US quiet pti=off
# dpkg -l |grep linux-image
ii  linux-image-4.9.0-5-amd64     4.9.65-3+deb9u2              amd64        Linux 4.9 for 64-bit PCs
ii  linux-image-amd64             4.9+80+deb9u3                amd64        Linux for 64-bit PCs (meta-package)
# time dd if=/dev/zero of=/dev/zero bs=1 count=20000000
20000000 bytes (20 MB, 19 MiB) copied, 4.10788 s, 4.9 MB/s
real        0m4.109s
user        0m1.160s
sys         0m2.948s
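Before comparing timings like the ones above, it is worth checking which configuration a node actually booted with. A minimal sketch, assuming PTI is disabled only through the kernel command line (pti=off, or its x86 alias nopti); `pti_state` is a hypothetical helper name.

```shell
# pti_state (hypothetical helper): print "off" if Kernel Page Table
# Isolation was disabled on the kernel command line (pti=off or nopti),
# "default" otherwise (the kernel then applies its own default).
# Reads /proc/cmdline unless another file is given as first argument.
pti_state() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put one kernel parameter per line, then require an exact match so that
    # a longer parameter merely containing "nopti" is not mistaken for it.
    if tr ' ' '\n' < "$cmdline_file" | grep -qxE 'pti=off|nopti'; then
        echo off
    else
        echo default
    fi
}

# Usage on a node:
#   pti_state
```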

1st Grid'5000-FIT school: save the date, call for presentations and practical sessions

It is our pleasure to announce the first Grid’5000-FIT school, which will be held in Sophia Antipolis, France from April 3rd to April 6th, 2018. This event is a first public milestone in the SILECS project, which aims at bringing together the Grid'5000 and FIT infrastructures.

The program is being polished and will be available soon on http://www.silecs.net/1st-grid5000-fit-school/. The week of work will give priority to practical sessions where you will use the FIT and Grid’5000 platforms and learn how to conduct scientific and rigorous experiments. Moreover, to foster collaboration, we will arrange a special session where you can present your own work (see “call for presentations” below) or propose your own practical session (see “Call for practical session” below).

We would like to stress in this announcement that we hope users will consider presenting their great work during the school, including already published results. The focus of user presentations during the school is not so much on new results but on showcasing great uses of Grid’5000 and/or FIT platforms, singly or together, with a focus on how the results have been derived from the platforms usage.

Registration will be free but mandatory due to the limited number of seats.

Stay tuned for extra details on http://www.silecs.net/1st-grid5000-fit-school/.

-- Grid'5000 Team 10:00, 1 February 2018 (CET)

Disk reservation feature available on chifflet and parasilo

The disk reservation feature is now available on chifflet in Lille and parasilo in Rennes, in addition to grimoire in Nancy, where it was already in beta testing. Disk reservation enables users to leave large datasets on nodes between experiments. See the Disk reservation wiki page.

-- Grid'5000 Team 11:30, 8 January 2018 (CET)

Standard environments upgraded to Debian 9 "stretch"

All Grid'5000 nodes now use Debian 9 "stretch" as standard environment (this is the environment available by default when not deploying).

Known remaining problems at this time are:

  • Lyon, orion cluster: its GPUs are no longer supported by CUDA 9, which is the version now included in Debian 9 environments.
  • Infiniband support: while we think we have resolved everything, we are not 100% sure at this point (it needs further testing).
  • Virtual machines on the standard environment don't work anymore (but they work if you deploy, of course).

Your own scripts or applications might require changes or recompilation to work on Debian 9. If you want to stick with Debian 8 for the time being, you can still manually deploy the jessie-x64-std environment.

-- Grid'5000 Team 14:30, 15 December 2017 (CET)

Debian 9 "stretch" environments now available for deployments

Kadeploy images are now available for Debian 9.

We used this opportunity to change the naming scheme, so they are named debian9-x64-min, debian9-x64-base, debian9-x64-nfs, debian9-x64-big.

Known regressions and problems are:

  • The Xen environment is not available yet
  • Network interfaces use the traditional naming scheme (eth0, eth1, ...), not yet the new predictable scheme
  • Intel Xeon Phi (for the graphite cluster in Nancy) is not supported yet
  • Infiniband network support is currently not tested

We are planning to switch the "std" environment (the one available when not deploying) in the coming weeks. You can beta-test the environment that we plan to switch to (debian9-x64-std).

Please report problems you encounter with the above environments to support-staff@lists.grid5000.fr.

-- Grid'5000 Team 16:30, 29 November 2017 (CET)

Better Docker support on Grid'5000

Docker support on Grid'5000 has been enhanced.

We now document a set of tools and recommendations to ease Docker usage on Grid'5000.

More details are available on the Docker page.

-- Grid'5000 Team 17:30, 28 November 2017 (CET)

5th Network Interface on grisou cluster, using a white box switch

A 5th network interface is now connected on grisou¹, providing a 1 Gbps link, as opposed to the first four interfaces that provide 10 Gbps links.

Those interfaces are connected to a Dell S3048 Open Networking switch. Alternative operating systems are available for this switch. While we did not make specific developments to facilitate their use, please talk with us if you are interested in getting control over the switch in order to experiment on alternative software stacks.

¹ https://www.grid5000.fr/mediawiki/index.php/Nancy:Network#Ethernet_1G_on_grisou_nodes

-- Grid'5000 Team 17:00, 8 November 2017 (CET)

Persistent Virtual Machine for Grid'5000 users

We are happy to announce the availability of persistent virtual machines in Grid'5000: each user can now get one virtual machine that will run permanently (outside of OAR reservations). We envision that this will be helpful to host the stable/persistent part of long-running experiments.

More details are available here: https://www.grid5000.fr/mediawiki/index.php/Persistent_Virtual_Machine

-- Grid'5000 Team 10:00, 7 November 2017 (CET)

Grid'5000 Usage Policy updated

The Grid'5000 Usage Policy has been updated.

In a nutshell, all usage that was previously allowed is still allowed after this update.

However, there are several changes to help increase usage of resources that are immediately available.

  • Jobs lasting one hour or less, and submitted less than 10 minutes before they start, are excluded from daily quotas.
    • This means that one can always reserve resources for up to one hour when they are immediately available.
  • Similarly, job extensions requested less than 10 minutes before the end of the job, and for a duration of one hour or less, are also excluded from daily quotas. Those extensions can be renewed several times (always during the last 10 minutes of the job).
    • This means that, when resources are still available, one can always extend jobs for up to one hour.
  • Crossing the 19:00 boundary is allowed for jobs submitted at or after 17:00 the same day. The portion of those jobs from 17:00 to 19:00 is excluded from daily quotas.
    • This means that if at 17:00 or later on a given day, resources are not reserved for the following night, then it is possible to reserve them and start the night job earlier.
  • Crossing the 9:00 boundary is allowed for jobs submitted on the same day, but the portion of those jobs after 9:00 is still included in the daily quota.
    • This means that when resources are free in the morning, people are free to start working earlier.
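To make the first rule concrete, here is a small sketch of how its two conditions combine. It only illustrates the wording of the policy: OAR and the Grid'5000 tooling apply the actual rules, and `short_job_exempt` is a hypothetical helper taking both values in minutes.

```shell
# short_job_exempt (hypothetical helper): sketch of the first rule above.
# $1: job duration in minutes
# $2: delay between job submission and job start, in minutes
# Prints "exempt" when the job is excluded from daily quotas (duration of
# one hour or less AND submitted less than 10 minutes before it starts),
# "counted" otherwise.
short_job_exempt() {
    duration_min="$1"
    lead_min="$2"
    if [ "$duration_min" -le 60 ] && [ "$lead_min" -lt 10 ]; then
        echo exempt
    else
        echo counted
    fi
}

# Examples:
#   short_job_exempt 60 5    # one-hour job submitted 5 minutes ahead
#   short_job_exempt 60 30   # same job submitted 30 minutes ahead
```

The extension rule in the second item has the same shape, with the delay measured before the end of the job instead of before its start.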

Note that with those changes, it might become harder to get resources if you do not plan at least one hour ahead.

Another change is that the usage policy now includes rules for disk reservations.

Comments about these new rules are welcome on the users mailing list.

-- Grid'5000 Team 14:00, 21 September 2017 (CET)

New production cluster in Nancy (grele)

We are glad to announce that a new cluster with 14 compute nodes is ready in the production queue in Nancy! There are 2 Nvidia GTX 1080Ti per node, providing a total of 100352 GPU cores. The nodes are also equipped with:

  • 2x Intel Xeon E5-2650 v4 (2.2GHz, 12C/24T)
  • 128GB of RAM
  • 2x 300GB SAS HDD 15K rpm

The 100Gbps Intel Omni-Path network is not available yet, but should be soon. Note that the grimani cluster will also take advantage of this high-speed, low-latency network.

This new High-End cluster, named "Grele", is jointly supported by CPER IT2MP (Innovations Technologiques, Modélisation et Médecine Personnalisée) and CPER LCHN (Langues, Connaissances & Humanités Numériques).

PS: At the moment, grele-13 and grele-14 don't have GPUs, but they should arrive very soon!
PS2: As a reminder, the rules for the production queue are here: https://www.grid5000.fr/mediawiki/index.php/Grid5000:UsagePolicy#Rules_for_the_production_queue

-- Grid'5000 Team 14:00, 10 July 2017 (CET)

Engineer positions available on Grid'5000 !

See the page of open positions.

-- Grid'5000 Team 14:00, 5 July 2017 (CET)

Grid'5000 software list updated

The Software page of the Grid'5000 wiki received a major overhaul. It lists many pieces of software of interest to Grid'5000 users, developed by the Grid'5000 community or the Grid'5000 team.

Please don't hesitate to suggest additions to that list!

-- Grid'5000 Team 13:00, 27 June 2017 (CET)

Grid'5000 services portfolio now available

The Grid'5000 services portfolio (in French: offre de service) is now available. This document (in French only, sorry), prepared by the architects committee together with the technical team, lists all services provided by Grid'5000 to its users. Those services are ranked according to their importance, measured as the resolution delay the team aims for when something breaks.

This first version is a good opportunity to provide feedback about missing services, over-inflated priority, under-evaluated priority, etc. Please tell us what is important for your experiments!

-- Grid'5000 Team 14:32, 22 June 2017 (CET)

Nvidia GTX 1080Ti GPUs now available in Lille

We are glad to announce that 16 GPUs have been added to the chifflet cluster. There are 2 Nvidia GTX 1080Ti per node, providing a total of 57344 GPU cores.
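The core count follows from the number of CUDA cores of a GTX 1080 Ti (3584, a figure from NVIDIA's specifications rather than from this announcement), with 16 GPUs spread over the 8 chifflet nodes:

```latex
\[
  \underbrace{8}_{\text{nodes}} \times \underbrace{2}_{\text{GPUs per node}} \times \underbrace{3584}_{\text{CUDA cores per GPU}} = 57\,344 \ \text{GPU cores}
\]
```

The same computation gives the total announced for grele: 14 × 2 × 3584 = 100352.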

This takes place in the context of the "Advanced data science and technologies" CPER project in which Inria invests, with the support of the regional council of the Hauts de France, the FEDER and the State.

Enjoy,

-- Grid'5000 Team 16:00, 14 June 2017 (CET)

Connect to Grid'5000 with your browser

We are pleased to announce a new feature which we hope will ease the first steps on Grid'5000: you can now get a shell through a web interface at https://intranet.grid5000.fr/shell/SITE/ . For example, for the Lille site, use https://intranet.grid5000.fr/shell/lille/ . To connect, you will have to type in your credentials twice (first for the HTTP proxy, then for the SSH connection).

-- Grid'5000 Team 11:00, 9 June 2017 (CET)

Debian Stretch minimal image available in beta

A minimal Debian Stretch image, featuring Linux 4.9, is now available on Grid'5000 for beta testing. If you use it and encounter problems, please send a report to: support-staff@lists.grid5000.fr

We are also currently working on the other variants of Stretch images (-base, -std, -xen, -big), as well as on improvements such as support for predictable network interface names. Stay tuned for more information in a few weeks.

-- Grid'5000 Team 17:00, 23 May 2017 (CET)

New "OSIRIM" storage space available to Grid'5000 users

A new remote storage space is available from frontends and nodes (in the default environment, or when deploying the -nfs and -big environments) in the "/srv/osirim/<your_username>" directory.

It provides 200GB of persistent storage to every user (using NFS and automount). If needed, more space can be requested by sending an e-mail to support-staff@lists.grid5000.fr.

This storage is kindly provided by the OSIRIM project, from Toulouse IRIT lab.

-- Grid'5000 Team 10:00, 15 May 2017 (CET)

The new disk reservation feature is available for beta testing

In order to leave large datasets on nodes between experiments, a disk reservation service has been developed, and is now available for beta testing on the grimoire cluster in Nancy.

See this page for more details.

The service will be extended to other clusters after the beta testing phase.

-- Grid'5000 Team 15:00, 10 May 2017 (CET)

3rd Network Interface on paranoia cluster

Users can now use the 3rd network interface on the paranoia cluster in Rennes. eth0 and eth1 provide 10G connections, and eth2 a gigabit Ethernet connection¹.

As a reminder, the grisou and grimoire clusters in Nancy have four 10G interfaces.

¹ For more details, see Rennes:Network

-- Grid'5000 Team 15:00, 09 May 2017 (CET)

Changing job walltime

The latest OAR version installed in Grid'5000 now makes it possible to request a change to the walltime (the duration of the resource reservation) of a running job. Such a change is granted depending on resource occupation by other jobs, and the resulting job must still comply with the Grid'5000 Usage Policy¹. You may use the oarwalltime command or the Grid'5000 API to request such a change for a job, or to query its status. See the Grid'5000 tutorials and API documentation for more information.

¹ Rules from the Usage Policy still apply on the modified job. We are however considering changing the Usage Policy to take the new feature into account, and would welcome ideas toward doing that. Please feel free to make suggestions on devel@lists.grid5000.fr.

-- Grid'5000 Team 10:00, 20 March 2017 (CET)

New cluster "nova" available at Lyon

We have the pleasure to announce that a new cluster called "nova" is available at Lyon. It features 23 Dell R430 nodes with 2 Intel Xeon E5-2620 v4, 8C/16T, 32GB DDR4, 2x300GB SAS and 10Gbps Ethernet. Energy monitoring is available for this cluster, provided by the same devices used for the other clusters in Lyon.

This cluster has been funded by Inria Rhône-Alpes.

-- Grid'5000 Team 10:00, 20 March 2017 (CET)

Grid'5000 events in your calendaring application

Grid'5000 events, such as maintenance operations or service disruptions, are published on the events page (https://www.grid5000.fr/status/). You can see a summary of this page on Grid'5000's homepage, and can also get this information through an RSS feed. Since last week, events are also published as an iCal feed ([2]), to which you can subscribe from most calendaring applications, such as Zimbra or Thunderbird's Lightning extension.

--Grid'5000 Team 13:00, 10 March 2017 (CET)

New clusters available in Lille

We have the pleasure to announce that new clusters are available on the Lille site:

  • chetemi in reference to "Bienvenue chez les ch'tis" : 15 R630 nodes with 2 Xeon E5-2630 v4, 20C/40T, 256GB DDR4, 2x300GB SAS
  • chifflet : 8 R730 nodes with 2 Xeon E5-2680 v4, 28C/56T, 768GB DDR4, 2x400GB SAS SSD, 2x4TB SATA

These nodes are connected with 10Gb Ethernet to a Cisco Nexus9000 switch (reference: 93180YC-EX)

The renewal of the Lille site takes place in the context of the CPER "Advanced data science and technologies" project, in which Inria invests with the support of the regional council of the Hauts de France, the FEDER and the State.

-- Grid'5000 Team 15:00, 02 March 2017 (CET)

Standard environment updated (v2017011315)

This week, we updated the standard environment to a new version (2017011315). This version includes a ganglia module to monitor NVIDIA GPUs, so you can now watch GPU usage on the ganglia web page (for example: Ganglia graphique-1).

We also updated other variants:

  • Ceph installation has been moved from [base] to [big]
  • libguestfs-tools has been removed from [big]
  • CUDA version has been updated to 8.0.44 [big]
  • Nvidia Driver has been updated to 375.26 [big]
  • The linux-tools package has been installed in [big]

-- Grid'5000 Team 15:00, 16 January 2017 (CET)


Grid'5000 News now available as RSS

Because messages sent to the users mailing list are not visible to users who join Grid'5000 later, the technical team will now share Grid'5000 news in a blog-like format. An RSS link is available from the News page, at this URL.

-- Grid'5000 Team 15:00, 20 December 2016 (CET)


Deployments API changes

The Deployments API available in the "sid" and "4.0" API versions has changed slightly: it now uses the native Kadeploy API. See the Deployments API page for more information.

-- Grid'5000 Team 15:11, 25 October 2016 (CET)


Grid'5000 winter school now finished

The Grid'5000 winter school took place from February 2nd to February 5th, 2016 in Grenoble. Two awards were given for presentations:

  • Best presentation award to Houssem-Eddine Chihoub and Christine Collet for their work on A scalability comparaison study of smart meter data management approaches (slides)
  • Most promising experiment to Anna Giannakou for her work Towards Self Adaptable Security Monitoring in IaaS Clouds (slides)

-- Grid'5000 Team 15:11, 25 February 2016 (CET)


10 years later, the Grid'5000 school returns to Grenoble

10 years after the first edition of the Grid'5000 school, we invite the Grid'5000 community to take part in the Grid'5000 Winter School 2016, February 2-5, 2016.

-- Grid'5000 Team 15:11, 25 December 2015 (CET)


Grid'5000 co-organizing the SUCCES 2015 event

The SUCCES 2015 days will take place November 5-6, 2015, in Paris. Dedicated to users of grid, cloud or regional HPC centers, they are organised by France Grilles, Grid'5000, Groupe Calcul and GDR ASR.

-- Grid'5000 Team 15:11, 25 October 2015 (UTC)


<endFeed />


See also older news.