Grid5000:Gotchas
{{Maintainer|Lucas Nussbaum}}
{{Portal|User}}
{{Note|text=For a more up-to-date list of Gotchas, see https://www.grid5000.fr/status/artifact/}}
This page documents various [http://en.wikipedia.org/wiki/Gotcha_(programming) ''gotchas''] (counter-intuitive features of Grid'5000) that could affect users' experiments in surprising ways.
== Network ==
Global and per-site network documentation can be found on the [[Grid5000:Network]] page.
=== Topology of ethernet networks ===
Most (large) clusters have a hierarchical ethernet topology, because ethernet switches with a large number of ports are too expensive. A good example of such a hierarchical topology is the [[Rennes:Network]] for the paravance and parasilo clusters, where nodes are connected to 3 different switches. When doing experiments that use the ethernet network intensively, it is a good idea to request nodes on the same switch, using e.g. <tt>oarsub -l switch=1/nodes=5</tt>, or to request nodes connected to a specific switch, using e.g. <tt>oarsub -p "switch='cisco2'" -l nodes=5</tt>.
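For instance, a minimal sketch of such a reservation (the walltime is only an example; <code>oarprint switch</code> is assumed to be usable from inside the job to list the switch(es) actually allocated):
<pre>
$ oarsub -I -l switch=1/nodes=5,walltime=1
$ oarprint switch
</pre>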
=== Performance of ethernet networks ===
The backplane bandwidth of ethernet switches usually does not allow full-speed communication between all the ports of the switch simultaneously.
=== High-performance networks ===
The topology of Infiniband and Omni-Path networks is generally less surprising. Two "fat-tree" topologies can be found on the testbed:
* non-blocking (''1:1''): the number of up-link ports (from leaf switches to top switches) is equal to the number of down-link ports (from nodes to leaf switches), so all nodes can communicate with each other at full speed.
* blocking (''2:1''): the number of up-link ports (from leaf switches to top switches) is half the number of down-link ports (from nodes to leaf switches), so nodes attached to the same leaf switch can communicate with each other at full speed, but not with nodes attached to other leaf switches.
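To check which high-speed interconnect (if any) a node has before reserving it, you can query the reference API; a minimal sketch, run from a frontend (the site, cluster and node names are examples, and the exact JSON layout may differ):
<pre>
$ curl -s https://api.grid5000.fr/stable/sites/grenoble/clusters/dahu/nodes/dahu-1 | jq '.network_adapters'
</pre>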
== Compute nodes ==
All Grid'5000 clusters are supposed to contain homogeneous (identical) sets of nodes, but there are some exceptions.
Global and per-site cluster documentation can be found on the [[Hardware]] page.
=== Hard disk models ===
Due to their high failure rate, hard disks tend to get replaced frequently, and it is not always possible to keep the same model during the whole life of a cluster. If this is important to you, please check the exact disk model using the reference API, as storage is described in detail for each node.
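A minimal sketch of such a check, run from a frontend (the site, cluster and node names are examples; the <code>storage_devices</code> field name follows the current reference API schema and may evolve):
<pre>
$ curl -s https://api.grid5000.fr/stable/sites/rennes/clusters/parasilo/nodes/parasilo-1 | jq '.storage_devices'
</pre>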
=== NVMe disk configuration ===
Due to many issues, we had to disable "multipath" support for NVMe disks in most of our environments. This is done by passing the "multipath=off" parameter to the <code>nvme_core</code> module, and is handled by <code class="command">g5k-postinstall</code>.
If you need NVMe multipath support, you can deploy any <code>-min</code> environment (e.g. <code>debian11-x64-min</code>), since those environments do not contain this workaround.
In such an environment, the following limitations apply:
* <code class="file">/dev/disk/by-path/</code> entries are not created
* <code class="file">/dev/diskX</code> disk aliases are not created (see below)
If you need to deploy a <code>-min</code> environment and want working disk aliases, you can use the following workaround on the running system to disable multipath support:
<pre>
# rmmod nvme nvme_core
# modprobe nvme_core multipath=off
# modprobe nvme
# sleep 10
</pre>
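You can then check that multipath support is effectively disabled by reading the module parameter back (it should print <code>N</code>):
<pre>
# cat /sys/module/nvme_core/parameters/multipath
N
</pre>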
=== Missing disk aliases ===
We provide disk aliases such as <code class="file">/dev/diskX</code> on nodes, as documented in hardware pages (e.g. [[Grenoble:Hardware#dahu]]).
However, for some configurations, we cannot provide these disk aliases:
{| class="wikitable" | |||
|- | |||
! scope="col"| Cluster | |||
! scope="col"| Environment | |||
! scope="col"| Missing disk aliases | |||
! scope="col"| Reason | |||
|- | |||
| [[Grenoble:Hardware#drac|drac]] | |||
| <code class="env">centos7-ppc64-min</code> | |||
| All disks | |||
| Non-standard PCI paths | |||
|- | |||
| [[Grenoble:Hardware#troll|troll]] | |||
| <code class="env">*-x64-min</code> | |||
| NVMe disks | |||
| "NVMe multipath" conflicts with udev | |||
|- | |||
| [[Grenoble:Hardware#yeti|yeti-1]] | |||
| <code class="env">*-x64-min</code> | |||
| <code class="file">/dev/disk4</code> | |||
| "NVMe multipath" conflicts with udev | |||
|- | |||
| [[Grenoble:Hardware#servan|servan]] | |||
| <code class="env">*-x64-min</code> | |||
| NVMe disks | |||
| "NVMe multipath" conflicts with udev | |||
|} | |||
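To see which aliases actually exist on a node you have reserved, you can simply list them; the output depends on the cluster and on the deployed environment:
<pre>
$ ls -l /dev/disk[0-9]*
</pre>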
== Software ==
* The standard environment (the one users get when not deploying) on all compute nodes is identical for a given architecture (x86-64, arm64 or ppc64), with the exception of additional drivers and software to support GPUs and high-speed networks on sites where they are available.
* The user frontends are identical on all sites.
* The reference environments (<code>*-$arch-{min,base,nfs,big}</code>) are identical on all sites, for a given architecture (see the listing example below).
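The reference environments registered for each architecture can be listed from any frontend; a minimal sketch (filtering on <code>arm64</code> is only an example):
<pre>
$ kaenv3 -l | grep arm64
</pre>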
Regarding CPU architectures, some differences can be found in environments:
{| class="wikitable"
|-
! scope="col"| Feature
! scope="col"| x86-64
! scope="col"| arm64
! scope="col"| ppc64
! scope="col"| env
|-
! scope="row"| Infiniband
| {{Yes}}
| {{Yes}}
| {{Yes}}
| ''base''
|-
! scope="row"| OmniPath
| {{Yes}}
| {{No}}
| {{No}}
| ''base''
|-
! scope="row"| NFS
| {{Yes}}
| {{Yes}}
| {{Yes}}
| ''nfs''
|-
! scope="row"| Ceph
| {{Yes}}
| {{Yes}}
| {{Yes}}
| ''nfs''
|-
! scope="row"| Cuda
| {{Yes}}
| {{No}}
| {{Yes}}
| ''big''
|-
! scope="row"| BeegFS
| {{Yes}}
| {{No}}
| {{No}}
| ''big''
|-
! scope="row"| OpenMPI
| {{Yes}}
| {{Yes}}
| {{Yes}}
| ''big''
|}
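For example, to get CUDA support on a ppc64 node, you would deploy the ''big'' variant for that architecture; a minimal sketch (the environment name and walltime are examples, adjust them to the Debian version currently provided):
<pre>
$ oarsub -I -t deploy -l nodes=1,walltime=2
$ kadeploy3 -e debian11-ppc64-big -f $OAR_NODEFILE -k
</pre>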