Disk reservation
Revision as of 10:18, 13 October 2021
Note | |
---|---|
This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team. |
Disk reservation consists of reserving additional hard disks on nodes; these disks are otherwise not usable.
The table below shows the Grid'5000 clusters with such additional hard disks available for reservation.
Site | Cluster | Number of nodes | Number of reservable disks per node |
---|---|---|---|
Grenoble | yeti | 4 | 3 |
Lille | chiclet-8 | 1 | 1 |
Lille | chiclet-[1-7] | 7 | 2 |
Lille | chifflot | 8 | 5 |
Lyon | gemini | 2 | 4 |
Nancy | gros | 124 | 1 |
Nancy | grouille | 2 | 1 |
Rennes | parasilo | 27 | 5 |
Last generated from the Grid'5000 Reference API on 2024-11-05 (commit 359267a37d)
How it works
Two use cases of disk reservation are possible:
- Long-running reservations of disks only (job reserving no host, i.e. no processing power)
- Disk-only reservations do not have to fit in the day vs. night & week-end host reservation policy, and can last up to many days (see Grid5000:UsagePolicy). The reserved disks can then be used by regular host jobs during the disk reservation period. In this use case, the goal is to get more persistence for the local storage of nodes, e.g. to avoid reformatting disks and reimporting datasets in each regular host job. Those long-running jobs must use the noop OAR job type.
- Regular jobs reserving both host and disks
- In this use case, the goal is to get access to the reservable disks within the experiment, just as if the disks did not have to be reserved separately.
In both cases, making use of the reserved disks requires root privileges, since disks are provided as bare-metal hardware to be partitioned, formatted, mounted and filled by the experimenter without restriction. As a result, the experimenter can use the reserved disks:
- either in a non-deploy job, in the standard environment, after enabling sudo with the sudo-g5k command;
- or in a deploy job, in a kadeployed environment (use the deploy OAR job type, then kadeploy).
Technically speaking, when a deploy job starts, or whenever sudo-g5k is called in a non-deploy job, the reserved disks stay available (as shown by lsblk) while the other disks are disabled and disappear.
Warning | |
---|---|
Mind that some disks may show up in
|
Reserved disks can only be accessed by the user who reserved them.
Please note that reserved disks are not cleaned up at the end of a reservation. As a result:
- Data left on the disks can be accessed by users in later reservations.
- Reserved disks may first need to be cleaned up before use (remove previous formatting and partitioning).
See also Security issues.
Usage
The main commands to reserve disks are given below.
The maximum duration of a disk reservation is defined in the Usage Policy.
Note | |
---|---|
In the following example, add |
Reserve disks and nodes at the same time
- How to reserve a node with only the main disk (none of the additional disks), on the grimoire cluster
(no change to the way a node was to be reserved in the past, before the disk reservation mechanism existed.)
- How to reserve a node with all its disks
- How to reserve nodes grimoire-1 and grimoire-2 with one reservable disk per node
fnancy :
|
oarsub -I -p "host in ('grimoire-1.nancy.grid5000.fr','grimoire-2.nancy.grid5000.fr')" -l /host=2+{"type='disk'"}/host=2/disk=1 |
Note | |
---|---|
Yes, the syntax of the last oarsub command is a bit awkward, so please be careful and mind having:
|
Reserve disks and nodes separately
You may, for example, decide to reserve some disks for one week, but the nodes where your disks are located only when you want to carry out an experiment.
First: reserve the disks
Since we first want to reserve disks only, we use the noop job type: with this job type, OAR will not try to execute anything on the job resources (which is what we want, since disk resources are not capable of executing programs).
(Please note that jobs of type noop cannot be interactive: oarsub -I -t noop ... is not supported.)
Three examples:
Reserve two disks on grimoire-1 for one week, starting on 2018-01-01:
fnancy :
|
oarsub -r "2018-01-01 00:00:00" -t noop -l {"type='disk' and host='grimoire-1.nancy.grid5000.fr'"}/host=1/disk=2,walltime=168 |
Or reserve the first two reservable disks on grimoire-2 (named disk1 and disk2, since disk0 is the system disk which is not reservable):
fnancy :
|
oarsub -r "2018-01-01 00:00:00" -t noop -l {"type='disk' and host='grimoire-2.nancy.grid5000.fr' and disk in ('disk1.grimoire-2', 'disk2.grimoire-2')"}/host=1/disk=2,walltime=168 |
Or reserve all disks on two nodes:
fnancy :
|
oarsub -r "2018-01-01 00:00:00" -t noop -l {"type='disk' and cluster='grimoire'"}/host=2/disk=ALL,walltime=168 |
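The single-host requests above follow one pattern: a property filter on type='disk' and the host name, followed by the resource hierarchy and the walltime. As a minimal sketch, the hypothetical helper below (disk_only_request is not part of OAR or the Grid'5000 tooling) prints that expression in the form OAR receives it, i.e. after the shell has removed the double quotes used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an OAR command):
# prints the -l expression for a disk-only reservation on a single host.
disk_only_request() {
    host=$1      # fully qualified node name
    ndisks=$2    # number of disks, or ALL
    hours=$3     # walltime in hours
    printf "{type='disk' and host='%s'}/host=1/disk=%s,walltime=%s\n" \
        "$host" "$ndisks" "$hours"
}

# Print the request for two disks on grimoire-1 for one week (168 hours).
disk_only_request grimoire-1.nancy.grid5000.fr 2 168
```

The result could then be passed to oarsub, for example: oarsub -r "2018-01-01 00:00:00" -t noop -l "$(disk_only_request grimoire-1.nancy.grid5000.fr 2 168)".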
Second: reserve the nodes
You can then reserve nodes grimoire-1 and grimoire-2 for 3 hours, in the usual way:
fnancy :
|
oarsub -I -l {"host in ('grimoire-1.nancy.grid5000.fr', 'grimoire-2.nancy.grid5000.fr')"}/host=2,walltime=3 |
You must respect this order: reserve the disks first, then reserve the nodes. Otherwise, the disks you reserved will not be available on your nodes.
Checking the state of reserved disks
Gantt diagrams with disk reservations
Reservations of both nodes (processors) and disks are displayed on the following Gantt diagrams:
Getting information about disk reservations from OAR and G5K APIs
- The OAR API shows the properties of each resource of a job. You can retrieve the properties of your reserved disks, such as disk or diskpath:
fnancy :
|
curl https://api.grid5000.fr/3.0/sites/<site>/internal/oarapi/jobs/<job_id>/resources.json (or resources.yaml) |
- The Grid'5000 API also provides some details about disk reservations under the "disks" key of the status and jobs APIs:
Using local disks once connected on the nodes
Log in as root on a node where you reserved one or more disks:
- either use sudo-g5k -i from the standard environment to become root
- or log in with SSH as root on an environment you deployed
→ Example of oarsub command for such a reservation:
fnancy :
|
oarsub -t exotic -l "{type='disk' and host like 'yeti-1.%' and disk like 'disk2.%'}"/disk=1+"{type='default' and host like 'yeti-1.%'}"/host=1 -I |
(then ssh yeti-1 and run sudo-g5k).
All examples below assume that you are already logged in as root on the node.
Discovering available disks
The lsblk command lists all block devices. For instance, on a yeti machine in Grenoble, this might show:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 3.7G 0 part [SWAP]
├─sda2 8:2 0 19.6G 0 part /
├─sda3 8:3 0 22.4G 0 part
├─sda4 8:4 0 1K 0 part
└─sda5 8:5 0 401.5G 0 part /tmp
sdc 8:32 0 1.8T 0 disk
nvme0n1 259:0 0 1.5T 0 disk
nvme1n1 259:1 0 1.5T 0 disk
In this case:
- disk0 is shown as sda and is the system disk, so it is always available
- disk2 is shown as sdc and has been reserved explicitly, so it is visible
- disk1 and disk3, which should map to sdb and sdd, do not show up: they have not been reserved for this example
- disk4 and disk5 are shown as nvme0n1 and nvme1n1; they are NVMe SSDs and are always available (not reservable)
You can compare the output with the reference data shown in Grenoble:Hardware#yeti.
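To check quickly which whole disks are visible, the lsblk output can be filtered on its TYPE column. The sketch below assumes the default column layout shown above (TYPE in the sixth column) and uses a hypothetical helper name, visible_disks; the sample input is an excerpt of the yeti listing.

```shell
#!/bin/sh
# Reads lsblk output on stdin and prints the name and size of each whole disk.
# Assumes the default lsblk layout, where column 6 is TYPE and column 4 is SIZE.
visible_disks() {
    awk '$6 == "disk" { print $1, $4 }'
}

# Sample input taken from the yeti example above.
# On a node, run instead: lsblk | visible_disks
visible_disks <<'EOF'
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    0 447.1G  0 disk
├─sda1        8:1    0   3.7G  0 part [SWAP]
sdc           8:32   0   1.8T  0 disk
nvme0n1     259:0    0   1.5T  0 disk
nvme1n1     259:1    0   1.5T  0 disk
EOF
```

A reservable disk that is missing from this list (here sdb and sdd) has not been reserved by the current job.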
If using an environment where the disk aliases are activated (default environment, or deployed environment where g5k-postinstall is called with the --disk-aliases option), the following symlinks show the mapping between the diskN and sdX names:
# ls -l /dev/disk[0-9]*
lrwxrwxrwx 1 root root 3 Oct 13 09:14 /dev/disk0 -> sda
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p1 -> sda1
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p2 -> sda2
lrwxrwxrwx 1 root root 4 Oct 13 09:15 /dev/disk0p3 -> sda3
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p4 -> sda4
lrwxrwxrwx 1 root root 4 Oct 13 09:14 /dev/disk0p5 -> sda5
lrwxrwxrwx 1 root root 3 Oct 13 09:14 /dev/disk2 -> sdc
lrwxrwxrwx 1 root root 7 Oct 13 09:14 /dev/disk4 -> nvme0n1
lrwxrwxrwx 1 root root 9 Oct 13 09:14 /dev/disk4p1 -> nvme0n1p1
lrwxrwxrwx 1 root root 9 Oct 13 09:14 /dev/disk4p2 -> nvme0n1p2
lrwxrwxrwx 1 root root 7 Oct 13 09:14 /dev/disk5 -> nvme1n1
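In scripts, it can be convenient to resolve an alias to its kernel device name. A minimal sketch, assuming the /dev/diskN aliases are present and that readlink -f (GNU coreutils) is available; alias_target is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: prints the kernel device name a symlink points to.
alias_target() {
    basename "$(readlink -f "$1")"
}

# Print the mapping for every whole-disk alias, skipping partition aliases
# such as /dev/disk0p1.
for link in /dev/disk[0-9]*; do
    [ -L "$link" ] || continue          # glob matched nothing, or not a symlink
    case $link in *p[0-9]*) continue ;; esac
    echo "${link#/dev/} -> $(alias_target "$link")"
done
```

On a node without disk aliases, the loop simply prints nothing.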
It is also possible to display disks with their PCI path, which is guaranteed to always be the same (unless the hardware is physically modified):
# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx 1 root root 9 Oct 7 20:11 pci-0000:18:00.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Oct 7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Oct 7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Oct 7 20:12 pci-0000:18:00.0-scsi-0:0:0:0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Oct 7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 Oct 7 20:11 pci-0000:18:00.0-scsi-0:0:0:0-part5 -> ../../sda5
lrwxrwxrwx 1 root root 9 Oct 7 20:11 pci-0000:18:00.0-scsi-0:0:2:0 -> ../../sdc
lrwxrwxrwx 1 root root 13 Oct 7 20:11 pci-0000:59:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Oct 7 20:11 pci-0000:6d:00.0-nvme-1 -> ../../nvme1n1
Here, we see that sdc has the PCI path pci-0000:18:00.0-scsi-0:0:2:0, which matches the second reservable disk listed on Grenoble:Hardware#yeti.
Partitioning a disk
To start using the disk, you will likely need to partition it. Several tools exist to do this: fdisk, sfdisk, cfdisk, parted...
For example, to partition the second 2 TB disk of a yeti machine interactively:
Use the interactive prompt to create a single partition of type "Linux filesystem", possibly by deleting existing partitions first.
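As a non-interactive alternative to the interactive session described above, the same result can be sketched with parted in script mode. This is an illustration, not the procedure of this page: it assumes the reserved disk is reachable as /dev/disk2 (as in the yeti example), and the run and partition_whole_disk helpers are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them as root, which destroys the existing partition table.

```shell
#!/bin/sh
# Prints the command when DRY_RUN is 1 (the default), executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Creates a new GPT label and one partition spanning the whole disk.
# WARNING: with DRY_RUN=0 this erases the existing partition table.
partition_whole_disk() {
    dev=$1
    run parted --script "$dev" mklabel gpt
    run parted --script "$dev" mkpart primary ext4 0% 100%
}

# /dev/disk2 is an assumption based on the yeti example above.
partition_whole_disk /dev/disk2
```

Keeping the dry-run default makes it possible to review the exact commands before touching a disk that may still hold data from a previous reservation.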
As an advanced usage, you could use LVM to create logical volumes that may span several disks, or mdadm to create software RAID volumes.
Creating a filesystem
Continuing the previous example, let's create an ext4 filesystem on the first partition of the same disk:
Mount it and check that it appears:
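The two steps above (creating the filesystem, then mounting and checking it) could be sketched as follows. The partition name /dev/disk2p1 follows the alias naming shown earlier and the mount point /mnt is an assumption; format_and_mount is a hypothetical helper that refuses to act on anything that is not a block device.

```shell
#!/bin/sh
# Hypothetical helper: creates an ext4 filesystem on a partition, mounts it,
# and shows the mounted filesystem. Must be run as root.
format_and_mount() {
    part=$1   # partition to format, e.g. /dev/disk2p1
    mnt=$2    # mount point, e.g. /mnt
    if [ ! -b "$part" ]; then
        echo "not a block device: $part" >&2
        return 1
    fi
    mkfs.ext4 "$part" &&
        mkdir -p "$mnt" &&
        mount "$part" "$mnt" &&
        df -h "$mnt"
}

# Usage (as root; destroys any data on the partition):
#   format_and_mount /dev/disk2p1 /mnt
```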
As an advanced usage, you may use any filesystem: Btrfs, HDFS, Ceph, ZFS, Beegfs, etc. Refer to the documentation for each of these systems for guidance.
Troubleshooting
When partitioning or formatting local disks, you might encounter an error such as:
Error: Partition(s) on /dev/sdb are being used
This may be because the disks already contained partitions of a certain type (LVM, software RAID...) from a previous job, and your system automatically started using them. To solve this, you have several options:
- use a tool such as wipefs or pvremove to remove previous information from the disk.
- use a low-level tool such as dd to completely erase the beginning of the disk, and reboot. Use with care as it can destroy your data.
For instance, here is an example script that cleans up disks automatically: https://github.com/pmorillon/terraform-provider-grid5000/blob/master/examples/ceph/modules/rook_ceph/files/disk-format.sh.tmpl
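For a single disk, the wipefs option can be sketched as below. This is a minimal illustration with a hypothetical helper name, wipe_signatures; it only removes the signatures found on the device it is given (not those inside its partitions), and it does not deactivate LVM or software RAID volumes that the system may already have started.

```shell
#!/bin/sh
# Hypothetical helper: erases filesystem, RAID and partition-table signatures
# from a device, after two basic safety checks. Must be run as root.
wipe_signatures() {
    dev=$1    # e.g. /dev/disk2
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    # Refuse if the disk or one of its partitions is mounted or used as swap.
    if lsblk -n -o MOUNTPOINT "$dev" | grep -q .; then
        echo "refusing: $dev is in use" >&2
        return 1
    fi
    wipefs --all "$dev"
}

# Usage (as root; destroys the data layout on the disk):
#   wipe_signatures /dev/disk2
```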
Security issues
The mechanism used to enable/disable disks is designed to avoid mistakes from other users. However, a malicious user could take control of the RAID card, enable any disk, and access or erase your data. Please notify the Grid'5000 tech-team if this happens, but above all take care to secure your data:
- Keep a copy (backup) in a safe place if relevant for your data;
- If your data is sensitive, consider using cryptographic mechanisms to secure it.
Also, the data on reserved disks is not automatically erased at the end of your job. If you don't want the next user to access it, you have to erase it yourself.
Finally, no backup of data stored on the reserved disks is made.