{{Portal|User}}
{{Portal|Tutorial}}
{{TutorialHeader}}
__TOC__
Some [[Hardware#PMEM_size_per_node|nodes of Grid'5000]] feature the new '''Persistent Memory''' technology. At the time of writing, the [[Grenoble:Hardware#troll|troll]] cluster in Grenoble is equipped with it.


= Forewords =
This ''Persistent Memory'' technology is known by many different names, e.g.
* nvdimm (generic term; nvdimm-N = battery-backed DRAM, nvdimm-P...)
* SCM (storage class memory)
* PMM/PMEM
In the rest of this document, we'll use the '''PMEM''' acronym.


The PMEM technology currently available in Grid'5000 is '''Intel's Optane DC Persistent Memory'''. Other vendors may provide PMEM in the future (IBM, HPE Memristor?). PMEM has also been available for tests in emulators such as QEMU for a long time.


This technology consists of DIMMs (just like DRAM) but offers a different set of characteristics:
* It fills the gap between memory and storage: RAM <x10< PMEM <x100< high-end NVMe SSD in terms of latency
* Persistence: can be used as (persistent) memory or filesystem on steroids
* Byte addressable, zero-copy memory mapping
* No energy consumption when idle, but more than RAM when used
* Lower price per GB compared to DRAM, larger memory sizes than DRAM
This technology is not to be confused with the generic term NVRAM or with NVMe storage (SSD drives on top of PCIe).


= Intel's PMEM settings =
Intel's PMEM can be configured in 2 modes:
; Memory
* Just more RAM, no persistence. DRAM serves as a cache (it disappears from the operating system's viewpoint).
; App direct
* Many choices of configuration:  
** DIMMs interleave option in the region (change needs reboot)
** region splits in namespaces (change may need reboot)
** sector, fsdax, devdax, kmem (kmem not available before Linux 5.1)
; Mix mode
* It is also possible to allocate part of the memory to Memory mode and part of it to App Direct


'''In order to change the configuration (e.g. from Memory mode to App Direct mode, or vice versa), a reboot of the machine is needed'''.
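
To check how a node is currently provisioned, the reporting commands of <code class="command">ipmctl</code> can be used, for instance (a minimal sketch; the exact output layout depends on the ipmctl version):
 $ ipmctl show -memoryresources  # capacity allocated to Memory mode vs App Direct
 $ ipmctl show -region           # App Direct regions, if any
 $ ipmctl show -goal             # pending provisioning goal, if a reconfiguration is scheduled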


= Grid'5000 setup for experimentation =
'''The choice in Grid'5000 has been to configure PMEM in Memory mode by default'''.
That means that the PMEM is in Memory mode (it appears just like more RAM) in the Grid'5000 default environment (when not deploying).
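
In that default setup, no PMEM-specific tooling is needed: the extra capacity simply shows up as regular system RAM. A quick sanity check from the standard environment (a sketch; the exact figures depend on the node):
 $ free -h             # the reported total memory includes the PMEM capacity
 $ numactl --hardware  # NUMA layout (if numactl is installed)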


'''Kadeploying allows experimenting with the App Direct mode'''. We encourage users who want to experiment with the App Direct mode to deploy a very recent system (e.g. Debian testing), in order to benefit from the latest support for PMEM.


To that purpose, jobs need to be of the ''deploy'' type, and kadeploy must be used:
{{Term|location=fgrenoble|cmd=<code class="command">oarsub -p "cluster='troll'" -t deploy -I</code>}}
Then:
{{Term|location=fgrenoble|cmd=<code class="command">kadeploy3 -e debiantesting-x64-min -f $OAR_NODEFILE -k</code>}}
 
Once a node is deployed, one can connect to it as root, install the PMEM software and possibly change the configuration and reboot to apply it.
 
The PMEM software tools are:
* <code class="command">ipmctl</code>: tool to change the config of Intel's PMEM (switch mode, etc.)
* <code class="command">ndctl</code>: tool to configure PMEM when in App Direct mode
* <code class="command">daxctl</code>: tool to configure the PMEM direct access (dax)
Install in Debian testing as follows:
{{Term|location=troll-2|cmd=<code class="command">apt install ipmctl ndctl daxctl</code>}}
 
See the tools' man pages or external documentation (see the [[#References|references]] section) to find out how to use them.
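
As a quick sanity check that the tools see the hardware, one can for instance run (a sketch; DIMM, region and namespace names vary):
 $ ipmctl show -dimm  # list the Optane DIMMs
 $ ndctl list -RN     # list regions (-R) and namespaces (-N), if any
 $ daxctl list        # list device-DAX devices, if any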


For instance, to change to App Direct mode with DIMMs interleaved, one can run:
{{Term|location=troll-2|cmd=<code class="command">ipmctl create -goal MemoryMode=0</code>}}
And then '''reboot'''.


Reboot time of the machine is pretty long (~ 10 minutes), so be patient. You might want to look at the console to follow the progress:
{{Term|location=fgrenoble|cmd=<code class="command">kaconsole3 -m </code><code class="replace">troll-2</code>}}
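
Once the node is back up in App Direct mode, one can for instance create an ''fsdax'' namespace and mount it with DAX. Below is a minimal sketch, assuming the interleaved region shows up as <code>region0</code>, the resulting block device as <code>/dev/pmem0</code> (names may differ on the node), and using <code>/mnt/pmem</code> as an arbitrary mount point:
 $ ndctl create-namespace --mode=fsdax --region=region0  # create an fsdax namespace on the region
 $ mkfs.ext4 /dev/pmem0                                  # format the resulting pmem block device
 $ mkdir -p /mnt/pmem
 $ mount -o dax /dev/pmem0 /mnt/pmem                     # mount with direct access (DAX)
Applications can then <code>mmap</code> files from <code>/mnt/pmem</code> and get byte-addressable, zero-copy access to the persistent memory.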


; Important notes
* Please mind that '''when a job is terminated, the nodes of the job are automatically reconfigured to the default mode of operation, that is Memory mode'''.
* Please mind that '''sudo-g5k is of NO help to experiment with the App Direct mode''', since rebooting the node after changing the configuration will terminate the job, and switch it back to Memory mode. Using the App Direct mode requires kadeploying.


= References =
* VirtIO-PMEM https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.3-VirtIO-PMEM
* https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00074717en_us&docLocale=en_US
= Setup for experimentation =
== Regular access in the std environment (no root) ==
* what mode by default?
** easier: Memory mode
*** provides 1.5TB of RAM to users
*** will attract users not interested in the primary goal of the technology (experimenting with the new technology), but just looking for a huge memory space
*** not reasonably switchable to App Direct with sudo-g5k (see below)
** but the technology was primarily bought for the App Direct mode.
** Mix mode does not seem to bring an interesting trade-off.
* if App Direct: what configuration?
** allow choosing or changing the mode? (fsdax, sector, devdax, kmem...)
*** in the job submission command?
*** with a wrapper command to run in the job?
** how to share down to the core resource level? How to split/share?
*** 6 nvdimms / 16 cores per CPU, 2 CPUs.
*** can be interleaved or dedicated at the CPU level
*** see 4.2 of https://hal.inria.fr/hal-02173336/document
*** kmem (PMEM configured as an extra NUMA memory node), but not supported in Buster (Linux 4.19); see the sketch after this list
*** create namespaces with sizes in the same ratio as the reserved part of the machine/CPU? Seems complex for the added value.
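
For the kmem case mentioned in the list above, the usual sequence (with a recent enough kernel and daxctl) is roughly the following sketch, assuming the device-DAX device comes up as <code>dax0.0</code>:
 $ ndctl create-namespace --mode=devdax                # expose the region as a device-DAX device
 $ daxctl reconfigure-device --mode=system-ram dax0.0  # hotplug it as an extra NUMA memory node (kmem)
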
== Sudo-g5k ==
* not useful if Memory mode
** any change from Memory mode needs a reboot → mostly the end of the job (unless not on the 1st node, but would that be reasonable?)
* limited support for App Direct
** Buster's kernel is too old for kmem
** a recent kernel is desirable (> 5.x)
** cannot change the regions (interleave, etc.)
* clean-up/reset needed after the job
== Full access (root), entire node ==
* users to deploy debiantesting in order to get the latest functionalities (Linux 5.3 is now in Debian testing, currently 5.2 in the env; an update would be welcome)
* allow switching mode (Memory <-> App Direct) during the kadeploy phase (in postinstall + force reboot / no kexec?).
* give access to the tools (ipmctl, ndctl, daxctl) directly in the debiantesting env?
* no need for a wrapper
* users to reboot to switch mode or region settings
* clean-up/reset needed after the job
** recover from a mode switch or region configuration
** or from App Direct setups: dropping namespaces to reset to a default config with 2 empty regions → may need a reboot to delete namespaces
** → where should it be done?
= How to reset the PMEM config =
=== From Linux using ipmctl/ndctl ===
This requires booting a Linux system with the tools, then rebooting again...
; ipmctl commands:
* to set the config back to the wanted mode (Memory, App Direct, App Direct w/o interleave...) → changing mode or region always requires a reboot
$ ipmctl create -goal MemoryMode=100 # Memory Mode
or
$ ipmctl create -goal MemoryMode=0 # App Direct, DIMMs interleaved
or
$ ipmctl create -goal MemoryMode=0 PersistentMemoryType=AppDirectNotInterleaved # App Direct, DIMMs not interleaved
* to remove any namespaces if App Direct → deleting namespaces may require a reboot (see the sketch below)
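
For the namespace cleanup, something like the following sketch should do (<code>-f</code> forces the destruction of active namespaces):
 $ ndctl disable-namespace all    # take all namespaces offline
 $ ndctl destroy-namespace -f all # then destroy them
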
=== With the iDRAC ===
* create a job queue to reconfigure → this will also reboot the system many times, but we do not have to handle it...
* The command is (but it does not show up in the completion):
racadm>>set BIOS.PmCreateGoalConfig.PmPersistentPercentage PmPercent100
racadm set BIOS.PmCreateGoalConfig.PmPersistentPercentage PmPercent100 
[Key=BIOS.Setup.1-1#PmCreateGoalConfig]
RAC1017: Successfully modified the object value and the change is in
        pending state.
        To apply modified value, create a configuration job and reboot
        the system. To create the commit and reboot jobs, use "jobqueue"
        command. For more information about the "jobqueue" command, see RACADM
        help.
racadm>>jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW
racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW 
RAC1024: Successfully scheduled a job.
Verify the job status using "racadm jobqueue view -i JID_xxxxx" command.
Commit JID = JID_762830527675
Reboot JID = RID_762830527722
* it's a very long process, which requires a first reboot to run the job queue changing the BIOS setting, then a second reboot to apply the change, and a third reboot with the change in effect...
* bizarrely, when trying to get the current state of the machine with a get on BIOS.PmCreateGoalConfig.PmPersistentPercentage, it continues to report ''PmPercentNone'' (which means Memory mode), although we just switched back to App Direct...
* but we can get the current state with:
racadm>>get BIOS.IntelPersistentMemorySettingsMainMenu.PmMemoryCapacity
racadm get BIOS.IntelPersistentMemorySettingsMainMenu.PmMemoryCapacity
[Key=BIOS.Setup.1-1#IntelPersistentMemorySettingsMainMenu]
PmMemoryCapacity=0 B
* or
racadm>>get BIOS.IntelPersistentMemorySettingsMainMenu.PmAppDirectCapacity
racadm get BIOS.IntelPersistentMemorySettingsMainMenu.PmAppDirectCapacity
[Key=BIOS.Setup.1-1#IntelPersistentMemorySettingsMainMenu]
PmAppDirectCapacity=1.4 TB
UPDATE: this no longer works!
=== Reset mechanism proposals ===
; Using the Linux commands
* (a) in the '''kadeploy kernel''':
** reset could be triggered in the epilogue of jobs (after deploy or sudo-g5k)
** it would require always booting the deploy kernel (not always the case after a deploy? → force destructive mode?)
** and forcing a reboot after fixing the PMEM conf (kexec would not let the change take effect, a real reboot is needed).
* (b) in the '''g5k-checks run''':
** after booting to the std env (e.g. if we do nothing in the deploy kernel, or when no redeploy happened), g5k-checks could look at the PMEM configuration:
** if it is wrong, it could fix it with the Linux commands and trigger a new reboot instead of setting the node to Alive (see the sketch after this list)
** that way it is correct at the next boot (the timeout is to be configured with regard to that possible extra reboot)
(both methods could be implemented at the same time, to optimize a bit, but the latter seems sufficient)
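
As an illustration of option (b), the check could boil down to something like the following hypothetical sketch (the actual g5k-checks integration and the way the reboot is triggered are left open; the grep on the region listing and ipmctl's <code>-f</code> force flag, used here to skip the interactive confirmation, are assumptions):
 # hypothetical sketch: if an App Direct region is provisioned, reset to 100% Memory mode
 if ipmctl show -region | grep -q AppDirect; then
     ipmctl create -f -goal MemoryMode=100  # the goal is applied at the next boot
     reboot
 fi
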
; Using the iDRAC
* it could be done '''in the job epilogue in place of the simple reboot''', but it is longer than doing it from Linux because it actually needs 3 reboots.
; Timings:
* simple reboot/netboot: 6 min
* reboot/netboot + reset pmem from Linux + reboot/netboot: 12 min
* reset pmem from the iDRAC + reboot + reboot + reboot/netboot: 14 min
; Phoenix
Another option is to just let g5k-checks make the node become suspected (because of the incorrect configuration), and let phoenix fix the node somehow.
But it seems worth avoiding a solution where a node is in the suspected state as part of its regular functioning: this limits the overhead, the admins' burden of a special case for that cluster, and the confusion of users trying to understand the OAR states of the nodes when looking at the Gantt diagram, for instance.
