Monitoring Using Kwollect
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). As it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
This page describes the monitoring service available in Grid’5000 based on Kwollect to retrieve environmental and performance metrics from nodes.
The service currently provides metrics for:
- Energy consumption from dedicated “wattmetre” devices (currently available for some clusters in Lyon, Grenoble, Nancy)
- Metrics collected from nodes’ Baseboard Management Controller ("BMC": out-of-band management hardware, such as Dell iDRAC), such as ambient temperature, hardware component temperature, energy consumption from PSU, fan speed, etc.
- Traffic collected from network devices
- Energy consumption from PDU, when available
- Node metrics from Prometheus node exporter (and Nvidia DCGM exporter when GPU is available)
Metrics available
The list of metrics available for a given Grid’5000 cluster is described in the Reference API, under the “metrics” entry of the cluster description. For instance, to get the list of available metrics for the nodes of the taurus cluster, you can use the following command (API requests must be performed from inside Grid'5000, or must otherwise supply authentication credentials):
$ curl https://api.grid5000.fr/stable/sites/lyon/clusters/taurus | jq .metrics
This returns a list where each element describes a single metric. The most important fields of that description are:
- name: The name identifying the metric
- description: A human-readable description of the metric
- labels (optional): A “label” that will be used to distinguish two metrics of the same kind, but targeting different objects (e.g. temperature of CPU #1 vs temperature of CPU #2)
- period and optional_period: The interval (in milliseconds) at which the metric is collected. The former is the default interval, the latter is the interval used when the metric is activated “on demand” (see below). A metric whose period is 0 is not collected by default.
- only_for (optional): When the metric is not available on all nodes of the cluster, the list of nodes where it is available
The full list of metrics is available at the end of this page.
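The period field can also be used to get a rough idea of how many values a request will return. The sketch below is only an illustration: the function name estimate_value_count is hypothetical, and it assumes one series per node, whereas metrics with labels produce several series per node.

```shell
# Hypothetical helper: estimate how many values are collected for one metric.
# Usage: estimate_value_count <period_ms> <duration_seconds> <node_count>
# Assumption: one series per node (metrics with labels yield more values).
estimate_value_count() {
    period_ms=$1; duration_s=$2; nodes=$3
    # A period of 0 means the metric is not collected by default.
    if [ "$period_ms" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( duration_s * 1000 / period_ms * nodes ))
}

# One hour of a metric collected every 1000 ms on 2 nodes:
estimate_value_count 1000 3600 2
```

For a metric collected every 20 ms, one hour on a single node already amounts to 180000 values, which is worth keeping in mind when choosing the time range of a request.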
Getting metrics values
Metric values are stored by Kwollect and are available through the Metrology API, by performing a GET request with the appropriate parameters at the URL:
https://api.grid5000.fr/stable/sites/<site>/metrics
For instance, to:
- get all metrics collected inside job 1304978 at Lyon:
curl 'https://api.grid5000.fr/stable/sites/lyon/metrics?job_id=1304978'
- get all metrics from chifflot-5 and chifflot-6 between 2021-06-08 15:00 and 2021-06-08 17:00:
curl 'https://api.grid5000.fr/stable/sites/lille/metrics?nodes=chifflot-5,chifflot-6&start_time=2021-06-08T15:00&end_time=2021-06-08T17:00'
- get all values from Wattmetre for taurus-10, during the last 15 minutes:
curl "https://api.grid5000.fr/stable/sites/lyon/metrics?nodes=taurus-10&metrics=wattmetre_power_watt&start_time=$(date -d '15 min ago' +%s)"
The request will return a JSON-formatted list of metric values and their associated timestamps.
For a complete description of parameters and returned fields, see the API specification at: https://api.grid5000.fr/doc/stable/#tag/metrics
Important note: To avoid overloading Kwollect servers, the request size is limited. When too many metric values are requested, you may hit that limit and receive an error message. In that case, try to reduce the size of your request by selecting fewer nodes or a shorter time period.
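One way to shorten the time period of each request is to split a long time range into consecutive chunks and issue one request per chunk. The following is a minimal sketch, not an official tool: the function name kwollect_chunked_urls is hypothetical, and it assumes that end_time accepts Unix timestamps in the same way as start_time does in the example above.

```shell
# Hypothetical helper: print one Metrology API URL per time chunk.
# Usage: kwollect_chunked_urls <site> <nodes> <start_epoch> <end_epoch> <chunk_seconds>
# Assumption: end_time accepts Unix timestamps, like start_time.
kwollect_chunked_urls() {
    site=$1; nodes=$2; start=$3; end=$4; step=$5
    # Refuse a non-positive chunk size, which would loop forever.
    [ "$step" -gt 0 ] || return 1
    while [ "$start" -lt "$end" ]; do
        stop=$((start + step))
        # Clamp the last chunk to the requested end time.
        if [ "$stop" -gt "$end" ]; then stop=$end; fi
        printf 'https://api.grid5000.fr/stable/sites/%s/metrics?nodes=%s&start_time=%s&end_time=%s\n' \
            "$site" "$nodes" "$start" "$stop"
        start=$stop
    done
}

# Example (from inside Grid'5000): the last two hours in 10-minute chunks.
# kwollect_chunked_urls lille chifflot-5,chifflot-6 \
#     "$(date -d '2 hours ago' +%s)" "$(date +%s)" 600 |
#     while read -r url; do curl -s "$url"; done
```

Each chunk ends where the next one starts, so a value whose timestamp falls exactly on a boundary may be returned twice and may need deduplicating.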
Tip: If you need metrics formatted as CSV, you can use this command line:
curl <URL_to_request_some_metrics> | jq -r '.[] | [.timestamp, .device_id, .metric_id, .value, .labels|tostring] | @csv'
On-demand metrics
Some metrics are not collected by default and must be activated “on-demand”. These metrics have a period field equal to 0 in their description (see above).
On-demand metrics can be enabled for specific jobs by adding the “-t monitor=<metric_to_enable>” option to oarsub. For example:
$ oarsub -I -t monitor=bmc_cpu_temp_celsius
The provided argument can be a regular expression. For instance, to enable all metrics related to temperature:
$ oarsub -I -t monitor='.*temp.*'
To enable all available “on-demand” metrics, the -t monitor option can be used.
Note:
- Dedicated wattmetre devices (metric “wattmetre_power_watt”) are able to perform one measurement every 20 milliseconds. However, this high frequency is only provided using “on demand” activation. For instance, submit your job using:
$ oarsub -I -t monitor='wattmetre_power_watt'
By default, only the value averaged over one second is provided.
- As Prometheus metrics depend on node characteristics, they cannot be fully described. Only a subset of Prometheus metrics is collected by default (described in the API by prom_default_metrics and, when relevant, prom_nvgpu_default_metrics). To enable the collection of all Prometheus metrics, use "on-demand" activation on prom_all_metrics or prom_nvgpu_all_metrics (for instance, use monitor='prom_.*').
- Monitoring of the default set of Prometheus metrics (prom_default_metrics) is enabled for jobs running the standard environment. For deployed nodes, Prometheus monitoring must be activated "on-demand".
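Before submitting a job, it can be useful to preview which metric names a monitor pattern would select. The sketch below is an illustration only: the function name preview_monitor_pattern is hypothetical, and it assumes that the pattern must match the whole metric name, which this page does not state; the actual matching rules applied by Kwollect may differ.

```shell
# Hypothetical helper: read metric names on stdin (one per line) and print
# those matched by the given pattern.
# Assumption: the pattern must match the whole name (grep -x); the real
# matching performed for "-t monitor=<pattern>" may differ.
preview_monitor_pattern() {
    grep -E -x -- "$1"
}

# Which of these metrics would '.*temp.*' select?
printf '%s\n' bmc_cpu_temp_celsius bmc_fan_speed_rpm bmc_ambient_temp_celsius |
    preview_monitor_pattern '.*temp.*'
```

The list of names to feed into the function can be taken from the "metrics" entry of the cluster description in the Reference API.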
Visualization Dashboard
A visualization dashboard based on Grafana is available. Metrics can be displayed by job ID or by date, and graphs can be updated in real time with new values.
Dashboards are available at:
https://api.grid5000.fr/stable/sites/<site>/metrics/dashboard
- Grenoble: https://api.grid5000.fr/stable/sites/grenoble/metrics/dashboard
- Lille: https://api.grid5000.fr/stable/sites/lille/metrics/dashboard
- Luxembourg: https://api.grid5000.fr/stable/sites/luxembourg/metrics/dashboard
- Lyon: https://api.grid5000.fr/stable/sites/lyon/metrics/dashboard
- Nancy: https://api.grid5000.fr/stable/sites/nancy/metrics/dashboard
- Nantes: https://api.grid5000.fr/stable/sites/nantes/metrics/dashboard
- Rennes: https://api.grid5000.fr/stable/sites/rennes/metrics/dashboard
- Sophia: https://api.grid5000.fr/stable/sites/sophia/metrics/dashboard
- Strasbourg: https://api.grid5000.fr/stable/sites/strasbourg/metrics/dashboard
- Toulouse: https://api.grid5000.fr/stable/sites/toulouse/metrics/dashboard
Notes:
- If the dashboard's time frame is longer than 30 minutes, some "summarized" values (averaged over 5 minutes) will be displayed instead of the actual values.
- Metrics whose name ends with "_total" are displayed as a per-second rate of change.
- When a job number is entered, the dashboard's displayed time frame may not be adjusted automatically to the job's start and end dates.
- The list of devices and metrics is retrieved from what's available at the end of the displayed time frame when the dashboard is loaded. If the device or metric you are looking for does not appear, be sure to adjust the time frame to a period where your device or metric exists and force refreshing the lists by reloading the web page from your browser.
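To obtain a similar per-second rate from raw "_total" counter values retrieved through the API, the difference between consecutive samples can be divided by the elapsed time. This is a minimal sketch and not necessarily how the dashboard computes it: the function name counter_rate and the two-column input format (Unix timestamp in seconds, then counter value) are assumptions made for the example.

```shell
# Hypothetical helper: turn counter samples into per-second rates.
# Input (stdin): one "<epoch_seconds> <counter_value>" pair per line, sorted by time.
# Output: "<epoch_seconds> <rate>" for every sample except the first.
counter_rate() {
    LC_ALL=C awk '
        NR > 1 && ($1 + 0) > prev_t {
            printf "%s %.3f\n", $1, ($2 - prev_v) / ($1 - prev_t)
        }
        { prev_t = $1 + 0; prev_v = $2 + 0 }
    '
}

# Three samples taken 10 seconds apart:
printf '%s\n' '0 100' '10 1100' '20 1600' | counter_rate
```

The sketch does not handle counter resets (for instance after a reboot), which would show up as negative rates.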
Pushing custom metrics
A simple mechanism is available to let you push your own, arbitrary, custom metrics to Kwollect (internally it uses Prometheus Node Exporter "Textfile Collector"). From a node, a custom metric will be collected by writing to a specific file (you currently need root privileges to do so):
$ echo 'kwollect_custom{_metric_id="my_metric"} 42' > /var/lib/prometheus/node-exporter/kwollect.prom
This will push a custom metric named my_metric and with value "42" (this file may contain several lines to push different values at a time).
The associated timestamp will be the time when Kwollect fetches metrics on the node (every 15 seconds in Grid'5000). You may override the timestamp by adding _timestamp="your_timestamp_in_unix_seconds". For instance:
$ echo 'kwollect_custom{_metric_id="my_metric", _timestamp="1606389005.1234"} 42' > /var/lib/prometheus/node-exporter/kwollect.prom
(You can also add your own labels. See Prometheus metrics naming and format for more information).
Kwollect will continuously pull values from the /var/lib/prometheus/node-exporter/kwollect.prom file. It is your responsibility to keep its content updated (for instance, only one value for a particular metric should be present in this file at any time).
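Since the file is read periodically, rewriting it in one step avoids exposing a half-written file. The sketch below writes all metrics to a temporary file and then renames it over the target. The function name push_custom_metrics and its name=value argument format are hypothetical; on a node, the target would be the kwollect.prom path given above (root privileges required).

```shell
# Hypothetical helper: replace the custom metrics file in one step.
# Usage: push_custom_metrics <target_file> <metric_id=value>...
push_custom_metrics() {
    target=$1; shift
    # Temporary file in the same directory, so that mv is a plain rename.
    tmp="$target.tmp.$$"
    for pair in "$@"; do
        # Split "name=value" into the metric identifier and its value.
        printf 'kwollect_custom{_metric_id="%s"} %s\n' "${pair%%=*}" "${pair#*=}"
    done > "$tmp" && mv "$tmp" "$target"
}

# Example (as root, on a node):
# push_custom_metrics /var/lib/prometheus/node-exporter/kwollect.prom my_metric=42 my_other_metric=3.5
```

Because the whole file is rewritten on each call, it always holds exactly one value per metric identifier passed to the function.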
Known problems
- Metrics from BMC are quite unreliable. They may be inaccurate, highly averaged, or unavailable on some nodes from time to time.
- Some metrics are not available on every node of a cluster: on the grcinq and hercule clusters, one node in every four has more metrics available than the others; on the parasilo and paravance clusters, the measurement of power consumption by PDUs is only available on some nodes; wattmetres are only available on the gros-[41-76] nodes. Such metrics have an "only_for" entry in their description indicating the nodes where they are available.
- A few values of a metric may not be collected quickly enough to comply with the interval described in the Reference API (for instance, when the targeted device is overloaded).
- Electrical consumption reported by PDU is not always reliable, see Power_Monitoring_Devices#measurement_artifacts_and_pitfalls
Metrics available in Grid'5000
Metrics marked with * must be activated on demand, and metrics marked with ** are activated on non-deploy jobs by default. Clusters marked with ⁺ do not have the metric available on all their nodes.
Metric Name | Description | Available on |
---|---|---|
bmc_ambient_temp_celsius* | Ambient temperature reported by BMC, in celsius | grenoble: dahu, drac, servan, troll, yeti lille: chiclet, chifflot, chirop, chuc luxembourg: petitprince lyon: gemini, neowise, nova, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grat, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt nantes: econome, ecotype rennes: abacus16, abacus25, paradoxe, parasilo, roazhon4 strasbourg: fleckenstein toulouse: montcalm |
bmc_cpu_power_watt* | Power XXX reported by BMC, in watt | grenoble: drac lyon: pyxis, sirius toulouse: montcalm |
bmc_cpu_temp_celsius* | Temperature of CXXX reported by BMC, in celsius | grenoble: dahu, drac, servan, troll, yeti lille: chiclet, chifflot, chirop, chuc luxembourg: petitprince lyon: gemini, neowise, nova, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grat, grele, gres, gros, grosminet, grouille, grue, gruss, grvingt nantes: econome, ecotype rennes: abacus16, abacus25, paradoxe, parasilo, roazhon4 strasbourg: fleckenstein toulouse: montcalm |
bmc_cpu_usage_percent* | Usage of CPU reported by BMC, in percent | toulouse: montcalm |
bmc_dimm_power_watt | Power of DIMM reported by BMC, in watt | toulouse: montcalm |
bmc_dimm_temp_celsius* | Temperature of XXX reported by BMC, in celsius | grenoble: drac lille: chirop, chuc lyon: gemini, neowise, pyxis nancy: grat, gres, grosminet rennes: abacus25, paradoxe strasbourg: fleckenstein toulouse: montcalm |
bmc_exhaust_temp_celsius* | Temperature XXX reported by BMC, in celsius | grenoble: servan, troll, yeti lille: chiclet, chifflot lyon: orion, taurus nancy: grappe, grele, gros, grostiti, grouille, grue, gruss nantes: ecotype rennes: abacus16, parasilo, roazhon4 |
bmc_fan_power_watt* | Power consumption of Fan reported by BMC, in watt | grenoble: drac |
bmc_fan_speed_rpm* | Speed of XXX reported by BMC, in rpm | grenoble: dahu, drac, servan, troll, yeti lille: chiclet, chifflot lyon: gemini, neowise, nova, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grele, gros, grouille, grue, gruss, grvingt nantes: econome, ecotype rennes: abacus16, parasilo, roazhon4 |
bmc_fan_usage_percent* | Usage of Fan XXX reported by BMC, in percent | lille: chirop, chuc nancy: grat, gres rennes: abacus25, paradoxe toulouse: montcalm |
bmc_gpu_power_watt | Power consumption of GPU XXX by BMC, in watt | grenoble: drac lyon: gemini, sirius |
bmc_gpu_temp_celsius* | Temperature of GXXX reported by BMC, in celsius | grenoble: drac lille: chifflot lyon: gemini, neowise, sirius nancy: grat, grouille, grue, gruss rennes: abacus25 |
bmc_mem_power_watt* | Power consumption of Mem ProcXXX reported by BMC, in watt | grenoble: drac |
bmc_node_power_watt* | Power XXX reported by BMC, in watt | grenoble: dahu, drac, servan, troll, yeti lille: chiclet, chifflot, chirop luxembourg: petitprince lyon: gemini, neowise, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grostiti, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, abacus25, paradoxe, parasilo, roazhon4 strasbourg: fleckenstein toulouse: montcalm |
bmc_node_power_watthour_total* | Cumulated power consumption of node reported by BMC, in watt-hour | grenoble: dahu, servan, troll, yeti lille: chiclet, chifflot luxembourg: petitprince lyon: orion, taurus nancy: graffiti, grappe, grele, gros, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, parasilo, roazhon4 |
bmc_other_airflow_cfm | Airflow reported by BMC, in CFM | lyon: gemini |
bmc_other_current_amp* | Current of XXX reported by BMC, in amp | grenoble: drac lille: chirop, chuc lyon: neowise, pyxis nancy: grat, grdix, gres, grostiti nantes: econome rennes: abacus25, paradoxe |
bmc_other_power_watt* | Power consumption of XXX reported by BMC, in watt | grenoble: drac lille: chirop, chuc lyon: gemini, sirius nancy: grat, grdix, gres rennes: abacus25, paradoxe |
bmc_other_speed_rpm* | Speed of FanXXX reported by BMC, in rpm | nancy: grostiti |
bmc_other_temp_celsius* | Temperature of XXX reported by BMC, in celsius | grenoble: drac lille: chirop, chuc lyon: gemini, neowise, pyxis, sirius nancy: grat, grdix, gres, grosminet, grostiti nantes: econome rennes: abacus25, paradoxe strasbourg: fleckenstein toulouse: montcalm |
bmc_other_usage_percent* | Usage of CPU Utilization reported by BMC, in percent | lille: chuc |
bmc_other_voltage_volt* | Voltage of XXX reported by BMC, in volt | grenoble: drac lille: chirop, chuc lyon: gemini, neowise, pyxis, sirius nancy: grat, grdix, gres, grostiti nantes: econome rennes: abacus25, paradoxe |
bmc_psu_current_amp* | Current of PSU XXX reported by BMC, in amp | grenoble: dahu, servan, troll, yeti lille: chiclet, chifflot lyon: orion, taurus nancy: graffiti, grappe, grele, gros, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, parasilo, roazhon4 |
bmc_psu_power_watt* | Power XXX reported by BMC, in watt | lille: chirop, chuc lyon: gemini, sirius nancy: grat, grdix, gres rennes: abacus25, paradoxe toulouse: montcalm |
bmc_psu_temp_celsius* | Temperature of PSU XXX reported by BMC, in celsius | lyon: sirius toulouse: montcalm |
bmc_psu_voltage_volt* | Voltage of PSU XXX reported by BMC, in volt | grenoble: dahu, servan, troll, yeti lille: chiclet, chifflot lyon: orion, taurus nancy: graffiti, grappe, grele, gros, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, parasilo, roazhon4 |
network_ifacein_bytes_total | Input byte counter for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifacein_packets_discard_total | Input counter of discarded packets for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifacein_packets_error_total | Input counter of packet errors for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifacein_packets_total | Input packet counter for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifaceout_bytes_total | Output byte counter for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifaceout_packets_discard_total | Output counter of discarded packets for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifaceout_packets_error_total | Output counter of packet errors for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
network_ifaceout_packets_total | Output packet counter for the network device port | grenoble: dahu, drac, servan, troll, yeti, gw lille: chiclet, chifflot, sw-chiclet-1, sw-chiclet-2, sw-chiclet-3 luxembourg: petitprince, gw, gw-kirchberg, mxl1, mxl2, sw-b04, sw-b09, ul-grid5000-sw02 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt, gw, gw-next, sgrappe, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotype, ecotype-prod1, ecotype-prod2, gw rennes: parasilo, parasilo-sw-1 sophia: uvb, gw |
pdu_outlet_power_watt | Power consumption XXX reported by PDU, in watt | lille: pdu-b1p1, pdu-b1p2, pdu-b1p3, pdu-b1p3-2, pdu-b2p1, pdu-b2p2 nancy: gros, grimani-pdu1, grimani-pdu2, grimoire-pdu1, grimoire-pdu2, gros-pdu1, gros-pdu2, gros-pdu3, gros-pdu4, gros-pdu5, gros-pdu6, gros-pdu7, gros-pdu8 rennes: parasilo⁺, parasilo-pdu-2, parasilo-pdu-3, parasilo-pdu-4 |
prom_all_metrics* | All metrics from Prometheus Node Exporter | grenoble: dahu, drac, servan, troll, yeti lille: chiclet, chifflot, chirop, chuc luxembourg: petitprince lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt nantes: econome, ecotype rennes: abacus1, abacus10, abacus11, abacus12, abacus14, abacus16, abacus17, abacus18, abacus19, abacus2, abacus20, abacus21, abacus22, abacus25, abacus3, abacus4, abacus5, abacus8, abacus9, paradoxe, parasilo, roazhon1, roazhon10, roazhon11, roazhon12, roazhon13, roazhon2, roazhon3, roazhon4, roazhon5, roazhon6, roazhon7, roazhon8, roazhon9 sophia: uvb strasbourg: fleckenstein toulouse: montcalm |
prom_default_metrics** | Default subset of metrics from Prometheus Node Exporter: kwollect_custom, node_boot_time_seconds, node_cpu_scaling_frequency_hertz, node_cpu_seconds_total, node_filesystem_free_bytes, node_filesystem_size_bytes, node_load1, node_load15, node_load5, node_memory_Buffers_bytes, node_memory_Cached_bytes, node_memory_MemAvailable_bytes, node_memory_MemFree_bytes, node_memory_MemTotal_bytes, node_memory_Shmem_bytes, node_memory_SwapFree_bytes, node_memory_SwapTotal_bytes, node_network_receive_bytes_total, node_network_receive_packets_total, node_network_transmit_bytes_total, node_network_transmit_packets_total, node_procs_blocked, node_procs_running | grenoble: dahu, drac, servan, troll, yeti lille: chiclet, chifflot, chirop, chuc luxembourg: petitprince lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus nancy: graffiti, grappe, grat, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt nantes: econome, ecotype rennes: abacus1, abacus10, abacus11, abacus12, abacus14, abacus16, abacus17, abacus18, abacus19, abacus2, abacus20, abacus21, abacus22, abacus25, abacus3, abacus4, abacus5, abacus8, abacus9, paradoxe, parasilo, roazhon1, roazhon10, roazhon11, roazhon12, roazhon13, roazhon2, roazhon3, roazhon4, roazhon5, roazhon6, roazhon7, roazhon8, roazhon9 sophia: uvb strasbourg: fleckenstein toulouse: montcalm |
prom_nvgpu_all_metrics* | All metrics from Prometheus Nvidia DCGM Exporter | grenoble: drac lille: chifflot, chuc lyon: gemini, sirius nancy: graffiti, grat, grele, gres, grouille, grue, gruss rennes: abacus16, abacus25 |
prom_nvgpu_default_metrics** | Default subset of metrics from Prometheus Nvidia DCGM Exporter: DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEM_CLOCK, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_SM_CLOCK | grenoble: drac lille: chifflot, chuc lyon: gemini, sirius nancy: graffiti, grat, grele, gres, grouille, grue, gruss rennes: abacus16, abacus25 |
wattmetre_power_watt | Power consumption XXX reported by wattmetre, in watt | grenoble: servan, troll, yeti, wattmetre1, wattmetre2 lyon: gemini, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, wattmetre1, wattmetrev3-1, wattmetrev3-2 nancy: gros⁺, gros-wattmetre2 rennes: paradoxe, wattmetrev3-1 |