Latest revision as of 10:13, 26 August 2025

	Note
	This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This page describes the monitoring service available in Grid’5000 based on Kwollect to retrieve environmental and performance metrics from nodes.

The service currently provides metrics for:

Energy consumption from dedicated “wattmetre” devices (currently available for some clusters in Lyon, Grenoble, Nancy)
Metrics collected from nodes’ Baseboard Management Controller ("BMC": out-of-band management hardware, such as Dell iDRAC), including ambient temperature, hardware component temperature, energy consumption from PSU, fan speed, etc.
Traffic collected from network devices
Energy consumption from PDU, when available
Node metrics from Prometheus node exporter (and Nvidia DCGM exporter when GPU is available)

Metrics available

The list of metrics available for a given Grid’5000 cluster is described in the Reference API, under the “metrics” entry of the cluster description. For instance, to get the list of available metrics for nodes of the taurus cluster, you can use the following command (API requests must be performed from inside Grid'5000; otherwise, authentication credentials must be supplied):

$ curl https://api.grid5000.fr/stable/sites/lyon/clusters/taurus | jq .metrics

This returns a list where each element describes a single metric. The most important fields of that description are:

name: The name identifying the metric
description: A human-readable description of the metric
labels (optional): A “label” that will be used to distinguish two metrics of the same kind, but targeting different objects (e.g. temperature of CPU #1 vs temperature of CPU #2)
period and optional_period: The interval (in milliseconds) at which the metric is collected. The former is the default interval; the latter is the interval used when the metric is activated “on demand” (see below). A metric with a period value of 0 is not collected by default.
only_for (optional): When the metric is not available on all nodes of the cluster, the list of nodes where it is available

The full list of metrics is available at the end of this page.
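
As a minimal sketch, the metrics of a cluster that are not collected by default (those whose period is 0) can be listed by filtering the cluster description with jq. The function name below is hypothetical, and the filter assumes that each entry of the “metrics” list carries the name and period fields described above.

```shell
# Hypothetical helper: print the names of the metrics of a cluster
# that are not collected by default (period equal to 0).
# Must be run from inside Grid'5000, or with authentication credentials.
list_ondemand_metrics() {
    local site="$1" cluster="$2"
    curl -s "https://api.grid5000.fr/stable/sites/${site}/clusters/${cluster}" \
        | jq -r '.metrics[] | select(.period == 0) | .name'
}

# Usage (example): list_ondemand_metrics lyon taurus
```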

Getting metrics values

The metric values are stored by Kwollect and are available through the Metrology API by performing a GET request with the appropriate parameters at the URL:

https://api.grid5000.fr/stable/sites/<site>/metrics

For instance, to:

get all metrics collected inside job 1304978 at Lyon:

curl 'https://api.grid5000.fr/stable/sites/lyon/metrics?job_id=1304978'

get all metrics from chifflot-5 and chifflot-6 between 2021-06-08 15:00 and 2021-06-08 17:00:

curl 'https://api.grid5000.fr/stable/sites/lille/metrics?nodes=chifflot-5,chifflot-6&start_time=2021-06-08T15:00&end_time=2021-06-08T17:00'

get all values from Wattmetre for taurus-10 during the last 15 minutes:

curl "https://api.grid5000.fr/stable/sites/lyon/metrics?nodes=taurus-10&metrics=wattmetre_power_watt&start_time=$(date -d '15 min ago' +%s)"

The request will return a JSON-formatted list of metric values and their associated timestamps.

For a complete description of parameters and returned fields, see the API specification at: https://api.grid5000.fr/doc/stable/#tag/metrics

Important note: To avoid overloading Kwollect servers, the request size is limited. When too many metric values are requested, you may hit that limit and receive an error message. In that case, try to reduce the size of your request by selecting fewer nodes or a shorter time period.
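
One way to stay under that limit is to split a long period into several shorter requests. The sketch below only prints the request URLs, one per time window. The function name and the window length are illustrative, and it assumes that end_time accepts a Unix timestamp in the same way as start_time does in the example above.

```shell
# Hypothetical helper: print one request URL per time window.
# Arguments: site, comma-separated nodes, start and end (Unix timestamps),
# and the maximum window length in seconds (must be positive).
metrics_urls_by_window() {
    local site="$1" nodes="$2" start="$3" end="$4" step="$5"
    local t="$start" next
    [ "$step" -gt 0 ] || return 1
    while [ "$t" -lt "$end" ]; do
        next=$((t + step))
        # The last window is shortened so that it stops at the requested end.
        if [ "$next" -gt "$end" ]; then
            next="$end"
        fi
        printf 'https://api.grid5000.fr/stable/sites/%s/metrics?nodes=%s&start_time=%s&end_time=%s\n' \
            "$site" "$nodes" "$t" "$next"
        t="$next"
    done
}

# Usage (example): fetch the last hour for taurus-10 in 10-minute requests.
# now=$(date +%s)
# metrics_urls_by_window lyon taurus-10 "$((now - 3600))" "$now" 600 |
#     while read -r url; do curl -s "$url"; done
```

Each request returns its own JSON list, so the results have to be concatenated afterwards if a single data set is needed.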

Tip: If you need metrics formatted as CSV, you can use this command line:

curl <URL_to_request_some_metrics> | jq -r '.[] | [.timestamp, .device_id, .metric_id, .value, (.labels|tostring)] | @csv'

On-demand metrics

Some metrics are not collected by default and must be activated “on-demand”. These metrics have a period field equal to 0 in their description (see above).

“On-demand” metrics can be enabled for specific jobs by adding the “-t monitor=<metric_to_enable>” option to oarsub. For example:

$ oarsub -I -t monitor=bmc_cpu_temp_celsius

The provided argument can be a regular expression. For instance, to enable all metrics related to temperature:

$ oarsub -I -t monitor='.*temp.*'

To enable all available “on-demand” metrics, use the -t monitor option alone (without a metric name).
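
Before submitting a job, it can help to check locally which metric names an expression would select. The sketch below uses grep for that. It assumes that the expression has to match the whole metric name (which the '.*temp.*' example suggests but this page does not state) and that POSIX extended regular expressions behave closely enough to the syntax accepted by the monitor option. The function name is hypothetical.

```shell
# Hypothetical helper: among metric names read on standard input,
# print those matched entirely by the given extended regular expression.
preview_monitor_regex() {
    grep -E -x -- "$1"
}

# Usage (example):
# curl -s https://api.grid5000.fr/stable/sites/lyon/clusters/taurus |
#     jq -r '.metrics[].name' | preview_monitor_regex '.*temp.*'
```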

Note:

Dedicated wattmetre devices (metric “wattmetre_power_watt”) can perform one measurement every 20 milliseconds. However, this high frequency is only provided through “on-demand” activation. For instance, submit your job using:

$ oarsub -I -t monitor='wattmetre_power_watt'

By default, only the value averaged over one second is provided.

As Prometheus metrics depend on node characteristics, they cannot be fully described. Only a subset of Prometheus metrics is collected by default (described in the API by prom_default_metrics and, when relevant, prom_nvgpu_default_metrics). To enable collection of all Prometheus metrics, use "on-demand" activation on prom_all_metrics or prom_nvgpu_all_metrics (for instance, use monitor='prom_.*').

Monitoring of the default set of Prometheus metrics (prom_default_metrics) is enabled for jobs running the standard environment. For deployed nodes, Prometheus monitoring must be activated "on-demand".

Visualization Dashboard

A visualization dashboard based on Grafana is available. Metrics can be displayed by job ID or by date, and graphs can be updated in real time with new values.

Dashboards are available at:

https://api.grid5000.fr/stable/sites/<site>/metrics/dashboard

Grenoble: https://api.grid5000.fr/stable/sites/grenoble/metrics/dashboard
Lille: https://api.grid5000.fr/stable/sites/lille/metrics/dashboard
Luxembourg: https://api.grid5000.fr/stable/sites/luxembourg/metrics/dashboard
Lyon: https://api.grid5000.fr/stable/sites/lyon/metrics/dashboard
Nancy: https://api.grid5000.fr/stable/sites/nancy/metrics/dashboard
Nantes: https://api.grid5000.fr/stable/sites/nantes/metrics/dashboard
Rennes: https://api.grid5000.fr/stable/sites/rennes/metrics/dashboard
Sophia: https://api.grid5000.fr/stable/sites/sophia/metrics/dashboard
Strasbourg: https://api.grid5000.fr/stable/sites/strasbourg/metrics/dashboard
Toulouse: https://api.grid5000.fr/stable/sites/toulouse/metrics/dashboard

Notes:

If the dashboard's time frame is longer than 30 minutes, "summarized" values (averaged over 5 minutes) will be displayed instead of the actual values.
Metrics whose names end with "_total" are displayed as a per-second rate of change.
When a job number is entered, the dashboard's displayed time frame may not be adjusted automatically to the job's start and end dates.
The list of devices and metrics is retrieved from what is available at the end of the displayed time frame when the dashboard is loaded. If the device or metric you are looking for does not appear, adjust the time frame to a period where your device or metric exists and force a refresh of the lists by reloading the web page in your browser.
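
For reference, a per-second rate can be derived from two successive samples of a counter as the difference in value divided by the difference in time. The awk sketch below illustrates the idea on "timestamp value" pairs with timestamps in seconds. It is not the dashboard's actual implementation, the function name and input format are assumptions made for the example, and counter resets are not handled.

```shell
# Hypothetical helper: read "timestamp_in_seconds counter_value" lines,
# sorted by time, and print "timestamp rate_per_second" for each interval.
counter_rate() {
    LC_ALL=C awk '
        # From the second sample on, divide the value change by the elapsed time.
        NR > 1 && $1 > prev_t { printf "%s %.2f\n", $1, ($2 - prev_v) / ($1 - prev_t) }
        # Remember the current sample for the next line.
        { prev_t = $1; prev_v = $2 }'
}

# Usage (example): printf '%s\n' '0 100' '10 600' | counter_rate
```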

Pushing custom metrics

A simple mechanism is available to let you push your own arbitrary custom metrics. To push metrics collected inside a node to Kwollect, a POST request can be sent to the following API endpoint:

https://api.grid5000.fr/stable/sites/SITE/metrics

The request must include the metric to be inserted, formatted as JSON, for example:

{"metric_id": "METRIC_NAME", "value": VALUE}

Optionally, a "timestamp" value can be provided (otherwise, the current time will be used as the metric's timestamp). The "device_id" field can also be provided, if it corresponds to a node reserved by the user making the request; otherwise, the node from which the request originates will be used.
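
As an illustration, the request could be sent with curl from a reserved node as sketched below. The function names, the metric name my_metric and the Content-Type header are assumptions made for this example. The payload builder does not escape special characters, so it only suits simple metric names and numeric values.

```shell
# Hypothetical helper: build the JSON payload for one custom metric.
custom_metric_payload() {
    printf '{"metric_id": "%s", "value": %s}' "$1" "$2"
}

# Hypothetical helper: push one value to the metrics endpoint of a site.
# curl sends a POST request when -d is used; -f turns HTTP errors into a
# non-zero exit status.
push_custom_metric() {
    local site="$1" metric="$2" value="$3"
    curl -sf -H 'Content-Type: application/json' \
        -d "$(custom_metric_payload "$metric" "$value")" \
        "https://api.grid5000.fr/stable/sites/${site}/metrics"
}

# Usage (example, from a node at Lyon): push_custom_metric lyon my_metric 42
```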

Known problems

Comparison of the power consumption reported by a BMC vs. a wattmeter.

Metrics from BMC are quite unreliable. They may be inaccurate, highly averaged, or unavailable on some nodes from time to time.
Some metrics are not available for all nodes of a cluster: for the grcinq and hercule clusters, one node in four has more metrics available than the others; on the parasilo and paravance clusters, the measurement of power consumption by PDUs is only available on some nodes; wattmeters are only available on the gros-[41-76] nodes. Such metrics have an "only_for" entry in their description indicating the nodes where they are available.
It may happen that a few values of a metric are not collected quickly enough to comply with the interval described in the Reference API (for instance, when the targeted device is overloaded).
Electrical consumption reported by PDU is not always reliable (see Power_Monitoring_Devices#measurement_artifacts_and_pitfalls).

Metrics available in Grid'5000

Metrics marked with * must be activated on demand, and metrics marked with ** are activated by default on non-deploy jobs. Clusters marked with ⁺ do not have the metric available on all of their nodes.

Metric Name	Description	Available on
bmc_ambient_temp_celsius*	Ambient temperature reported by BMC, in celsius	grenoble: chartreuse2, chartreuse3, dahu, drac, kinovis, servan, troll, yeti lille: chiclet, chicoree, chifflot, chirop, chuc luxembourg: clervaux, larochette, vianden lyon: gemini, neowise, nova, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grat, gratouille, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt nantes: econome, ecotaxe, ecotype rennes: abacus16, abacus25, abacus28, paradoxe, parasilo, roazhon4 sophia: musa strasbourg: fleckenstein toulouse: montcalm
bmc_cpu_power_watt*	Power consumption of CPU XXX reported by BMC, in watt	grenoble: chartreuse2, drac lyon: pyxis, sirius toulouse: montcalm
bmc_cpu_temp_celsius*	Temperature of CPU XXX reported by BMC, in celsius	grenoble: dahu, drac, kinovis, servan, troll, yeti lille: chiclet, chicoree, chifflot, chirop, chuc lyon: gemini, neowise, nova, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grat, gratouille, grele, gres, gros, grosminet, grouille, grue, gruss, grvingt nantes: econome, ecotaxe, ecotype rennes: abacus16, abacus25, abacus28, paradoxe, parasilo, roazhon4 sophia: musa strasbourg: fleckenstein toulouse: montcalm
bmc_cpu_usage_percent*	Usage of CPU reported by BMC, in percent	toulouse: montcalm
bmc_dimm_power_watt	Power of DIMM reported by BMC, in watt	toulouse: montcalm
bmc_dimm_temp_celsius*	Temperature of XXX reported by BMC, in celsius	grenoble: drac, kinovis lille: chicoree, chirop, chuc lyon: gemini, neowise, pyxis nancy: grat, gres, grosminet nantes: ecotaxe rennes: abacus25, paradoxe sophia: musa strasbourg: fleckenstein toulouse: montcalm
bmc_exhaust_temp_celsius*	Temperature XXX reported by BMC, in celsius	grenoble: chartreuse3, servan, troll, yeti lille: chiclet, chifflot luxembourg: larochette, vianden lyon: orion, taurus nancy: grappe, gratouille, grele, gros, grostiti, grouille, grue, gruss nantes: ecotype rennes: abacus16, abacus28, parasilo, roazhon4
bmc_fan_power_watt*	Power consumption of Fan reported by BMC, in watt	grenoble: drac
bmc_fan_speed_rpm*	Speed of XXX reported by BMC, in rpm	grenoble: chartreuse2, chartreuse3, dahu, drac, servan, troll, yeti lille: chiclet, chifflot luxembourg: clervaux lyon: gemini, neowise, nova, orion, pyxis, sirius, taurus nancy: graffiti, grappe, gratouille, grele, gros, grouille, grue, gruss, grvingt nantes: econome, ecotype rennes: abacus16, abacus28, parasilo, roazhon4
bmc_fan_usage_percent*	Usage of Fan XXX reported by BMC, in percent	grenoble: kinovis lille: chirop, chuc nancy: grat, gres nantes: ecotaxe rennes: abacus25, paradoxe sophia: musa toulouse: montcalm
bmc_gpu_power_watt	Power consumption of GPU XXX reported by BMC, in watt	grenoble: drac lyon: gemini, sirius
bmc_gpu_temp_celsius*	Temperature of GPU XXX reported by BMC, in celsius	grenoble: drac lille: chifflot lyon: gemini, neowise, sirius nancy: grat, grouille, grue, gruss rennes: abacus25
bmc_mem_power_watt*	Power consumption of Mem ProcXXX reported by BMC, in watt	grenoble: drac
bmc_node_power_watt*	Power XXX reported by BMC, in watt	grenoble: chartreuse2, chartreuse3, dahu, drac, kinovis, servan, troll, yeti lille: chiclet, chifflot, chirop luxembourg: clervaux, larochette, vianden lyon: gemini, neowise, orion, pyxis, sirius, taurus nancy: graffiti, grappe, grat, gratouille, grdix, grele, gres, gros, grostiti, grouille, grue, gruss, grvingt nantes: ecotaxe, ecotype rennes: abacus16, abacus25, abacus28, paradoxe, parasilo, roazhon4 sophia: musa strasbourg: fleckenstein toulouse: montcalm
bmc_node_power_watthour_total*	Cumulative power consumption of node reported by BMC, in watt-hours	grenoble: dahu, servan, troll, yeti lille: chiclet, chifflot lyon: orion, taurus nancy: graffiti, grappe, gratouille, grele, gros, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, abacus28, parasilo, roazhon4
bmc_other_airflow_cfm	Airflow reported by BMC, in CFM	lyon: gemini
bmc_other_current_amp*	Current of XXX reported by BMC, in amp	grenoble: chartreuse2, chartreuse3, drac, kinovis lille: chirop, chuc luxembourg: clervaux, larochette, vianden lyon: neowise, pyxis nancy: grat, grdix, gres, grostiti nantes: econome, ecotaxe rennes: abacus25, paradoxe sophia: musa
bmc_other_power_watt*	Power consumption of XXX reported by BMC, in watt	grenoble: chartreuse2, chartreuse3, drac, kinovis lille: chirop, chuc louvain: spirou lyon: gemini, sirius nancy: grat, grdix, gres nantes: ecotaxe rennes: abacus25, paradoxe sophia: musa
bmc_other_speed_rpm*	Speed of Fan XXX reported by BMC, in rpm	louvain: spirou luxembourg: larochette, vianden nancy: grostiti
bmc_other_temp_celsius*	Temperature of XXX reported by BMC, in celsius	grenoble: chartreuse2, chartreuse3, drac, kinovis lille: chicoree, chirop, chuc louvain: spirou luxembourg: clervaux, larochette, vianden lyon: gemini, hydra, neowise, pyxis, sirius nancy: grat, grdix, gres, grosminet, grostiti nantes: econome, ecotaxe rennes: abacus25, paradoxe sophia: esterel27, musa strasbourg: fleckenstein toulouse: montcalm
bmc_other_usage_percent*	Usage of XXX reported by BMC, in percent	grenoble: chartreuse2, chartreuse3 lille: chuc luxembourg: clervaux, larochette, vianden lyon: hydra
bmc_other_voltage_volt*	Voltage of XXX reported by BMC, in volt	grenoble: chartreuse2, chartreuse3, drac, kinovis lille: chirop, chuc louvain: spirou luxembourg: clervaux, larochette, vianden lyon: gemini, hydra, neowise, pyxis, sirius nancy: grat, grdix, gres, grostiti nantes: econome, ecotaxe rennes: abacus25, paradoxe sophia: esterel27, musa
bmc_psu_current_amp*	Current of PSU XXX reported by BMC, in amp	grenoble: dahu, servan, troll, yeti lille: chiclet, chifflot lyon: orion, taurus nancy: graffiti, grappe, gratouille, grele, gros, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, abacus28, parasilo, roazhon4
bmc_psu_power_watt*	Power XXX reported by BMC, in watt	grenoble: kinovis lille: chirop, chuc lyon: gemini, sirius nancy: grat, grdix, gres nantes: ecotaxe rennes: abacus25, paradoxe sophia: musa toulouse: montcalm
bmc_psu_temp_celsius*	Temperature of PSU XXX reported by BMC, in celsius	lyon: hydra, sirius toulouse: montcalm
bmc_psu_voltage_volt*	Voltage of PSU XXX reported by BMC, in volt	grenoble: dahu, servan, troll, yeti lille: chiclet, chifflot lyon: orion, taurus nancy: graffiti, grappe, gratouille, grele, gros, grouille, grue, gruss, grvingt nantes: ecotype rennes: abacus16, abacus28, parasilo, roazhon4
network_ifacein_bytes_total	Input byte counter for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifacein_packets_discard_total	Input counter of discarded packets for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifacein_packets_error_total	Input counter of packet errors for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifacein_packets_total	Input packet counter for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifaceout_bytes_total	Output byte counter for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifaceout_packets_discard_total	Output counter of discarded packets for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifaceout_packets_error_total	Output counter of packet errors for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
network_ifaceout_packets_total	Output packet counter for the network device port	bordeaux: gw grenoble: dahu, drac, sasquatch, servan, troll, yeti, gw lille: chiclet, chifflot, gw, sw-chiclet-1 luxembourg: clervaux, larochette, vianden, gw, sw-b04, sw-b09 lyon: gemini, hercule, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, force10, gw, salome nancy: grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, gruss, grvingt, gw, sgrappe, sgravillon2, sgrdix, sgros1, sgros2, sgruss, sgrvingt nantes: econome, ecotaxe, ecotype, econome-prod, ecot-prod, ecotype-prod2, gw rennes: abacus29, parasilo, roazhon15, parasilo-sw-1 sophia: esterel31, esterel38, mercantour3, sw-4, sw-7
pdu_group_voltage_volt*	Voltage per group reported by PDU, in volt	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_current_amp*	Current XXX reported by PDU, in ampere	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_currentcrestfactor*	Current crest factor XXX reported by PDU, as a ratio	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_currentload_percentage*	Current load percent XXX reported by PDU, as a percentage	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_energy_watthours_total	Energy consumption XXX reported by PDU, in watt.hours	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_energyreset_timestamp*	Last reset time of energy consumption counter XXX reported by PDU, in unix timestamp	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_power_var*	Reactive power consumption XXX reported by PDU, in volt-ampere reactive	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_power_voltamp*	Apparent power XXX reported by PDU, in volt-ampere	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_outlet_power_watt	Power consumption XXX reported by PDU, in watt	grenoble: pdu-kinovis2-1, pdu-kinovis2-2, pdu-kinovis2-3, pdu-kinovis2-4 lille: pdu-b1p1, pdu-b1p2, pdu-b1p3, pdu-b1p3-2, pdu-b2p1, pdu-b2p2 lyon: pyxis, sagittaire, pdu3a, pdu3b nancy: gros, grimoire-pdu1, grimoire-pdu2, gros-pdu1, gros-pdu2, gros-pdu3, gros-pdu4, gros-pdu5, gros-pdu6, gros-pdu7, gros-pdu8, gruss-pdu1, gruss-pdu2 nantes: pdu-Z1-10, pdu-Z1-11, pdu-Z1-20, pdu-Z1-21, pdu-Z1-40, pdu-Z1-41, pdu-Z1-50, pdu-Z1-51 rennes: parasilo⁺, parasilo-pdu-2, parasilo-pdu-3, parasilo-pdu-4 toulouse: epdu-d, epdu-g
pdu_outlet_powerfactor*	Power factor XXX reported by PDU, as a ratio	lyon: pyxis, sagittaire, pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_PowerCapacity*	Power capacity reported by PDU	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_current_amp*	Current per phase reported by PDU, in ampere	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_currentcrestfactor*	Current crest factor per phase reported by PDU, as a ratio	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_currentload_percentage*	Current load percent per phase reported by PDU, as a percentage	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_energy_watthours_total*	Energy consumption per phase reported by PDU, in watt.hours	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_energyreset_timestamp*	Last reset time of energy consumption counter per phase reported by PDU, in unix timestamp	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_power_var*	Reactive power consumption per phase reported by PDU, in volt-ampere reactive	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_power_voltamp*	Apparent power per phase reported by PDU, in volt-ampere	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_power_watt*	Power consumption per phase reported by PDU, in watt	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_powerfactor*	Power factor per phase reported by PDU, as a ratio	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_phase_voltage_volt*	Voltage per phase reported by PDU, in volt	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_temperature_celsius	Sensor temperature reading, in degrees Celsius	nantes: pdu-Z1-11, pdu-Z1-21, pdu-Z1-41, pdu-Z1-51
pdu_total_energy_watthours_total	Energy consumption in total reported by PDU, in watt.hours	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_total_energyreset_timestamp*	Last reset time of energy consumption counter in total reported by PDU, in unix timestamp	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_total_power_var*	Reactive power consumption in total reported by PDU, in volt-ampere reactive	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_total_power_voltamp*	Apparent power in total reported by PDU, in volt-ampere	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_total_power_watt	Power consumption in total reported by PDU, in watt	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
pdu_total_powerfactor*	Power factor in total reported by PDU, as a ratio	lyon: pdu3a, pdu3b toulouse: epdu-d, epdu-g
prom_all_metrics*	All metrics from Prometheus Node Exporter	grenoble: chartreuse2, chartreuse3, chartreuse4, chartreuse6, chartreuse7, dahu, drac, kinovis, sasquatch, servan, troll, vercors10, vercors11, vercors12, vercors13, vercors14, vercors16, vercors17, vercors18, vercors2, vercors3, vercors4, vercors5, vercors7, vercors8, vercors9, yeti lille: chiclet, chicoree, chifflot, chirop, chuc louvain: spirou luxembourg: clervaux, larochette, vianden lyon: gemini, hercule, hydra, neowise, nova, orion, pyxis, sagittaire, sirius, taurus nancy: graffiti, grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt nantes: econome, ecotaxe, ecotype rennes: abacus1, abacus10, abacus11, abacus12, abacus14, abacus16, abacus17, abacus18, abacus19, abacus2, abacus20, abacus21, abacus22, abacus25, abacus26, abacus27, abacus28, abacus29, abacus3, abacus4, abacus5, abacus8, abacus9, paradoxe, parasilo, roazhon1, roazhon10, roazhon11, roazhon12, roazhon13, roazhon14, roazhon15, roazhon2, roazhon3, roazhon4, roazhon5, roazhon6, roazhon7, roazhon8, roazhon9 sophia: esterel10, esterel11, esterel12, esterel13, esterel14, esterel15, esterel16, esterel17, esterel19, esterel2, esterel20, esterel21, esterel22, esterel23, esterel24, esterel25, esterel26, esterel27, esterel28, esterel29, esterel3, esterel30, esterel31, esterel32, esterel33, esterel34, esterel35, esterel36, esterel37, esterel38, esterel39, esterel4, esterel40, esterel41, esterel42, esterel43, esterel44, esterel5, esterel6, esterel7, esterel8, esterel9, mercantour1, mercantour2, mercantour3, mercantour4, mercantour5, mercantour6, mercantour7, musa, uvb strasbourg: engelbourg, fleckenstein, ramstein toulouse: estats, montcalm
prom_default_metrics**	Default subset of metrics from Prometheus Node Exporter: kwollect_custom, node_boot_time_seconds, node_cpu_scaling_frequency_hertz, node_cpu_seconds_total, node_filesystem_free_bytes, node_filesystem_size_bytes, node_load1, node_load15, node_load5, node_memory_Buffers_bytes, node_memory_Cached_bytes, node_memory_MemAvailable_bytes, node_memory_MemFree_bytes, node_memory_MemTotal_bytes, node_memory_Shmem_bytes, node_memory_SwapFree_bytes, node_memory_SwapTotal_bytes, node_network_receive_bytes_total, node_network_receive_packets_total, node_network_transmit_bytes_total, node_network_transmit_packets_total, node_procs_blocked, node_procs_running	grenoble: chartreuse2, chartreuse3, chartreuse4, chartreuse6, chartreuse7, dahu, drac, kinovis, sasquatch, servan, troll, vercors10, vercors11, vercors12, vercors13, vercors14, vercors16, vercors17, vercors18, vercors2, vercors3, vercors4, vercors5, vercors7, vercors8, vercors9, yeti lille: chiclet, chicoree, chifflot, chirop, chuc louvain: spirou luxembourg: clervaux, larochette, vianden lyon: gemini, hercule, hydra, neowise, nova, orion, pyxis, sagittaire, sirius, taurus nancy: graffiti, grappe, grat, gratouille, grdix, grele, gres, gros, grosminet, grostiti, grouille, grue, gruss, grvingt nantes: econome, ecotaxe, ecotype rennes: abacus1, abacus10, abacus11, abacus12, abacus14, abacus16, abacus17, abacus18, abacus19, abacus2, abacus20, abacus21, abacus22, abacus25, abacus26, abacus27, abacus28, abacus29, abacus3, abacus4, abacus5, abacus8, abacus9, paradoxe, parasilo, roazhon1, roazhon10, roazhon11, roazhon12, roazhon13, roazhon14, roazhon15, roazhon2, roazhon3, roazhon4, roazhon5, roazhon6, roazhon7, roazhon8, roazhon9 sophia: esterel10, esterel11, esterel12, esterel13, esterel14, esterel15, esterel16, esterel17, esterel19, esterel2, esterel20, esterel21, esterel22, esterel23, esterel24, esterel25, esterel26, esterel27, esterel28, esterel29, esterel3, esterel30, esterel31, esterel32, esterel33, esterel34, esterel35, esterel36, esterel37, esterel38, esterel39, esterel4, esterel40, esterel41, esterel42, esterel43, esterel44, esterel5, esterel6, esterel7, esterel8, esterel9, mercantour1, mercantour2, mercantour3, mercantour4, mercantour5, mercantour6, mercantour7, musa, uvb strasbourg: engelbourg, fleckenstein, ramstein toulouse: estats, montcalm
prom_nvgpu_all_metrics*	All metrics from Prometheus Nvidia DCGM Exporter	grenoble: drac, kinovis, vercors10, vercors11, vercors12, vercors13, vercors14, vercors16, vercors17, vercors18, vercors2, vercors3, vercors4, vercors5, vercors7, vercors8, vercors9 lille: chicoree, chifflot, chuc lyon: gemini, sirius nancy: graffiti, grat, gratouille, grele, gres, grouille, grue, gruss nantes: ecotaxe rennes: abacus1, abacus10, abacus11, abacus12, abacus14, abacus16, abacus17, abacus18, abacus19, abacus2, abacus20, abacus21, abacus22, abacus25, abacus26, abacus27, abacus28, abacus29, abacus3, abacus4, abacus5, abacus8, abacus9 sophia: esterel10, esterel11, esterel12, esterel13, esterel14, esterel15, esterel16, esterel17, esterel19, esterel2, esterel20, esterel21, esterel22, esterel23, esterel24, esterel25, esterel26, esterel27, esterel28, esterel29, esterel3, esterel30, esterel31, esterel32, esterel33, esterel34, esterel35, esterel36, esterel37, esterel38, esterel39, esterel4, esterel40, esterel41, esterel42, esterel43, esterel44, esterel5, esterel6, esterel7, esterel8, esterel9, musa
prom_nvgpu_default_metrics**	Default subset of metrics from Prometheus Nvidia DCGM Exporter: DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_SM_CLOCK, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	grenoble: drac, kinovis, vercors10, vercors11, vercors12, vercors13, vercors14, vercors16, vercors17, vercors18, vercors2, vercors3, vercors4, vercors5, vercors7, vercors8, vercors9 lille: chicoree, chifflot, chuc lyon: gemini, sirius nancy: graffiti, grat, gratouille, grele, gres, grouille, grue, gruss nantes: ecotaxe rennes: abacus1, abacus10, abacus11, abacus12, abacus14, abacus16, abacus17, abacus18, abacus19, abacus2, abacus20, abacus21, abacus22, abacus25, abacus26, abacus27, abacus28, abacus29, abacus3, abacus4, abacus5, abacus8, abacus9 sophia: esterel10, esterel11, esterel12, esterel13, esterel14, esterel15, esterel16, esterel17, esterel19, esterel2, esterel20, esterel21, esterel22, esterel23, esterel24, esterel25, esterel26, esterel27, esterel28, esterel29, esterel3, esterel30, esterel31, esterel32, esterel33, esterel34, esterel35, esterel36, esterel37, esterel38, esterel39, esterel4, esterel40, esterel41, esterel42, esterel43, esterel44, esterel5, esterel6, esterel7, esterel8, esterel9, musa
wattmetre_power_watt	Power consumption XXX reported by wattmetre, in watt	lille: chirop, wattmetrev3-1 lyon: gemini, hydra, neowise, nova, orion, pyxis, sagittaire, sirius, taurus, wattmetre1, wattmetrev3-1, wattmetrev3-2, wattmetrev3-hydra nancy: gros⁺, gros-wattmetre2 rennes: paradoxe, wattmetrev3-1, wattmetrev3-2 strasbourg: wattmetrev3-1

@@ Line 3: / Line 3: @@
 {{TutorialHeader}}
-This page describes the monitoring service available in Grid’5000 that uses [https://gitlab.inria.fr/grid5000/kwollect Kwollect] to retrieve environmental and performance metrics from nodes.
+This page describes the monitoring service available in Grid’5000 based on [https://gitlab.inria.fr/grid5000/kwollect Kwollect] to retrieve environmental and performance metrics from nodes.
 The service currently provides metrics for:
@@ Line 11: / Line 11: @@
 * Energy consumption from PDU, when available
 * Node metrics from Prometheus node exporter (and Nvidia DCGM exporter when GPU is available)
-{{Warning| text=Monitoring with Kwollect under Grid'5000 is still in beta phase. It uses "sid" branch of the API, while [[Monitoring_deployed_nodes|legacy monitoring API]] (based on Ganglia and Kwapi) still uses the "stable" branch. Kwollect is intended to replace the legacy system in the future}}
@@ Line 26: / Line 24: @@
 * ''labels'' (optional): A “label” that will be used to distinguish two metrics of the same kind, but targeting different objects (e.g. temperature of CPU #1 vs temperature of CPU #2)
 * ''period'' and ''optional_period'': The interval (in milliseconds) at which the metric is collected. The former is the default interval; the latter is the interval used when the metric is activated “on demand” (see below). A metric whose ''period'' value is 0 is not collected by default.
+* ''only_for'' (optional): When the metric is not available on all nodes of the cluster, the list of nodes where it is available
+The full list of metrics is available at the [[Monitoring_Using_Kwollect#Metrics_available_in_Grid.275000|end of this page]].
 = Getting metrics values =
-The metrics values are stored by Kwollect and available using the Metrology API (under “sid” version) by performing a GET request with appropriate parameters at URL:
+Metric values are stored by Kwollect and can be retrieved through the Metrology API by sending a GET request with the appropriate parameters to the URL:
-<pre>https://api.grid5000.fr/sid/sites/<site>/metrics</pre>
+<pre>https://api.grid5000.fr/stable/sites/<site>/metrics</pre>
 For instance, to:
-* get all metrics collected inside job 1157330 at Lyon:
+* get all metrics collected inside job 1304978 at Lyon:
-<pre>curl 'https://api.grid5000.fr/sid/sites/lyon/metrics?job_id=1157330'</pre>
+<pre>curl 'https://api.grid5000.fr/stable/sites/lyon/metrics?job_id=1304978'</pre>
-* get all metrics from chifflot-5 and chifflot-6 between 2020-06-08 15:00 and 2020-06-08 17:00:
+* get all metrics from chifflot-5 and chifflot-6 between 2021-06-08 15:00 and 2021-06-08 17:00:
-<pre>curl 'https://api.grid5000.fr/sid/sites/lille/metrics?nodes=chifflot-5,chifflot-6&start_time=2020-06-08T15:00&end_time=2020-06-08T17:00'</pre>
+<pre>curl 'https://api.grid5000.fr/stable/sites/lille/metrics?nodes=chifflot-5,chifflot-6&start_time=2021-06-08T15:00&end_time=2021-06-08T17:00'</pre>
-* get all values from Wattmetre for taurus-2, during last 15 minutes:
+* get all values from Wattmetre for taurus-10, during the last 15 minutes:
-<pre>curl "https://api.grid5000.fr/sid/sites/lyon/metrics?nodes=taurus-2&metrics=wattmetre_power_watt&start_time=$(date -d '15 min ago' +%s)"</pre>
+<pre>curl "https://api.grid5000.fr/stable/sites/lyon/metrics?nodes=taurus-10&metrics=wattmetre_power_watt&start_time=$(date -d '15 min ago' +%s)"</pre>
 The request will return a JSON-formatted list of metric values and their associated timestamps.
-For a complete description of parameters and returned fields, see the API specification at: https://api.grid5000.fr/doc/sid/#tag/metrics
+For a complete description of parameters and returned fields, see the API specification at: https://api.grid5000.fr/doc/stable/#tag/metrics
-'''Important note:''' To avoid overloading Kwollect servers, request duration is limited to 5 minutes and its size to 1GB. When too much metrics values are requested, you may hit that limit and receive an error message. In such case, try to reduce the size of your request by selecting less nodes or a shorter time period.
+'''Important note:''' To avoid overloading Kwollect servers, request size is limited. When too many metric values are requested, you may hit that limit and receive an error message. In that case, reduce the size of your request by selecting fewer nodes or a shorter time period.
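One way to stay under the size limit is to split a long time range into several shorter requests. The sketch below is only an illustration: <code>split_windows</code> is a hypothetical helper, not part of Kwollect, and the commented usage assumes that <code>end_time</code> accepts Unix seconds in the same way as <code>start_time</code> does in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one "start stop" pair (Unix seconds) per window,
# covering the range [START, END) in steps of STEP seconds.
# Usage: split_windows START END STEP
split_windows() {
    start=$1; end=$2; step=$3
    # A non-positive step would loop forever, so reject it.
    [ "$step" -gt 0 ] || return 1
    while [ "$start" -lt "$end" ]; do
        stop=$((start + step))
        # Clamp the last window to the end of the range.
        if [ "$stop" -gt "$end" ]; then stop=$end; fi
        echo "$start $stop"
        start=$stop
    done
}

# Possible usage (not executed here): fetch the last hour in 15-minute requests.
# Assumes end_time accepts Unix seconds, like start_time.
# split_windows "$(date -d '1 hour ago' +%s)" "$(date +%s)" 900 |
# while read -r s e; do
#     curl "https://api.grid5000.fr/stable/sites/lyon/metrics?nodes=taurus-10&start_time=$s&end_time=$e"
# done
```

Each request then covers a bounded period, so a window that still fails can be retried with a smaller step.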
 '''Tip''': If you need metrics formatted as CSV, you can use this command line:
-  curl <URL_to_request_some_metrics> | jq -r '.[] | [.timestamp, .device_id, .metric_id, .value] | @csv'
+  curl <URL_to_request_some_metrics> | jq -r '.[] | [.timestamp, .device_id, .metric_id, .value, (.labels|tostring)] | @csv'
 = On-demand metrics =
@@ Line 82: / Line 83: @@
 * As Prometheus metrics depend on node characteristics, they cannot be fully described. Only a subset of Prometheus metrics is collected by default (described in the API by ''prom_default_metrics'' and, when relevant, ''prom_nvgpu_default_metrics''). To enable collection of all Prometheus metrics, use "on-demand" activation on ''prom_all_metrics'' or ''prom_nvgpu_all_metrics'' (for instance, use <code>monitor='prom_.*'</code>)
+* Monitoring of the default set of Prometheus metrics (''prom_default_metrics'') is enabled for jobs running the standard environment. For deployed nodes, Prometheus monitoring must be activated "on-demand"
 = Visualization Dashboard =
-A vizualization dashboard based on grafana is available. Metrics can be displayed by job ID or by date and graphics can be updated in real time with new values.
+A visualization dashboard based on Grafana is available. Metrics can be displayed by job ID or by date, and graphs can be updated in real time with new values.
 Dashboards are available at:
-<pre>https://api.grid5000.fr/sid/sites/<site>/metrics/dashboard</pre>
+<pre>https://api.grid5000.fr/stable/sites/<site>/metrics/dashboard</pre>
-For instance, at Lyon: https://api.grid5000.fr/sid/sites/lyon/metrics/dashboard
+* '''Grenoble:''' https://api.grid5000.fr/stable/sites/grenoble/metrics/dashboard
+* '''Lille:''' https://api.grid5000.fr/stable/sites/lille/metrics/dashboard
+* '''Luxembourg:''' https://api.grid5000.fr/stable/sites/luxembourg/metrics/dashboard
+* '''Lyon:''' https://api.grid5000.fr/stable/sites/lyon/metrics/dashboard
+* '''Nancy:''' https://api.grid5000.fr/stable/sites/nancy/metrics/dashboard
+* '''Nantes:''' https://api.grid5000.fr/stable/sites/nantes/metrics/dashboard
+* '''Rennes:''' https://api.grid5000.fr/stable/sites/rennes/metrics/dashboard
+* '''Sophia:''' https://api.grid5000.fr/stable/sites/sophia/metrics/dashboard
+* '''Strasbourg:''' https://api.grid5000.fr/stable/sites/strasbourg/metrics/dashboard
+* '''Toulouse:''' https://api.grid5000.fr/stable/sites/toulouse/metrics/dashboard
+Notes:
+* If the dashboard's time frame is longer than 30 minutes, "summarized" values (averaged over 5 minutes) will be displayed instead of the actual values.
+* Metrics whose names end in "_total" are displayed as a per-second rate of change.
+* When a job number is entered, the dashboard's displayed time frame may not be adjusted automatically to the job's start and end dates.
+* The list of devices and metrics is retrieved from what is available at the end of the displayed time frame when the dashboard is loaded. If the device or metric you are looking for does not appear, adjust the time frame to a period where it exists and force a refresh of the lists by reloading the web page in your browser.
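When working with raw values fetched from the API, a similar per-second view of a "_total" counter can be approximated by hand. The sketch below assumes the rate is simply the difference between two samples divided by the elapsed time; the dashboard's exact computation is not described on this page, and <code>rate_per_second</code> is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: average per-second change of a counter between two
# samples, given as (timestamp in seconds, value) pairs.
# Usage: rate_per_second T1 V1 T2 V2
rate_per_second() {
    awk -v t1="$1" -v v1="$2" -v t2="$3" -v v2="$4" 'BEGIN {
        # Two samples with the same timestamp give no usable rate.
        if (t2 == t1) exit 1
        printf "%.3f\n", (v2 - v1) / (t2 - t1)
    }'
}

# Example: a counter going from 5000 to 8000 in 15 seconds.
rate_per_second 100 5000 115 8000
```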
 = Pushing custom metrics =
-A simple mechanism is available to let you push your own, arbitrary, custom metrics to Kwollect (internally it uses [https://github.com/prometheus/node_exporter#textfile-collector Prometheus Node Exporter "Textfile Collector"]). From a node, a custom metric will be collected by writing to a specific file (you currently need ''root'' privileges to do so):
+A simple mechanism is available to let you push your own arbitrary custom metrics. To push metrics collected inside a node to Kwollect, send a POST request to the following API endpoint:
-  $ echo 'kwollect_custom{_metric_id="my_metric"} 42' > /var/lib/prometheus/node-exporter/kwollect.prom
+  https://api.grid5000.fr/stable/sites/SITE/metrics
-This will push a custom metric named ''my_metric'' and with value "42" (this file may contain several lines to push different values at a time).
+The request must include the metric to be inserted, formatted as JSON, for example:
-The associated timestamp will be the time when Kwollect fetches metrics on the node (every 15 seconds under Grid'5000). You may override the timestamp by adding ''_timestamp="your_timestamp_in_unix_seconds"''. For instance:
+ {"metric_id": "METRIC_NAME", "value": VALUE}
- $ echo 'kwollect_custom{_metric_id="my_metric", _timestamp="1606389005.1234"} 42' > /var/lib/prometheus/node-exporter/kwollect.prom
+Optionally, a "timestamp" value can be provided (otherwise, the current time will be used as the metric's timestamp). A "device_id" field can also be provided, if it corresponds to a node reserved by the user making the request; otherwise, the node from which the request originates will be used.
-(You can also add your own labels. See Prometheus metrics [https://prometheus.io/docs/practices/naming/ naming] and [https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md format] for more information).
-Kwollect will continuously pull values from <code>/var/lib/prometheus/node-exporter/kwollect.prom</code> file. It is your responsibility to keep its content updated (for instance, only one value for a particular metric should be present in this file at anytime).
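For the POST-based mechanism of the latest revision, a request could be assembled along the following lines. This is a minimal sketch, not the documented invocation: <code>metric_payload</code> and <code>push_metric</code> are hypothetical helper names, and the <code>Content-Type</code> header is an assumption, since the page does not say which headers the endpoint expects.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body for a single metric value.
# Usage: metric_payload METRIC_NAME VALUE   (VALUE must be numeric)
metric_payload() {
    printf '{"metric_id": "%s", "value": %s}' "$1" "$2"
}

# Hypothetical helper: send the payload to the Metrology API of a site.
# The Content-Type header is an assumption; it is not stated on this page.
# Usage: push_metric SITE METRIC_NAME VALUE
push_metric() {
    site=$1; name=$2; value=$3
    curl -X POST \
         -H 'Content-Type: application/json' \
         -d "$(metric_payload "$name" "$value")" \
         "https://api.grid5000.fr/stable/sites/$site/metrics"
}

# Possible usage from a reserved node at Lyon (not executed here):
# push_metric lyon my_metric 42
```

Keeping the payload construction separate from the request makes it easy to check the JSON before anything is sent.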
 = Known problems =
+[[File:Kwollect bmc vs wattmetre.png|thumb|right|Comparison of the power consumption reported by a BMC vs. a wattmeter.]]
-* Metrics from BMC are quite unreliable. They may be inaccurate, or unavailable on some nodes from time to time.
+* Metrics from BMC are quite unreliable. They may be inaccurate, highly averaged, or unavailable on some nodes from time to time.
-* Some metrics are not available for every nodes of a cluster: For grcinq and hercule clusters, one every four nodes have more metrics available than the others ; on parasilo and paravance clusters, the measurement of the power consumption by PDUs is only available on some nodes
+* Some metrics are not available on every node of a cluster: on the grcinq and hercule clusters, one node in every four has more metrics available than the others; on the parasilo and paravance clusters, power consumption measurement by PDUs is only available on some nodes; wattmeters are only available on the gros-[41-76] nodes. Such metrics have an "only_for" entry in their description indicating the nodes where they are available.
 * It may happen that a few values of a metric are not collected quickly enough to comply with the interval described in the Reference API (for instance, when the targeted device is overloaded).
 * Electrical consumption reported by PDU is not always reliable, see [[Power_Monitoring_Devices#measurement_artifacts_and_pitfalls]]
+= Metrics available in Grid'5000 =
+{{:Generated/KwollectMetrics}}

Monitoring Using Kwollect: Difference between revisions
