Monitoring Using Kwollect

From Grid5000
Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This page describes the monitoring service available in Grid’5000 that uses Kwollect to retrieve environmental and performance metrics from nodes.

The service currently provides metrics for:

  • Energy consumption from dedicated “wattmetre” devices (currently available for some clusters in Lyon, Grenoble and Nancy)
  • Metrics collected from the nodes’ Baseboard Management Controller ("BMC": out-of-band management hardware, such as Dell iDRAC), such as ambient temperature, hardware component temperature, energy consumption from PSU, fan speed, etc.
  • Traffic collected from network devices
  • Energy consumption from PDU, when available
Warning

Monitoring with Kwollect in Grid'5000 is still in beta. It uses the "sid" branch of the API, while the legacy monitoring API (based on Ganglia and Kwapi) still uses the "stable" branch. Kwollect is intended to replace the legacy system in the future.

Metrics available

The list of metrics available for a given Grid’5000 cluster is described in the Reference API, under the “metrics” entry of the cluster description. For instance, to get the list of available metrics for the nodes of the taurus cluster, you can use:

$ curl | jq .metrics

This returns a list where each element describes a single metric. The most important fields of that description are:

  • name: The name identifying the metric
  • description: A human-readable description of the metric
  • labels (optional): A “label” that will be used to distinguish two metrics of the same kind, but targeting different objects (e.g. temperature of CPU #1 vs temperature of CPU #2)
  • period and optional_period: The interval (in milliseconds) at which the metric is collected. The former is the default interval; the latter is the interval used when “Full monitoring mode” is in use (see below). A metric whose period value is 0 is not collected by default.
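As an illustration of how period and optional_period interact, the sketch below classifies metric descriptions by the mode in which they are collected. The entries in SAMPLE_METRICS are made up for the example: only the field names and the metric name wattmetre_power_watt come from this page, while the second metric name and all period values are hypothetical. In practice the list would be the “metrics” entry returned by the Reference API. The fallback to period when optional_period is missing or 0 is also an assumption.

```python
# Illustrative data only: shaped like the "metrics" entry of a cluster
# description, but the values below are assumptions, not real API output.
SAMPLE_METRICS = [
    {"name": "wattmetre_power_watt", "description": "Power from wattmetre",
     "period": 1000, "optional_period": 20},
    {"name": "example_bmc_temp_celsius", "description": "Hypothetical BMC metric",
     "period": 0, "optional_period": 5000, "labels": {"sensor": "cpu1"}},
]


def collection_period_ms(metric, full_monitoring=False):
    """Return the collection interval in milliseconds, or None if the
    metric is not collected in the given mode."""
    period = metric.get("period", 0)
    if full_monitoring:
        # Assumption: a missing or zero optional_period falls back to period.
        period = metric.get("optional_period") or period
    return period if period > 0 else None


def only_in_full_monitoring(metrics):
    """Names of metrics that are collected only in full monitoring mode."""
    return [m["name"] for m in metrics
            if collection_period_ms(m) is None
            and collection_period_ms(m, full_monitoring=True) is not None]


if __name__ == "__main__":
    for m in SAMPLE_METRICS:
        print(m["name"],
              collection_period_ms(m),
              collection_period_ms(m, full_monitoring=True))
    print(only_in_full_monitoring(SAMPLE_METRICS))
```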

“Full monitoring mode” can be enabled for specific jobs by adding the “-t monitor” option to oarsub. Non-default metrics will then be available for the nodes belonging to this job. For example:

$ oarsub -I -t monitor

Note: Dedicated wattmetre devices (metric “wattmetre_power_watt”) are able to perform one measurement every 20 milliseconds. However, this high frequency is only provided under “Full monitoring mode”. By default, only the value averaged over one second is provided.

Getting metrics values

Metric values are stored by Kwollect and are available through the Metrology API (under the “sid” version) by performing a GET request at URL:<site>/metrics

The following parameters are supported:

  • nodes: The list of nodes, separated by ‘,’ (comma), on which to obtain values
  • start_time: The time from which to obtain the values. By default it is one hour before the current time.
  • stop_time: The time until which to obtain the values. By default it is the current time.
  • job_id: A job identifier. This parameter is an alternative to the previous three, as the node list, start time and stop time are then taken from the job's characteristics.
  • metrics: The list of metric names, separated by ‘,’ (comma), for which to obtain values (metric names are described in the Reference API, see above). By default all metrics are returned.

For instance, to:

  • get all metrics collected inside job 12345 at Lyon:
curl ''
  • get all metrics from chifflot-5 and chifflot-6 between 2020-06-08 15:00 and 2020-06-08 17:00:
curl ',chifflot-6&start_time=2020-06-08T15:00&stop_time=2020-06-08T17:00'
  • get all values from the wattmetre for taurus-2 during the last 15 minutes:
curl "$(date -d '15 min ago' +%s)"
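The query strings for requests like the three above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the function name metrics_query is hypothetical, and the result is meant to be appended after a “?” to the <site>/metrics URL of the Metrology API (the base URL is deliberately left out here). Parameters that are not given are omitted, so the service defaults described above apply.

```python
import time
from urllib.parse import urlencode


def metrics_query(nodes=None, metrics=None, start_time=None,
                  stop_time=None, job_id=None):
    """Build the query string for a GET request on <site>/metrics.

    nodes and metrics are lists of names; they are joined with commas.
    """
    params = {}
    if job_id is not None:
        params["job_id"] = job_id
    if nodes:
        params["nodes"] = ",".join(nodes)
    if metrics:
        params["metrics"] = ",".join(metrics)
    if start_time is not None:
        params["start_time"] = start_time
    if stop_time is not None:
        params["stop_time"] = stop_time
    # safe=",:" keeps commas and colons readable instead of percent-encoding them
    return urlencode(params, safe=",:")


if __name__ == "__main__":
    # All metrics collected inside job 12345
    print(metrics_query(job_id=12345))
    # All metrics from two nodes over a fixed time window
    print(metrics_query(nodes=["chifflot-5", "chifflot-6"],
                        start_time="2020-06-08T15:00",
                        stop_time="2020-06-08T17:00"))
    # Wattmetre values for taurus-2 during the last 15 minutes (epoch seconds)
    print(metrics_query(nodes=["taurus-2"],
                        metrics=["wattmetre_power_watt"],
                        start_time=int(time.time()) - 15 * 60))
```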

The request returns a JSON-formatted list of values, each containing the following information:

  • timestamp: The time at which the value was collected
  • device_id: The identifier of the device (such as the node name) on which the value was collected
  • metric_id: The name of the metric for that value
  • value: The value collected for that metric
  • labels: Some optional additional information (see labels description above)
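To show how such a list might be processed, the sketch below parses a response body and averages one metric per device. SAMPLE_RESPONSE is a hypothetical response: the field names follow the list above, but the values, the node taurus-3 and the timestamp format are assumptions made for the example, not real API output.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical response body, for illustration only.
SAMPLE_RESPONSE = """
[
  {"timestamp": "2020-06-08T15:00:00+02:00", "device_id": "taurus-2",
   "metric_id": "wattmetre_power_watt", "value": 120.0, "labels": {}},
  {"timestamp": "2020-06-08T15:00:01+02:00", "device_id": "taurus-2",
   "metric_id": "wattmetre_power_watt", "value": 130.0, "labels": {}},
  {"timestamp": "2020-06-08T15:00:00+02:00", "device_id": "taurus-3",
   "metric_id": "wattmetre_power_watt", "value": 100.0, "labels": {}}
]
"""


def average_by_device(values, metric_id):
    """Average the values of one metric, grouped by device_id."""
    per_device = defaultdict(list)
    for entry in values:
        if entry["metric_id"] == metric_id:
            per_device[entry["device_id"]].append(entry["value"])
    return {device: mean(vals) for device, vals in per_device.items()}


if __name__ == "__main__":
    values = json.loads(SAMPLE_RESPONSE)
    print(average_by_device(values, "wattmetre_power_watt"))
```

Grouping on device_id alone is enough here because the sample contains a single metric without labels; for metrics that carry labels (for example, one temperature per CPU), the labels field would need to be part of the grouping key as well.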

Known problems

  • Metrics from BMCs are quite unreliable. They may be inaccurate, or unavailable on some nodes from time to time.
  • Some metrics are not available for every node of a cluster: on the grcinq and hercule clusters, one node in every four has more metrics available than the others; on the parasilo and paravance clusters, the measurement of power consumption by PDUs is only available on some nodes.
  • A few values of a metric may not be collected quickly enough to comply with the interval described in the Reference API (for instance, when the targeted device is overloaded).
  • Electrical consumption reported by PDU is not always reliable, see Power_Monitoring_Devices#measurement_artifacts_and_pitfalls
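Late or missing values, as mentioned in the third item above, can be spotted by comparing the spacing of consecutive timestamps with the metric's period. The following is a minimal sketch under two assumptions: timestamps are ISO 8601 strings that datetime.fromisoformat can parse, and the tolerance factor (a hypothetical parameter, not part of the service) decides how much delay is still acceptable.

```python
from datetime import datetime


def find_gaps(timestamps, period_ms, tolerance=1.5):
    """Return (previous, current) pairs of consecutive timestamps whose
    spacing exceeds tolerance * period_ms."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    limit_s = tolerance * period_ms / 1000.0
    return [(a.isoformat(), b.isoformat())
            for a, b in zip(times, times[1:])
            if (b - a).total_seconds() > limit_s]


if __name__ == "__main__":
    # Illustrative timestamps for a metric with a 1000 ms period:
    # the values expected at 15:00:02 and 15:00:03 are missing.
    sample = [
        "2020-06-08T15:00:00+02:00",
        "2020-06-08T15:00:01+02:00",
        "2020-06-08T15:00:04+02:00",
        "2020-06-08T15:00:05+02:00",
    ]
    print(find_gaps(sample, period_ms=1000))
```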