Traceable performance evaluation: Difference between revisions

From Grid5000

Latest revision as of 15:47, 24 November 2021

Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

This page presents good practices when performing performance evaluations (benchmarks, ...) using Grid'5000 nodes. It is not about benchmarking Grid'5000 itself.

About the Grid'5000 nodes' operating system

Grid'5000 nodes can be used with a choice of system environments: either the default (standard) environment, which is pre-provisioned on nodes, or one of a variety of environments deployed on demand by the user. The options are compared below.

Using a node running the default (standard) environment

In this case, access to the system of the node does not require a deployment (kadeploy), so access is quicker. However, the system comes with many features and services that may affect measurements, for instance in benchmarks. It may be worthwhile to uninstall or deactivate them before running a performance evaluation.

The services running on the system can be listed with systemctl. The system can then be modified by using sudo-g5k to obtain root privileges.

Services can be deactivated (with sudo-g5k systemctl stop service). More generally, software packages can be uninstalled (e.g. with sudo-g5k apt-get remove package; any associated services should then be stopped as well, but it is worth checking).

The following services are known to possibly affect benchmarks:

  • prometheus-node-exporter.service (Prometheus exporter for machine metrics) is known to cause power consumption spikes every 15 seconds, since metrics are polled at that interval.
  • dcgm-exporter.service (NVIDIA DCGM Prometheus exporter service), which is active (running) on machines equipped with GPUs, can also be deactivated.
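
As a minimal sketch of the deactivation step described above, the following shell function stops the two exporter services listed in this section. The function name stop_services and the DRY_RUN switch are assumptions introduced for this example. By default the function only prints the commands it would run.

```shell
# Sketch only: stop the exporter services named in this section.
# stop_services and DRY_RUN are hypothetical names introduced for this example.
# dcgm-exporter.service only exists on GPU nodes, so on other nodes
# systemctl is expected to report that the unit is not loaded.

SERVICES="prometheus-node-exporter.service dcgm-exporter.service"

stop_services() {
    for svc in $SERVICES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show the command instead of executing it.
            echo "sudo-g5k systemctl stop $svc"
        else
            sudo-g5k systemctl stop "$svc"
        fi
    done
}
```

On a node, running DRY_RUN=0 stop_services would actually stop the services. The result can then be checked with systemctl is-active followed by the service name.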

Also, the Grid'5000 default environment includes some common system tweaks that may bias benchmarks. The CPU parameters page provides relevant information.

Using a deployed environment

Using kadeploy is very relevant in the context of performance evaluation, as it allows a user to operate the nodes with the most minimal system that is sufficient for the use case. For instance, one can deploy the -min variant of one of the Grid'5000 supported environments, or a customized environment, possibly built on top of a minimal environment and featuring only the software required for running a benchmark.

Note

By design, the -min variants of the Grid'5000 supported environments do not change the default behavior of the Linux distribution they are based on.

The recipes from which Grid'5000 environments are built are available in https://gitlab.inria.fr/grid5000/environments-recipes. See Environment creation for more information about how environments are built.
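
The deployment of a minimal environment described in this section can be sketched as follows. The function name deploy_min_env, the DRY_RUN switch and the default environment name debian11-x64-min are assumptions made for this example; the environments actually available should be checked with kaenv3. The sketch also assumes a job of type deploy whose node list is given by OAR in $OAR_NODEFILE.

```shell
# Sketch only: deploy a minimal environment on the nodes of the current job.
# deploy_min_env, DRY_RUN and the default environment name are assumptions
# made for this example; check the available environments with kaenv3.

deploy_min_env() {
    env_name="${1:-debian11-x64-min}"
    nodefile="${2:-${OAR_NODEFILE:-}}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show the command instead of executing it.
        echo "kadeploy3 -e $env_name -f $nodefile -k"
    else
        # -e: environment name, -f: file listing the nodes,
        # -k: copy the user's SSH key to the deployed nodes.
        kadeploy3 -e "$env_name" -f "$nodefile" -k
    fi
}
```

Called without arguments in dry-run mode, the function prints the command so that it can be reviewed and stored with the experiment notes.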

About the traceability of the infrastructure

Special effort is made in Grid'5000 to keep track of infrastructure changes, so that users can find out, for instance, which side effects may have affected an experiment.

The following sources of information are available:

News
The most notable changes in the infrastructure are announced on the News page and on the Grid'5000 Twitter account, and are sent to the users' mailing list (users@list.grid5000.fr). It is strongly recommended to follow the news.
Environment recipes
Changes in the way the operating system of nodes is built are tracked in the environment recipes git repository: https://gitlab.inria.fr/grid5000/environments-recipes.
  • It is a good idea to keep track of the version of the system used for an experiment by storing the information provided in /etc/grid5000/release, for the standard environment or any other Grid'5000 supported environment. That file contains the name of the environment suffixed with its version (as shown by the kaenv3 command on frontends), as well as the SHA ID of the git commit from which the environment was built. In the git repository, tags also mark the commits used for building each version of the supported environments. Thus, older environments (used for a previous experiment) can be redeployed on nodes with kadeploy3 in order to help reproduce results.
  • For Grid'5000 supported environments, it is also worth storing the environment postinstall (g5k-postinstall command and version) that was used, which is provided in the /etc/grid5000/postinstall file. The git repository of g5k-postinstall is https://gitlab.inria.fr/grid5000/g5k-postinstall.
Reference repository & Reference API
Detailed information about the components of the infrastructure, and changes to them, are recorded in the reference repository, which is served by the Grid'5000 Reference API (REST API). Updates of firmware versions, for instance, are tracked there. It is a good idea to keep track of the state of the platform at the time of the experiment, as explained in the "Platform state and reproducibility" section of the API tutorial.
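
As an illustration of the advice given above under Environment recipes, the following sketch copies the release and postinstall files next to the results of an experiment. The function name record_env_metadata, its arguments and the names of the output files are assumptions introduced for this example.

```shell
# Sketch only: save the environment description files with experiment results.
# record_env_metadata, its arguments and the output file names are
# hypothetical and introduced for this example.

record_env_metadata() {
    outdir="$1"                     # directory holding the experiment results
    srcdir="${2:-/etc/grid5000}"    # location of the release/postinstall files
    mkdir -p "$outdir" || return 1
    for f in release postinstall; do
        if [ -r "$srcdir/$f" ]; then
            cp "$srcdir/$f" "$outdir/grid5000-$f"
        else
            echo "warning: $srcdir/$f not found" >&2
        fi
    done
    # Record when the snapshot was taken (UTC).
    date -u +%Y-%m-%dT%H:%M:%SZ > "$outdir/recorded-at"
}
```

On a node, record_env_metadata followed by the name of a results directory would be run at the start of the experiment, before any measurement.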

Finally, a new tool currently in alpha version allows for bundling infrastructure metadata information related to an experiment. See Grid5000 Metadata Bundler for more details.

About the Grid'5000 node monitoring service

Grid'5000 provides the kwollect service that can collect information about nodes during a job: energy, network traffic, and more. See the kwollect page or the Energy consumption monitoring tutorial.
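
As a rough sketch of how such data might be retrieved after a job, the function below builds a query URL for the metrics endpoint of the Grid'5000 API. The function name kwollect_url is hypothetical. The URL layout and parameter names are given here as assumptions and should be checked against the kwollect page, as should the names of the available metrics.

```shell
# Sketch only: build a metrics query URL for one node and one metric.
# kwollect_url is a hypothetical helper; the URL layout and the parameter
# names are assumptions to be checked against the kwollect page.

kwollect_url() {
    site="$1"     # Grid'5000 site hosting the node
    node="$2"     # node name
    metric="$3"   # metric name, as listed on the kwollect page
    start="$4"    # start of the time window
    end="$5"      # end of the time window
    echo "https://api.grid5000.fr/stable/sites/$site/metrics?nodes=$node&metrics=$metric&start_time=$start&end_time=$end"
}
```

The resulting URL can be passed to curl from a frontend, and the response stored with the experiment results.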

Note

Be aware that if you deactivate the Prometheus exporter services on a node (as discussed above), some data will no longer be gathered for that node in kwollect: CPU usage, memory, etc. Environmental data gathered from network equipment or from the electrical infrastructure will still be collected.