G5k-checks

From Grid5000
{{See also| [[G5k-checks]] | [[New_G5k-checks | New G5k-checks ]] | [[Node qualification]]}}
{{Maintainer|Emile Morel}}
{{Maintainer|Philippe Robert}}
{{Author|Emile Morel}}
{{Author|Philippe Combes}}
{{Author|Philippe Robert}}
{{Portal|Admin}}
{{Portal|User}}
{{Portal|Service}}
{{Status|Open for comment}}


= Description =
 
== Overview ==
 
* g5k-checks is expected to be integrated into the standard environment of the Grid'5000 computational nodes. It checks that a node meets several basic requirements before it declares itself as available to the OAR server.
* This lets the admins enable some checkers which may be very specific to the hardware of a cluster.
 
== Architecture ==
 
G5kchecks is based on the rspec test suite. This is a slight detour from rspec's original purpose, testing programs: here we use it to test all node characteristics. The first step is to retrieve node information with ohai. By default, ohai provides a large set of machine characteristics; in addition, we have developed some plugins to fill in missing information (particularly for the disk, the CPU and the network). The second step is to compare those characteristics with the Grid'5000 [[TechTeam:Reference_Repository|Reference_Repository]]. To do that, g5kchecks takes each value of the API and compares it with the value reported by ohai. If the values do not match, an error is raised through the rspec process.
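The comparison step can be sketched in a few lines of Ruby. This is a simplified illustration, not the actual g5k-checks code, and the hash contents are made-up examples:

```ruby
# Walk the reference-API description and check each value against the one
# reported by ohai, collecting one error message per mismatch.
def compare(api, ohai, path = [])
  errors = []
  api.each do |key, expected|
    actual = ohai[key]
    if expected.is_a?(Hash) && actual.is_a?(Hash)
      errors.concat(compare(expected, actual, path + [key]))
    elsif actual != expected
      errors << "#{(path + [key]).join('/')}: expected #{expected.inspect}, got #{actual.inspect}"
    end
  end
  errors
end

api  = { 'processor' => { 'clock_speed' => 2_100_000_000, 'ht_capable' => true } }
ohai = { 'processor' => { 'clock_speed' => 2_100_000_000, 'ht_capable' => false } }
puts compare(api, ohai)
```

In the real tool, each mismatch surfaces as a failing rspec example rather than a printed string.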
 
== OAR ==
 
systemd is used to start OAR on nodes: a single service is in charge of managing the sshd daemon (starting and stopping it) and of launching the oar-node script (<code class="command">/etc/oar/oar-node-service</code>) with an argument ('start' or 'stop').
 
* The oar-node flavour of the OAR installation, <code class="command">/etc/oar/oar-node-service</code>, launches <code class="command">/usr/lib/oar/oarnodecheckrun</code>, which then runs the executable file <code class="command">/etc/oar/check.d/start_g5kchecks</code>. The OAR server periodically invokes <code class="command">/usr/bin/oarnodecheckquery</code> remotely. This command returns with status 1 if <code class="command">/var/lib/oar/checklogs/</code> is not empty, 0 otherwise. So if <code class="command">/etc/oar/check.d/start_g5kchecks</code> finds something wrong, it simply has to create a log file in that directory.
* If <code class="command">oarnodecheckquery</code> fails, the node is not ready to start: the service loops on running those scripts until either <code class="command">oarnodecheckquery</code> returns 0 or a timeout is reached. If the timeout is reached, the node does not attempt to declare itself as "Alive".
 
This summarizes when g5kchecks is run:
* At oar-node service start with <code class="command">/etc/oar/oar-node-service</code>
* Between (non-deploy) jobs, with remote execution of <code class="command">oarnodecheckrun</code> and <code class="command">oarnodecheckquery</code> (in the case of deploy jobs, the first type of execution takes place)
* Launched manually by a user (for now, this never happens)

G5kchecks is never run during user jobs.
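The contract between the checkers and <code class="command">oarnodecheckquery</code> described above can be sketched as follows. This is a simplified Ruby illustration, not the actual OAR scripts; a temporary directory stands in for the real checklog directory:

```ruby
require 'tmpdir'

# A failing checker flags a problem by dropping a file in the checklog
# directory; the query only tests whether that directory is empty.
checklog_dir = Dir.mktmpdir  # stands in for /var/lib/oar/checklogs/

# exit status 1 (node suspected) if the directory is not empty, 0 otherwise
query = ->(dir) { Dir.empty?(dir) ? 0 : 1 }

puts "node OK" if query.call(checklog_dir) == 0
# a failing check reports by creating a log file:
File.write(File.join(checklog_dir, 'Architecture_wrong_thread_count'), '')
puts "node suspected" if query.call(checklog_dir) == 1
```

If the check later succeeds and the log file is removed, the next query returns 0 again and the node can be set back to alive.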
 
= Checks Overview =
 
The following values are checked by g5k-checks:
<pre>
 
# Generated by g5k-checks (g5k-checks -m api)
---
network_adapters:
  bmc:
    ip: 172.17.52.9
    mac: 18:66:da:7c:96:1a
    management: true
  eno1:
    name: eno1
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:16
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno2:
    name: eno2
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:17
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno3:
    name: eno3
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:18
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno4:
    name: eno4
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:19
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  enp5s0f0:
    name: enp5s0f0
    interface: Ethernet
    ip: 172.16.52.9
    driver: ixgbe
    mac: a0:36:9f:ce:e4:24
    rate: 10000000000
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: true
    management: false
  enp5s0f1:
    name: enp5s0f1
    interface: Ethernet
    driver: ixgbe
    mac: a0:36:9f:ce:e4:26
    rate: 0
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: false
    management: false
operating_system:
  ht_enabled: true
  pstate_driver: intel_pstate
  pstate_governor: performance
  turboboost_enabled: true
  cstate_driver: intel_idle
  cstate_governor: menu
architecture:
  platform_type: x86_64
  nb_procs: 2
  nb_cores: 16
  nb_threads: 32
chassis:
  serial: 7W26RG2
  manufacturer: Dell Inc.
  name: PowerEdge R430
main_memory:
  ram_size: 68719476736
supported_job_types:
  virtual: ivt
bios:
  vendor: Dell Inc.
  version: 2.2.5
  release_date: '09/08/2016'
processor:
  clock_speed: 2100000000
  instruction_set: x86-64
  model: Intel Xeon
  version: E5-2620 v4
  vendor: Intel
  other_description: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  cache_l1i: 32768
  cache_l1d: 32768
  cache_l2: 262144
  cache_l3: 20971520
  ht_capable: true
storage_devices:
  sda:
    device: sda
    by_id: "/dev/disk/by-id/wwn-0x6847beb0d535ed001fa67d1a12d0d135"
    by_path: "/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0"
    size: 598879502336
    model: PERC H330 Mini
    firmware_version: 4.26
    vendor: DELL
 
</pre>
 
This is an example of the output file in API mode (g5k-checks launched with the -m api option).


In addition, not all tests export data to this file. The following values are also checked:


* Grid'5000 standard environment version
* Grid'5000 post-install scripts version
* Usage of sudo-g5k (the check fails if it was used, since it could be destructive to other parts of the system)
* Correct mode of /tmp/
* Fstab partitions mounted and valid
* All partitions have the expected size, position, offset, mount options, ...
* Correct KVM driver
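As an illustration of the fstab check idea, here is a hedged Ruby sketch (not the actual g5k-checks implementation; the fstab content is a made-up example): every regular fstab entry should appear among the mounted filesystems.

```ruby
# Return the fstab mount points that are expected to be mounted but are
# not present in the given list of mounted filesystems.
def unmounted_entries(fstab_text, mounted_points)
  fstab_text.each_line.filter_map do |line|
    next if line =~ /^\s*(#|$)/            # skip comments and blank lines
    _dev, mnt, _type, opts = line.split
    next if mnt == 'none' || opts.to_s.include?('noauto')  # swap, on-demand mounts
    mnt unless mounted_points.include?(mnt)
  end
end

fstab = <<~EOF
  # /etc/fstab example
  /dev/sda2 /     ext4 defaults 0 1
  /dev/sda5 /tmp  ext4 defaults 0 2
  /dev/sda3 none  swap sw       0 0
EOF

puts unmounted_entries(fstab, ['/'])   # /tmp is missing from the mounts
```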
 
= Simple usage =
== Installation ==
 
G5kchecks is currently tested on Debian buster. To use the Grid'5000 Debian repository, just add to /etc/apt/sources.list:
 deb http://packages-ext.grid5000.fr/deb/g5k-checks/buster /
 
{{Term|location=node|cmd=<code class="command">apt-get</code> update }}
Install it:
{{Term|location=node|cmd=<code class="command">apt-get</code> install g5kchecks}}
 
== Get sources ==
 
<code> git clone https://github.com/grid5000/g5k-checks.git </code>
 
== Run g5k-checks ==
 
If you want to check your node just run:
  {{Term|location=node|cmd=<code class="command">g5k-checks -v</code>}}
 
The output should highlight failing tests in red. Also, if some error occurred, g5k-checks puts files in /var/lib/oar/checklogs/. For instance:
 
  root@adonis-3:~# g5k-checks
  root@adonis-3:~# ls /var/lib/oar/checklogs/
  OAR_Architecture_should_have_the_correct_number_of_thread
 
You can see the detail of the values checked this way:
 
  root@adonis-3:~# cat /var/lib/oar/checklogs/OAR_Architecture_should_have_the_correct_number_of_thread
 
== Get node description ==
 
G5k-checks serves two purposes. It can check a node description against our reference API and detect errors.
But it can also generate the data to populate this reference API.
 
If you want an exact node description, you can run:
 
  {{Term|location=node|cmd=<code class="command">g5k-checks</code> -m api}}
 
(If launched with the -v verbose option, you will see that almost all tests fail; this is normal, as empty values are checked instead of real ones.)
 
Then g5k-checks puts a json and a yaml file in /tmp/:
 
  root@adonis-3:~# g5k-checks -m api
  root@adonis-3:~# ls /tmp/
  adonis-3.grenoble.grid5000.fr.json  adonis-3.grenoble.grid5000.fr.yaml
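The generated yaml file follows the structure shown in the Checks Overview section, so it can be inspected with standard tooling. For instance, in Ruby (the description here is an inline made-up excerpt rather than a real /tmp file):

```ruby
require 'yaml'

# Parse a g5k-checks "-m api" style yaml description and print a couple
# of fields. Illustrative only: we build the description inline instead
# of reading /tmp/<node>.<site>.grid5000.fr.yaml.
desc = YAML.load(<<~EOF)
  architecture:
    platform_type: x86_64
    nb_cores: 16
  processor:
    model: Intel Xeon
    version: E5-2620 v4
EOF

puts "#{desc['processor']['model']} #{desc['processor']['version']}"
puts "cores: #{desc['architecture']['nb_cores']}"
```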
 
= Write your own checks/description =
 
== G5k-checks internal ==
G5k-checks is written in Ruby on top of the rspec test framework. It gathers information from the ohai program and compares it with Grid'5000 reference API data. Rspec is simple to read and write, so you can easily copy other checks and adapt them to your needs.
 
File tree is:
 
  ├── ohai  # Ohai plugins; this information is later used by g5k-checks
  ├── rspec # Additional Rspec formatters (store information in different ways)
  ├── spec  # Checks directory
  └── utils # Some useful classes
 
== Play with ohai ==
 
[http://docs.opscode.com/ohai.html Ohai] is a small program that retrieves information from various files and programs on the host. It offers easy-to-parse JSON output. We can add information to the JSON just by writing plugins. For instance, to add the version of bash to the description, you can create a small file /usr/lib/ruby/vendor_ruby/g5kchecks/ohai/package_version.rb with:
 
<pre>
Ohai.plugin(:Packages) do
 
  provides "packages"
 
  collect_data do
      packages Mash.new
      packages[:bash] = `dpkg -l | grep bash | awk '{print $3}'`
      packages
  end
end
</pre>
 
== Play with Rspec ==
 
[http://rspec.info/ Rspec] is a framework for testing Ruby programs. G5k-checks uses Rspec not to test a Ruby program, but to test the host. Rspec is simple to read and write. For instance, to ensure that the bash version is the expected one, you can create a file /usr/lib/ruby/vendor_ruby/g5kchecks/spec/packages/packages_spec.rb with:
 
  describe "Packages" do

    before(:all) do
      @system = RSpec.configuration.node.ohai_description
    end

    it "bash should have the good version" do
      puts @system[:packages][:bash].to_yaml
      bash_version = @system[:packages][:bash].strip
      bash_version.should eql("4.2+dfsg-0.1"), "#{bash_version}, 4.2+dfsg-0.1, packages, bash"
    end

  end
 
== Add checks ==
 
Example: I want to check if flag "acpi" is available on the processor:
 
Add to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:


  it "should have acpi" do
    acpi_ohai = @system[:cpu][:'0'][:flags].include?('acpi')
    acpi_ohai.should_not be_false, "#{acpi_ohai}, is not acpi, processor, acpi"
  end


== Add informations in description ==


Example: I want to add the bogomips of a node:


First, we should add the information to the ohai description. To do this, add the following to the file ohai/cpu.rb after line 80:
<pre>
    if line =~ /^BogoMIPS/
      cpu[:Bogo] = line.chomp.split(": ").last.lstrip
    end
</pre>


Then we can retrieve the information and add it to the description. To do this, add the following to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:
<pre>
    it "should have BogoMIPS" do
      bogo_ohai = @system[:cpu][:Bogo]
      #First value is from the system, second is from the API, third is the YAML path in the created '/tmp/' file for -m api mode.
      #Last argument is false to export value in API mode, true to skip
      Utils.test(bogo_ohai, nil, 'processor/bogoMIPS', false) do |v_ohai, v_api, error_msg|
          expect(v_ohai).to eql(v_api), error_msg
      end
    end
</pre>


Now you have the information in /tmp/mynode.mysite.grid5000.fr.yaml:


    root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# g5k-checks -m api
    root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# grep -C 3 bogo /tmp/graphene-100.nancy.grid5000.fr.yaml
      ram_size: 16860348416
    processor:
      clock_speed: 2530000000
      bogoMIPS: 5053.74
      instruction_set: x86-64
      model: Intel Xeon
      version: X3440


= Releasing and testing =


== Tests and reference-repository update ==


Before creating a new standard environment, g5k-checks can be tested on target environments using the jenkins test: https://intranet.grid5000.fr/jenkins/job/test_g5kchecksdev


This test can reserve either all nodes or as many nodes as possible (targets cluster-ALL and cluster-BEST) on each cluster of Grid'5000.


It will check out a (configurable) branch of g5k-checks and test it against a (configurable) branch of the reference-api.


The test will fail if a mandatory check fails (i.e. if there are entries in /var/lib/oar/checklogs).


Also, the Yaml output of the "-m api" option of g5k-checks will be written to the ''$HOME/g5k-checks-output'' directory of the ''ajenkins'' user on the target site.


Note: it is possible to change the branches of both reference-repository and g5k-checks for the test by configuring the jenkins test:


<pre>
  cd /srv/jenkins-scripts && ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.test('$site_cluster', 'custom', 'dev_feature', 'dev_feature_refrepo')"
</pre>


For example, this will take the 'dev_feature' branch of g5k-checks and test it against the data present in the 'dev_feature_refrepo' branch of the reference-api.


===== Updating the reference-repository =====


Once the tests are finished on the desired clusters, generated Yaml files must be imported manually.


* In the reference repository, go into the ''generators/run-g5kchecks'' directory.
* Now get the yaml files you want to include. For example:
<pre>
rsync -va "rennes.adm:/home/ajenkins/g5k-checks-output/paravance*.yaml" ./output/
</pre>
The ''output'' directory holds the temporary files that will be included as ''input'' in the reference-repository.
* Then import YAML files into the reference-repository with:
<pre>
rake g5k-checks-import SOURCEDIR=<path to the output dir>
</pre>


If values seem correct, generate JSON and commit:
<pre>
  rake reference-api
  git diff data/
  git add data input
  git commit -m"[SITE] g5k-checks updates"
</pre>


== Release a new version ==


Once the modifications have been tested on as many clusters as possible, a new version can be released.


See '''[[TechTeam:Git_Packaging_and_Deployment#D.C3.A9tails_du_workflow_de_release_.28et_configuration_de_gitlab-ci.29|here]]''' for general instructions about the release workflow.


== Environment update ==


The version of g5k-checks included in standard environment is defined in the following file:
''steps/data/setup/puppet/modules/env/manifests/common/software_versions.pp''


Once the environment is correct and its version updated, it can be generated with the automated jenkins job:  
https://intranet.grid5000.fr/jenkins/job/env_generate/


== New environment release and reference-api update guidelines ==


The following procedure summarizes the steps taken to test and deploy a new environment with g5k-checks.


G5k-checks relies on the reference-api to check system data against it. Data from the reference-api must be up-to-date for tests to succeed but most of this data is generated by g5k-checks itself, creating a sort of 'circular dependency'.
To avoid dead nodes, g5k-checks data from all nodes should be gathered before pushing a new environment.


* Do a reservation of all nodes of G5K, for example: ''oarsub -t placeholder=maintenance -l nodes=BEST,walltime=06:00 -q admin -n 'Maintenance' -r '2017-08-31 09:00:00'''


The reservation should happen early enough to ensure most (ideally all) of the resources will be available at that time.


* Prepare and release a new debian package of g5k-checks (see [[#Release a new version]])


* Prepare a new standard environment with this new g5k-checks version (see [[#Environment update]])


* Now g5k-checks should be run on all reserved nodes in 'api' mode in order to retrieve the yaml description that will be used to update the reference-api.
This step might be the most tedious one but can be done before the actual deployment.
See [[#Tests and reference-repository update]]


* Commit and push these changes to the ''master'' branch of the reference-repository


* Soon after, push the new environment version to all sites using the automated jenkins job: https://intranet.grid5000.fr/jenkins/job/env_push/
The jenkins job does an OAR reservation of type 'destructive' that will force the deployment of the new environment.


* If not all nodes were available at the time of the new g5k-checks data retrieval (which is often the case) or during the environment update, open a {{Bug|}} for all sites to let site administrators finish running g5k-checks on the remaining nodes.


== Run G5k-checks on non-reservable nodes ==


It is common to update the reference-repository values of nodes whose state is 'Dead' on OAR.


An adaptation of the jenkins g5k-checks test has been made to allow running the same test without doing an OAR reservation.


The only difference is that instead of using OAR to reserve nodes and the Kadeploy API to deploy, the nodes are given directly as arguments and kadeploy is called directly from the sites' frontends.


This script must be run on the jenkins machine:


''cd /srv/jenkins-scripts ''


''ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.from_nodes_list()" grisou-{15,16,18}.nancy.grid5000.fr''


Once done, the procedure is the same as described in [[#Updating the reference-repository]].


== Run/fix g5k-checks on a new node of a cluster that is not fully operational yet ==
You may have to work on adapting g5k-checks for a new type of node of a cluster that is not fully integrated in Grid'5000, e.g. not in the API and kadeploy yet.


Even if kadeploy is not fully operational, you should have a "std" environment running on the node, in order to be as close as possible to what g5k-checks expects in terms of OS.


Then, you can try and run <code class=command>g5k-checks -m api</code> in order to generate a yaml description of the node (yaml file generated in /tmp). Of course for a new node, this may not work out of the box: adaptation of g5k-checks may be needed at that first stage already.


Then, you can add the node description in the reference API, using <code class=command>rake g5k-checks-import=DIR</code> in a test branch (e.g. named TESTBRANCH), and do the necessary to get that branch published in the API:
: g5k-parts tackles this issue at boot time. It now check for it (actually all tests about /tmp are performed) along with the periodical checks for NFS mounts.
rake valid:schema
rake valid:duplicates
rake reference-api


; {{No}} oarnodecheckrun should set a timeout on the run of every check
Now you should be able to fetch the API info for the node using <code class=command>curl -s "https://api.grid5000.fr/stable/sites/toulouse/clusters/NEWCLUSTER/nodes/NEWCLUSTER-N?branch=TESTBRANCH" | jq</code>
: Not that easy in bash, but we found a solution. Now there still remain a few things to decide.
: ''What if a check times out ?'' David M. thinks oarnodecheckrun should kill what it can and report the error. Philippe C. thinks only one operation should be performed: the creation of the checklog file, which tags the node as suspectable for oarnodecheckquery. There should be no attempt to kill the check process, and no attempt to run the next checks. Reasons for that are:
# oarnodecheckrun does not know if any fork or exec has been performed by the checker, and it could be a mess to correctly kill it.
# the check could still succeed in the end: the OAR driver of the check would then remove all checklog files, and the next oarnodecheckquery execution would succeed, setting the node back to alive.
# we do not want checker developers to bother about possible concurrent execution with another check.
: The control of the timeout will be given to oarnodecheckrun. The OAR drivers may define a TIMEOUT variable, which oarnodecheckrun would take as a replacement of its default value, 30s.


; {{Inprogress}} The pingchecker at job end is called for deploy jobs too, which have the nodes systematically suspected.
On the node, you should now be able to test g5k-checks against the API information. To do so, first modify <code class=file>/etc/g5k-checks.conf</code> to set the branch to TESTBRANCH. Then run <code class=command>g5k-checks -k -v</code> (-k disables the kadeploy status check).
: Bruno B. will ensure the pingchecker is not called on Absent or scheduled-to-be-Absent nodes. Since the frontend epilogue requests for the job nodes to be set Absent, the job-end-pingchecker will not be called on them. This is fixed in OAR SVN repository (see the [https://gforge.inria.fr/tracker/index.php?func=detail&aid=10658&group_id=125&atid=586 bug entry] in the OAR bug tracker system).


; {{Inprogress}} Problem of the pingchecker timeout.
At this stage again, adapting g5k-check may be needed, possibly directly on the node for quick tests.  
: In the current implementation, oarnodecheckquery would launch the checks if they had not been run for too long. This may happen at job end actually. The problem is that it requires the checks to run within the pingchecker timeout, or we would need a different timeout for job end pingchecker, or even better, another pingchecker for job end. These solutions sound like too tricky, too G5K-specific to OAR developers ears. In the meanwhile, we realized that it was eventually possible (and reliable) to run the checks in the job epilogue on the first node of the job. So the new scheme will be the following:
# oarnodecheckquery is reset to its simple form, only checking for any checklog file existence. It is the OAR pingchecker, called every 5 minutes AND at the end of non-deploy jobs.
# oarnodecheckrun is still launched hourly by cron and run the checks if only no job at all is running on the node.
# The OAR epilogue on the first node will "taktuk" <code>oarnodecheckrun &lt;JOBID&gt; &lt;JOBUSER&gt;</code> on all nodes of the reservation, then exit 0 to avoid setting the job state as ERROR. If some checks fail on some nodes, then the job end pingchecker will suspect those nodes and not all nodes of the job. The additional arguments to oarnodecheckrun have it accept running the checks if the specified job is the only one remaining on the node.
: This solution gets rid of any stamp file.


; Discussions about the framework
Fix in git in a test branch: once the CI is ok, packages are available as artifacts in the CI pipeline, which can be installed on the node with 'dpkg -i ...' initially.
: We g5k-checks developers think RSpec is not well suited for that software, it looks more designed for software regression/test suites. Furthermore, we discovered this solution a bit late, when g5k-checks current framework was already set. But this is no good reason, of course.
: The RSpec framework could suit the hourly checks, but it does not tackle all aspects of g5k-checks, which may run checks at boot time, just like g5k-parts.
: We found that solution tough for Ruby newbies like us. It is quite constraining and makes it difficult to integrate non-Ruby checkers (like g5k-parts again). The main idea of g5k-checks is that people may develop some tests on their own. Then, when their test programs are mature enough, they could be easily integrated into g5k-checks, as a standalone checker. The RSpec framework would discourage this, to our opinion.
: Finally, we have been charged of "reinventing the wheel". But g5k-checks MUST interact with OAR. It would definetely have been "reinventing the wheel" if we had ignored the oarnodecheck mechanism. Then, the oarnodecheck mechanism is already a test framework as a whole. So what ? Should have we used a framework inside another framework ?
: Yes, the RSpec framework is very attractive, but we must be careful not to fall into this common trap of getting things more complex for the sake of simplicity.

Latest revision as of 16:51, 17 August 2023


Description

Overview

  • g5k-checks is expected to be integrated into the standard environment of the Grid'5000 computational nodes. It checks that a node meets several basic requirements before it declares itself as available to the OAR server.
  • This lets the admins enable some checkers which may be very specific to the hardware of a cluster.

Architecture

G5kchecks is based on the RSpec test suite. This is a slight detour from RSpec's primary mission, testing a program: here we use it to test all node characteristics. The first step is to retrieve node information with Ohai. By default, Ohai provides a large set of characteristics of the machine; in addition, we have developed some plugins to complete missing information (particularly for the disk, the CPU and the network). The second step is to compare those characteristics with the Grid'5000 Reference_Repository. To do that, g5kchecks takes each value of the API and compares it with the value given by Ohai. If the values don't match, an error is thrown via the RSpec process.

OAR

systemd is used to start OAR on nodes: a single service is in charge of managing the sshd daemon (starting or stopping it) and launching the oar-node script (/etc/oar/oar-node-service) with an argument (which must be 'start' or 'stop').

  • The oar-node flavour of the OAR installation (/etc/oar/oar-node-service) launches /usr/lib/oar/oarnodecheckrun, which then runs the executable file /etc/oar/check.d/start_g5kchecks. The OAR server periodically invokes /usr/bin/oarnodecheckquery remotely. This command returns with status 1 if /var/lib/oar/checklogs/ is not empty, and 0 otherwise. So if /etc/oar/check.d/start_g5kchecks finds something wrong, it simply has to create a log file in that directory.
  • If oarnodecheckquery fails, the node is not ready to start, and the service loops on running those scripts until either oarnodecheckquery returns 0 or a timeout is reached. If the timeout is reached, the node is not declared "Alive".
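The checklog contract above can be sketched as a small predicate. This is an illustrative assumption, not the actual oarnodecheckquery code; the directory path follows the examples later on this page:

```ruby
# Sketch of the oarnodecheckquery contract: a node is considered suspect
# as soon as any checker has left a log file in the checklog directory.
def checklogs_present?(dir = '/var/lib/oar/checklogs')
  Dir.exist?(dir) && !Dir.empty?(dir)
end
```

oarnodecheckquery would then exit with status 1 when the predicate is true, and 0 otherwise.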

This summarizes when g5kchecks is run:

  • At oar-node service start with /etc/oar/oar-node-service
  • Between (non-deploy) jobs, via remote execution of oarnodecheckrun and oarnodecheckquery (in case of deploy jobs, the first type of execution takes place)
  • Launched manually by a user (for now, this never happens)

G5kchecks is never run during user jobs.

Checks Overview

The following values are checked by g5k-checks:


# Generated by g5k-checks (g5k-checks -m api)
---
network_adapters:
  bmc:
    ip: 172.17.52.9
    mac: 18:66:da:7c:96:1a
    management: true
  eno1:
    name: eno1
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:16
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno2:
    name: eno2
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:17
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno3:
    name: eno3
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:18
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno4:
    name: eno4
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:19
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  enp5s0f0:
    name: enp5s0f0
    interface: Ethernet
    ip: 172.16.52.9
    driver: ixgbe
    mac: a0:36:9f:ce:e4:24
    rate: 10000000000
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: true
    management: false
  enp5s0f1:
    name: enp5s0f1
    interface: Ethernet
    driver: ixgbe
    mac: a0:36:9f:ce:e4:26
    rate: 0
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: false
    management: false
operating_system:
  ht_enabled: true
  pstate_driver: intel_pstate
  pstate_governor: performance
  turboboost_enabled: true
  cstate_driver: intel_idle
  cstate_governor: menu
architecture:
  platform_type: x86_64
  nb_procs: 2
  nb_cores: 16
  nb_threads: 32
chassis:
  serial: 7W26RG2
  manufacturer: Dell Inc.
  name: PowerEdge R430
main_memory:
  ram_size: 68719476736
supported_job_types:
  virtual: ivt
bios:
  vendor: Dell Inc.
  version: 2.2.5
  release_date: '09/08/2016'
processor:
  clock_speed: 2100000000
  instruction_set: x86-64
  model: Intel Xeon
  version: E5-2620 v4
  vendor: Intel
  other_description: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  cache_l1i: 32768
  cache_l1d: 32768
  cache_l2: 262144
  cache_l3: 20971520
  ht_capable: true
storage_devices:
  sda:
    device: sda
    by_id: "/dev/disk/by-id/wwn-0x6847beb0d535ed001fa67d1a12d0d135"
    by_path: "/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0"
    size: 598879502336
    model: PERC H330 Mini
    firmware_version: 4.26
    vendor: DELL

This is an example of the output file in API mode (g5k-checks launched with the -m api option).

Not all tests export data to this file. The following values are also checked:

  • Grid5000 standard environment version
  • Grid5000 post-install scripts version
  • Usage of sudo-g5k (the check fails if it was used, as it could be destructive to other parts of the system)
  • Correct mode of /tmp/
  • Fstab partitions mounted and valid
  • All partitions have the expected size, position, offset, mount options, ...
  • Correct KVM driver
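As an illustration of what such a check amounts to, the "/tmp mode" test can be sketched as a plain predicate (a hypothetical simplification, not the actual g5k-checks code): /tmp is expected to be world-writable with the sticky bit set.

```ruby
# Hypothetical sketch of the "/tmp mode" check: the directory must have
# octal permissions 1777 (world-writable, sticky bit set).
def tmp_mode_ok?(path = '/tmp')
  (File.stat(path).mode & 0o7777) == 0o1777
end
```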

Simple usage

Installation

G5kchecks is currently tested on Debian buster. To use the Grid'5000 Debian repository, add the following line to /etc/apt/sources.list:

deb http://packages-ext.grid5000.fr/deb/g5k-checks/buster /
Terminal.png node:
apt-get update

Install it:

Terminal.png node:
apt-get install g5kchecks

Get sources

git clone https://github.com/grid5000/g5k-checks.git

Run g5k-checks

If you want to check your node just run:

Terminal.png node:
g5k-checks -v

The output should highlight failing tests in red. Also, if some error occurred, g5k-checks puts a file in /var/lib/oar/checklogs/. For instance:

 root@adonis-3:~# g5k-checks
 root@adonis-3:~# ls /var/lib/oar/checklogs/
 OAR_Architecture_should_have_the_correct_number_of_thread

You can see the detail of the values checked this way:

 root@adonis-3:~# cat /var/lib/oar/checklogs/OAR_Architecture_should_have_the_correct_number_of_thread

Get node description

G5k-checks serves two purposes. It can check a node description against our reference API and detect errors, but it can also generate the data to populate this reference API.

If you want an exact node description, you can run:

Terminal.png node:
g5k-checks -m api

(If launched with -v, verbose mode, you will see that almost all tests fail; this is normal, as values are checked against an empty reference instead of the real one.)

g5k-checks then puts a JSON and a YAML file in /tmp/:

 root@adonis-3:~# g5k-checks -m api
 root@adonis-3:~# ls /tmp/
 adonis-3.grenoble.grid5000.fr.json  adonis-3.grenoble.grid5000.fr.yaml

Write your own checks/description

G5k-checks internal

G5k-checks is written in Ruby on top of the RSpec test framework. It gathers information from the Ohai program and compares it with Grid'5000 reference API data. RSpec is simple to read and write, so you can easily copy other checks and adapt them to your needs.
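The comparison step can be pictured with a short sketch (illustrative only; the real checks are written as RSpec examples, as shown in the following sections): walk the reference-API hash and report every leaf value that differs from the corresponding Ohai value.

```ruby
# Illustrative sketch of the core comparison: recursively walk the
# reference-API hash and collect every leaf value that does not match
# the corresponding Ohai value.
def compare(api, ohai, path = '', errors = [])
  if api.is_a?(Hash)
    api.each do |key, value|
      child = ohai.is_a?(Hash) ? ohai[key] : nil
      compare(value, child, "#{path}/#{key}", errors)
    end
  elsif api != ohai
    errors << "#{path}: api=#{api.inspect}, ohai=#{ohai.inspect}"
  end
  errors
end
```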

File tree is:

 ├── ohai # Ohai plugins; this information is later used by g5k-checks
 ├── rspec # Additional RSpec formatters (store information in different ways)
 ├── spec # Checks directory
 └── utils # Some useful classes

Play with ohai

Ohai is a small program that retrieves information from various files and programs on the host. It offers easy-to-parse JSON output. We can add information to this JSON simply by writing plugins. For instance, to add the version of bash to the description, you can create a small file /usr/lib/ruby/vendor_ruby/g5kchecks/ohai/package_version.rb with:

Ohai.plugin(:Packages) do

  provides "packages"

  collect_data do
      packages Mash.new
      # Query the installed version of the bash package
      packages[:bash] = `dpkg-query -W -f='${Version}' bash`
      packages
  end
end

Play with Rspec

RSpec is a framework for testing Ruby programs. G5k-checks uses RSpec not to test a Ruby program, but to test the host. RSpec is simple to read and write. For instance, to ensure that the bash version is the expected one, you can create a file /usr/lib/ruby/vendor_ruby/g5kchecks/spec/packages/packages_spec.rb with:

 describe "Packages" do
                                                                                                                                           
   before(:all) do                                                                                                                         
     @system = RSpec.configuration.node.ohai_description
   end
   
   it "bash should have the correct version" do
     puts @system[:packages][:bash].to_yaml
     bash_version = @system[:packages][:bash].strip                                                                                        
     bash_version.should eql("4.2+dfsg-0.1"), "#{bash_version}, 4.2+dfsg-0.1, packages, bash"                                              
   end
       
 end

Add checks

Example: I want to check if flag "acpi" is available on the processor:

Add to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:

 it "should have acpi" do
   acpi_ohai = @system[:cpu][:'0'][:flags].include?('acpi')
   acpi_ohai.should_not be_false, "#{acpi_ohai}, acpi flag missing, processor, acpi"
 end

Add information to the description

Example: I want to add the BogoMIPS value of the node.

First, we add the information to the Ohai description. To do this, add the following to the file ohai/cpu.rb after line 80:

    if line =~ /^BogoMIPS/
      cpu[:Bogo] = line.chomp.split(": ").last.lstrip
    end

Then we can retrieve the information and add it to the description. To do this, add the following to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:

    it "should have BogoMIPS" do
      bogo_ohai = @system[:cpu][:Bogo]
      # First value is from the system, second is from the API, third is the YAML path in the '/tmp/' file created in -m api mode.
      # Last argument is false to export the value in API mode, true to skip it.
      Utils.test(bogo_ohai, nil, 'processor/bogoMIPS', false) do |v_ohai, v_api, error_msg|
          expect(v_ohai).to eql(v_api), error_msg
      end
    end

Now you have the information in /tmp/mynode.mysite.grid5000.fr.yaml:

   root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# g5k-checks -m api
   root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# grep -C 3 bogo /tmp/graphene-100.nancy.grid5000.fr.yaml 
     ram_size: 16860348416
   processor:
     clock_speed: 2530000000
     bogoMIPS: 5053.74
     instruction_set: x86-64
     model: Intel Xeon
     version: X3440

Releasing and testing

Tests and reference-repository update

Before creating a new standard environment, g5k-checks can be tested on target environments using the jenkins test: https://intranet.grid5000.fr/jenkins/job/test_g5kchecksdev

This test can reserve either all nodes or as many as possible (targets cluster-ALL and cluster-BEST) on each cluster of Grid5000.

It will checkout a (configurable) branch of g5k-checks and test it against a (configurable) branch of the reference-api.

The test will fail if a mandatory test fails (i.e. if there are entries in /var/lib/oar/checklogs).

Also, the YAML output of the "-m api" option of g5k-checks will be written to the $HOME/g5k-checks-output directory of the ajenkins user on the target site.

Note: it is possible to change the branches of both reference-repository and g5k-checks for the test by configuring the jenkins test:

  cd /srv/jenkins-scripts && ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.test('$site_cluster', 'custom', 'dev_feature', 'dev_feature_refrepo')"

For example, this will take the 'dev_feature' branch of g5k-checks and test it against the data present in the 'dev_feature_refrepo' branch of the reference-api.

Updating the reference-repository

Once the tests are finished on the desired clusters, generated Yaml files must be imported manually.

  • In the reference repository, go to the generators/run-g5kchecks directory.
  • Now get the YAML files you want to include. For example:
rsync -va "rennes.adm:/home/ajenkins/g5k-checks-output/paravance*.yaml" ./output/

The output directory holds the temporary files that will be included as input in the reference-repository.

  • Then import YAML files into the reference-repository with:
rake g5k-checks-import SOURCEDIR=<path to the output dir>

If the values seem correct, generate the JSON and commit:
  rake reference-api
  git diff data/ 
  git add data input
  git commit -m"[SITE] g5k-checks updates"

Release a new version

Once the modifications have been tested on as many clusters as possible, a new version can be released.

See here for general instructions about the release workflow.

Environment update

The version of g5k-checks included in standard environment is defined in the following file:

steps/data/setup/puppet/modules/env/manifests/common/software_versions.pp

Once the environment is correct and its version updated, it can be generated with the automated jenkins job: https://intranet.grid5000.fr/jenkins/job/env_generate/

New environment release and reference-api update guidelines

The following procedure summarizes the steps taken to test and deploy a new environment with g5k-checks.

G5k-checks relies on the reference-api to check system data against it. The data in the reference-api must be up to date for the tests to succeed, but most of this data is generated by g5k-checks itself, creating a sort of 'circular dependency'. To avoid dead nodes, g5k-checks data from all nodes should be gathered before pushing a new environment.

  • Do a reservation of all nodes of G5K, for example: oarsub -t placeholder=maintenance -l nodes=BEST,walltime=06:00 -q admin -n 'Maintenance' -r '2017-08-31 09:00:00'

The reservation should happen early enough to ensure most (ideally all) of the resources will be available at that time.

  • Now g5k-checks should be run on all reserved nodes in 'api' mode in order to retrieve the yaml description that will be used to update the reference-api.

This step might be the most tedious one, but it can be done before the actual deployment. See #Tests and reference-repository update.

  • Commit and push these changes to the master branch of the reference-repository.

The jenkins job makes an OAR reservation of type 'destructive' that will force the deployment of the new environment.

  • If not all nodes were available at the time of the new g5k-checks data retrieval (which is often the case) or during the environment update, open a bug for all sites to let site administrators finish running g5k-checks on the remaining nodes.

Run G5k-checks on non-reservable nodes

It is common to need to update the reference-repository values of nodes whose state is 'Dead' in OAR.

An adaptation of the jenkins g5k-checks test allows running the same test without making an OAR reservation.

The only difference is that instead of using OAR to reserve nodes and the Kadeploy API to deploy, the nodes are given directly as arguments and kadeploy is called directly from the sites' frontends.

This script must be run on the jenkins machine:

cd /srv/jenkins-scripts 
ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.from_nodes_list()" grisou-{15,16,18}.nancy.grid5000.fr

Once done, the procedure is the same as described in #Updating the reference-repository.

Run/fix g5k-checks on a new node of a cluster that is not fully operational yet

You may have to work on adapting g5k-checks for a new type of node of a cluster that is not fully integrated in Grid'5000, e.g. not in the API and kadeploy yet.

Even if kadeploy is not fully operational, you should have a "std" environment running on the node, in order to be as close as possible to what g5k-checks expects in terms of OS.

Then, you can try to run g5k-checks -m api in order to generate a YAML description of the node (the YAML file is generated in /tmp). Of course, for a new node this may not work out of the box: adapting g5k-checks may already be needed at this first stage.

Then, you can add the node description to the reference API, using rake g5k-checks-import SOURCEDIR=DIR in a test branch (e.g. named TESTBRANCH), and do what is necessary to get that branch published in the API:

rake valid:schema
rake valid:duplicates
rake reference-api

Now you should be able to fetch the API info for the node using curl -s "https://api.grid5000.fr/stable/sites/toulouse/clusters/NEWCLUSTER/nodes/NEWCLUSTER-N?branch=TESTBRANCH" | jq

On the node, you should now be able to test g5k-checks against the API information. To do so, first modify /etc/g5k-checks.conf to set the branch to TESTBRANCH. Then run g5k-checks -k -v (-k disables the kadeploy status check).

At this stage again, adapting g5k-checks may be needed, possibly directly on the node for quick tests.

Fixes are made in git in a test branch: once the CI passes, packages are available as artifacts in the CI pipeline and can initially be installed on the node with 'dpkg -i ...'.