G5k-checks
Description
Overview
- g5k-checks is expected to be integrated into the standard environment of the Grid'5000 computational nodes. It checks that a node meets several basic requirements before it declares itself as available to the OAR server.
- It also lets the admins enable additional checkers that may be very specific to a cluster's hardware.
Architecture
G5kchecks is based on the rspec test suite. Using rspec this way is a bit of a detour from its primary purpose, which is testing programs: here we use it to test all the node characteristics. The first step is to retrieve the node information with ohai. By default, ohai provides a large set of characteristics of the machine. In addition, we have developed some plugins to fill in missing information (particularly for the disks, the CPU and the network). The second step is to compare those characteristics with the grid5000 Reference_Repository. To do that, g5kchecks takes each value of the API and compares it with the value given by ohai. If the values don't match, an error is raised through the rspec process.
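As a rough illustration of that comparison (not the actual g5k-checks code), one can manually compare a value reported by ohai on a node with the corresponding value in the reference API; SITE, CLUSTER and NODE below are placeholders:
ohai | jq '.cpu.total'    # number of logical CPUs as measured on the node by ohai
curl -s https://api.grid5000.fr/stable/sites/SITE/clusters/CLUSTER/nodes/NODE | jq '.architecture.nb_threads'    # expected value in the reference API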
OAR
systemd is used to start OAR on nodes: a single service is in charge of managing the sshd daemon (to start or stop it) and of launching the oar-node script (/etc/oar/oar-node-service) with an argument (which must be 'start' or 'stop').
- The oar-node flavour of the OAR installation (/etc/oar/oar-node-service) launches /usr/lib/oar/oarnodecheckrun, which then runs the executable file /etc/oar/check.d/start_g5kchecks. The OAR server periodically invokes /usr/bin/oarnodecheckquery remotely. This command returns with status 1 if /var/lib/oar/checklogs/ is not empty, 0 otherwise. So if /etc/oar/check.d/start_g5kchecks finds something wrong, it simply has to create a log file in that directory (a manual check sketch is shown below).
- If oarnodecheckquery fails, then the node is not ready to start, and it loops on running those scripts until either oarnodecheckquery returns 0 or a timeout is reached. If the timeout is reached, it does not attempt to declare the node as "Alive".
This summarizes when g5kchecks is run:
- At oar-node service start, with /etc/oar/oar-node-service
- Between (non-deploy) jobs, with remote execution of oarnodecheckrun and oarnodecheckquery (in case of deploy jobs, the first type of execution takes place)
- Launched manually by a user (for now, never happens)
G5kchecks is never run during user jobs.
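To illustrate how these pieces fit together, here is a minimal manual check sketch to run on a node, assuming the OAR paths described above (paths may differ depending on the OAR configuration):
/usr/lib/oar/oarnodecheckrun           # runs the scripts in /etc/oar/check.d/, including start_g5kchecks
/usr/bin/oarnodecheckquery ; echo $?   # prints 1 if a checklog file was created, 0 otherwise
ls /var/lib/oar/checklogs/             # any file here names the failing check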
Checks Overview
The following values are checked by g5k-checks:
# Generated by g5k-checks (g5k-checks -m api)
---
network_adapters:
  bmc:
    ip: 172.17.52.9
    mac: 18:66:da:7c:96:1a
    management: true
  eno1:
    name: eno1
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:16
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno2:
    name: eno2
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:17
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno3:
    name: eno3
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:18
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  eno4:
    name: eno4
    interface: Ethernet
    driver: tg3
    mac: 18:66:da:7c:96:19
    rate: 0
    firmware_version: FFV20.2.17 bc 5720-v1.39
    model: NetXtreme BCM5720 Gigabit Ethernet PCIe
    vendor: Broadcom
    mounted: false
    management: false
  enp5s0f0:
    name: enp5s0f0
    interface: Ethernet
    ip: 172.16.52.9
    driver: ixgbe
    mac: a0:36:9f:ce:e4:24
    rate: 10000000000
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: true
    management: false
  enp5s0f1:
    name: enp5s0f1
    interface: Ethernet
    driver: ixgbe
    mac: a0:36:9f:ce:e4:26
    rate: 0
    firmware_version: '0x800007f5'
    model: Ethernet 10G 2P X520 Adapter
    vendor: Intel
    mounted: false
    management: false
operating_system:
  ht_enabled: true
  pstate_driver: intel_pstate
  pstate_governor: performance
  turboboost_enabled: true
  cstate_driver: intel_idle
  cstate_governor: menu
architecture:
  platform_type: x86_64
  nb_procs: 2
  nb_cores: 16
  nb_threads: 32
chassis:
  serial: 7W26RG2
  manufacturer: Dell Inc.
  name: PowerEdge R430
main_memory:
  ram_size: 68719476736
supported_job_types:
  virtual: ivt
bios:
  vendor: Dell Inc.
  version: 2.2.5
  release_date: '09/08/2016'
processor:
  clock_speed: 2100000000
  instruction_set: x86-64
  model: Intel Xeon
  version: E5-2620 v4
  vendor: Intel
  other_description: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  cache_l1i: 32768
  cache_l1d: 32768
  cache_l2: 262144
  cache_l3: 20971520
  ht_capable: true
storage_devices:
  sda:
    device: sda
    by_id: "/dev/disk/by-id/wwn-0x6847beb0d535ed001fa67d1a12d0d135"
    by_path: "/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0"
    size: 598879502336
    model: PERC H330 Mini
    firmware_version: 4.26
    vendor: DELL
This is an example of the output file in API mode (g5k-checks launched with the -m api option).
In addition, not all tests export their data to this file. The following values are also checked:
- Grid5000 standard environment version
- Grid5000 post-install scripts version
- Usage of sudo-g5k (the check fails if it was used, as it could be destructive to other parts of the system)
- Correct mode of /tmp/ (see the sketch after this list)
- Fstab partitions mounted and valid
- All partitions have expected size, position, offset, mount options, ...
- Correct KVM driver
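For instance, the /tmp/ mode item above roughly corresponds to this manual check (an illustration, not the exact g5k-checks code; 1777 is the usual mode for a world-writable sticky /tmp):
stat -c %a /tmp    # expected: 1777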
Simple usage
Installation
G5kchecks is currently tested on Debian buster. To use the Grid'5000 Debian repository, just add the following line to /etc/apt/sources.list:
deb http://packages-ext.grid5000.fr/deb/g5k-checks/buster /
Install it:
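A minimal sketch, assuming the package published in that repository is named g5k-checks:
apt-get update
apt-get install g5k-checks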
Get sources
git clone https://github.com/grid5000/g5k-checks.git
Run g5k-checks
If you want to check your node, just run g5k-checks (as shown below).
The output should highlight the tests in error in red. Also, if some error occurred, g5k-checks puts a file in /var/lib/g5kchecks/. For instance:
root@adonis-3:~# g5k-checks
root@adonis-3:~# ls /var/lib/oar/checklogs/
OAR_Architecture_should_have_the_correct_number_of_thread
You can see the detail of the values checked this way:
root@adonis-3:~# cat /var/lib/oar/checklogs/OAR_Architecture_should_have_the_correct_number_of_thread
Get node description
G5k-checks has a double purpose: it can check a node description against our reference API and detect errors, but it can also generate the data to populate this reference API.
If you want an exact node description, you can run g5k-checks -m api (as shown below).
(If launched with the -v verbose mode, you can see that almost all tests fail; this is normal, as empty values are checked instead of the real ones.)
Then g5k-checks puts a JSON and a YAML file in /tmp/:
root@adonis-3:~# g5k-checks -m api
root@adonis-3:~# ls /tmp/
adonis-3.grenoble.grid5000.fr.json  adonis-3.grenoble.grid5000.fr.yaml
Write your own checks/description
G5k-checks internals
G5k-checks is written in Ruby on top of the rspec test framework. It gathers information from the ohai program and compares it with Grid'5000 reference API data. Rspec is simple to read and write, so you can easily copy other checks and adapt them to your needs.
File tree is:
├── ohai   # Ohai plugins; this information is used by g5k-checks afterwards
├── rspec  # Rspec formatters (store the information in different ways)
├── spec   # Checks directory
└── utils  # Some useful classes
Play with ohai
Ohai is a small program that retrieves information from various files and other programs on the host. It offers easy-to-parse JSON output. We can add information to this JSON simply by writing plugins. For instance, if we want to add the version of bash to the description, we can create a small file /usr/lib/ruby/vendor_ruby/g5kchecks/ohai/package_version.rb with:
Ohai.plugin(:Packages) do
  provides "packages"
  collect_data do
    packages Mash.new
    packages[:bash] = `dpkg -l | grep bash | awk '{print $3}'`
    packages
  end
end
Play with Rspec
Rspec is a framework for testing Ruby programs. G5k-checks uses Rspec not to test a Ruby program, but to test the host. Rspec is simple to read and write. For instance, if we want to ensure that the bash version is the expected one, we can create a file /usr/lib/ruby/vendor_ruby/g5kchecks/spec/packages/packages_spec.rb with:
describe "Packages" do before(:all) do @system = RSpec.configuration.node.ohai_description end it "bash should should have the good version" do puts @system[:packages][:bash].to_yaml bash_version = @system[:packages][:bash].strip bash_version.should eql("4.2+dfsg-0.1"), "#{bash_version}, 4.2+dfsg-0.1, packages, bash" end end
Add checks
Example: I want to check if the "acpi" flag is available on the processor:
Add to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:
it "should have apci" do acpi_ohai = @system[:cpu][:'0'][:flags].include?('acpi') acpi_ohai.should_not be_false, "#{acpi_ohai}, is not acpi, processor, acpi" end
Add information to the description
Example: I want to add the BogoMIPS of the node:
First, we should add the information to the ohai description. To do this, we add the following in the file ohai/cpu.rb, after line 80:
if line =~ /^BogoMIPS/
  cpu[:Bogo] = line.chomp.split(": ").last.lstrip
end
Then we can retrieve the information and add it to the description. To do this, we add the following in /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:
it "should have BogoMIPS" do bogo_ohai = @system[:cpu][:Bogo] #First value is system, second is from API, thirs is the YAML path in the created '/tmp/' file for -m api mode. #Last argument is false to export value in API mode, true to skip Utils.test(bogo_ohai, nil, 'processor/bogoMIPS', false) do |v_ohai, v_api, error_msg| expect(v_ohai).to eql(v_api), error_msg end end
Now you have the information in /tmp/mynode.mysite.grid5000.fr.yaml:
root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# g5k-checks -m api
root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# grep -C 3 bogo /tmp/graphene-100.nancy.grid5000.fr.yaml
  ram_size: 16860348416
processor:
  clock_speed: 2530000000
  bogoMIPS: 5053.74
  instruction_set: x86-64
  model: Intel Xeon
  version: X3440
Releasing and testing
Tests and reference-repository update
Before creating a new standard environment, g5k-checks can be tested on target environments using the jenkins test: https://intranet.grid5000.fr/jenkins/job/test_g5kchecksdev
This test can reserve either all nodes or as many nodes as possible (targets cluster-ALL and cluster-BEST) on each cluster of Grid5000.
It will check out a (configurable) branch of g5k-checks and test it against a (configurable) branch of the reference-api.
The test will fail if a mandatory check fails (i.e. there are entries in /var/lib/oar/checklogs).
Also, the Yaml output of the "-m api" option of g5k-checks will be written to the $HOME/g5k-checks-output directory of the ajenkins user on the target site.
Note: it is possible to change the branches of both reference-repository and g5k-checks for the test by configuring the jenkins test:
cd /srv/jenkins-scripts && ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.test('$site_cluster', 'custom', 'dev_feature', 'dev_feature_refrepo')"
For example, this will take the 'dev_feature' branch of g5k-checks and test it against the data present in the 'dev_feature_refrepo' branch of the reference-api.
Updating the reference-repository
Once the tests are finished on the desired clusters, the generated Yaml files must be imported manually.
- In the reference repository, go in the generators/run-g5kchecks directory.
- Now get the yaml files you want to include. For example:
rsync -va "rennes.adm:/home/ajenkins/g5k-checks-output/paravance*.yaml" ./output/
The output directory holds the temporary files that will be included as input in the reference-repository.
- Then import YAML files into the reference-repository with:
rake g5k-checks-import SOURCEDIR=<path to the output dir>
If values seem correct, generate JSON and commit:
rake reference-api
git diff data/
git add data input
git commit -m "[SITE] g5k-checks updates"
Release a new version
Once modifications are tested correct on a maximum of clusters, a new version can be released.
See TechTeam:Git_Packaging_and_Deployment for general instructions about the release workflow.
Environment update
The version of g5k-checks included in the standard environment is defined in the following file:
steps/data/setup/puppet/modules/env/manifests/common/software_versions.pp
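For instance, to locate the pinned version before bumping it (an illustrative command, run from a checkout of the environment recipes):
grep -n g5k-checks steps/data/setup/puppet/modules/env/manifests/common/software_versions.pp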
Once the environment is correct and its version updated, it can be generated with the automated jenkins job: https://intranet.grid5000.fr/jenkins/job/env_generate/
New environment release and reference-api update guidelines
The following procedure summarizes the steps taken to test and deploy a new environment with g5k-checks.
G5k-checks relies on the reference-api to check system data against it. Data from the reference-api must be up-to-date for tests to succeed but most of this data is generated by g5k-checks itself, creating a sort of 'circular dependency'. To avoid dead nodes, g5k-checks data from all nodes should be gathered before pushing a new environment.
- Do a reservation of all nodes of G5K, for example: oarsub -t placeholder=maintenance -l nodes=BEST,walltime=06:00 -q admin -n 'Maintenance' -r '2017-08-31 09:00:00'
The reservation should happen early enough to ensure most (ideally all) of the resources will be available at that time.
- Prepare and release a new debian package of g5k-checks (see #Release a new version)
- Prepare a new standard environment with this new g5k-checks version (see #Environment update)
- Now g5k-checks should be run on all reserved nodes in 'api' mode in order to retrieve the yaml description that will be used to update the reference-api.
This step might be the most tedious one but can be done before the actual deployment. See #Tests and reference-repository update
- Commit and push these changes to the master branch of the reference-repository
- Soon after, push the new environment version to all sites using the automated jenkins job: https://intranet.grid5000.fr/jenkins/job/env_push/
The jenkins job does an OAR reservation of type 'destructive' that will force the deployment of the new environment.
- If not all nodes were available at the time of the new g5k-checks data retrieval (which is often the case) or during the environment update, open a bug for all sites to let site administrators finish running g5k-checks on the remaining nodes.
Run G5k-checks on non-reservable nodes
It is common to have to update the reference-repository values of nodes whose state is 'Dead' in OAR.
An adaptation of the jenkins g5k-checks test has been made to allow running the same test without doing an OAR reservation.
The only difference is that instead of using OAR to reserve nodes and the Kadeploy API to deploy, the nodes are given directly as arguments and kadeploy is called directly from the sites' frontends.
This script must be run on the jenkins machine:
cd /srv/jenkins-scripts
ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.from_nodes_list()" grisou-{15,16,18}.nancy.grid5000.fr
Once done, the procedure is the same as described in #Updating the reference-repository.
Run/fix g5k-checks on a new node of a cluster that is not fully operational yet
You may have to work on adapting g5k-checks for a new type of node of a cluster that is not fully integrated in Grid'5000, e.g. not in the API and kadeploy yet.
Even if kadeploy is not fully operational, you should have a "std" environment running on the node, in order to be as close as possible to what g5k-checks expects in terms of OS.
Then, you can try and run g5k-checks -m api in order to generate a yaml description of the node (yaml file generated in /tmp). Of course for a new node, this may not work out of the box: adaptation of g5k-checks may be needed at that first stage already.
Then, you can add the node description in the reference API, using rake g5k-checks-import=DIR in a test branch (e.g. named TESTBRANCH), and do the necessary to get that branch published in the API:
rake valid:schema
rake valid:duplicates
rake reference-api
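A rough end-to-end sketch of those steps in a test branch (the branch name, commit message and remote are illustrative, and how the branch actually gets published to the API depends on the reference-repository setup):
git checkout -b TESTBRANCH
rake g5k-checks-import SOURCEDIR=<path to the g5k-checks output dir>
rake valid:schema && rake valid:duplicates && rake reference-api
git add data input
git commit -m "[SITE] add NEWCLUSTER node description"
git push origin TESTBRANCH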
Now you should be able to fetch the API info for the node using curl -s "https://api.grid5000.fr/stable/sites/toulouse/clusters/NEWCLUSTER/nodes/NEWCLUSTER-N?branch=TESTBRANCH" | jq
On the node, you should now be able to test g5k-checks against the API information. To do so, first modify /etc/g5k-checks.conf to set the branch to TESTBRANCH. Then run g5k-checks -k -v (-k disables the kadeploy status check).
At this stage again, adapting g5k-checks may be needed, possibly directly on the node for quick tests.
Fix it in git in a test branch: once the CI is OK, packages are available as artifacts in the CI pipeline, which can be installed on the node with 'dpkg -i ...' initially.
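For example (hypothetical package and node names, to adapt to the actual pipeline artifacts):
scp g5k-checks_*.deb root@NEWCLUSTER-N:/tmp/
ssh root@NEWCLUSTER-N "dpkg -i /tmp/g5k-checks_*.deb"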