G5k-checks
Revision as of 18:34, 14 June 2017
Description
Overview
- g5k-checks is expected to be integrated into the standard environment of the Grid'5000 computational nodes. It checks that a node meets several basic requirements before it declares itself as available to the OAR server.
- This lets the admins enable some checkers which may be very specific to the hardware of a cluster.
Architecture
G5kchecks is based on the RSpec test suite. RSpec is diverted slightly from its primary mission, which is to test a program: we use it to test all node characteristics. The first step is to retrieve node information with ohai. By default, ohai provides a large set of characteristics of the machine. In addition, we have developed some plugins to fill in missing information (particularly for the disk, the CPU and the network). The second step is to compare those characteristics with the Grid'5000 Reference_Repository. To do that, g5kchecks takes each value from the API and compares it with the value given by ohai. If the values don't match, an error is raised via the RSpec process.
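The comparison step can be illustrated with a minimal sketch. It is not the g5k-checks implementation: the `mismatches` helper and the hash keys are hypothetical, and plain hashes stand in for the ohai description and the reference API data.

```ruby
# Hypothetical helper (not part of g5k-checks): compares values collected on
# the node with reference values. Both arguments are plain hashes.
def mismatches(collected, reference)
  reference.each_with_object([]) do |(key, expected), errors|
    actual = collected[key]
    # Record "actual, expected, key" for every value that differs.
    errors << "#{actual}, #{expected}, #{key}" unless actual == expected
  end
end

# Illustrative data only.
collected = { "nb_threads" => 16, "nb_cores" => 8 }
reference = { "nb_threads" => 8, "nb_cores" => 8 }
puts mismatches(collected, reference)
```

In g5k-checks itself each comparison is an RSpec example, so a mismatch shows up as a failed example rather than as an entry in a returned list.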
oar
- The oar-node flavour of OAR installation /etc/default/oar-node is started at boot time. It launches /usr/lib/oar/oarnodecheckrun, which then runs the executable file /etc/oar/check.d/start_g5kchecks. The OAR server periodically invokes /usr/bin/oarnodecheckquery remotely. This command returns with status 1 if /var/lib/oar/check.d/ is not empty, 0 otherwise. So if /etc/oar/check.d/start_g5kchecks finds something wrong, it simply has to create a log file in that directory.
- If oarnodecheckquery fails, then the node is not ready to start, and it loops on running those scripts until either oarnodecheckquery returns 0 or a timeout is reached. If the timeout is reached, then it does not attempt to declare the node as "Alive".
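The "status 1 if the directory is not empty" rule and the retry loop can be sketched as follows. This is an illustration under assumptions, not the OAR scripts themselves: `check_query_status` and `wait_until_healthy` are hypothetical names, the timeout and interval values are arbitrary, and a temporary directory stands in for the real check directory.

```ruby
require "tmpdir"
require "fileutils"

# Returns 1 when the check directory contains at least one entry, 0 otherwise.
def check_query_status(check_dir)
  Dir.children(check_dir).empty? ? 0 : 1
end

# Runs the given block (the checks), then queries the directory; repeats until
# the query returns 0 or the deadline passes. Returns true when the node may
# be declared "Alive", false when the timeout was reached.
def wait_until_healthy(check_dir, timeout:, interval: 1)
  deadline = Time.now + timeout
  loop do
    yield
    return true if check_query_status(check_dir).zero?
    return false if Time.now >= deadline
    sleep interval
  end
end

Dir.mktmpdir do |dir|
  log = File.join(dir, "some_failed_check")
  FileUtils.touch(log)
  attempts = 0
  # Simulated check run: the failure disappears on the second attempt.
  ok = wait_until_healthy(dir, timeout: 5, interval: 0.1) do
    attempts += 1
    FileUtils.rm_f(log) if attempts == 2
  end
  puts "alive: #{ok} after #{attempts} attempts"
end
```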
This summarizes when g5kchecks is run:
- At service start with /etc/default/oar-node
- Between (non-deploy) jobs with remote execution of oarnodecheckrun and oarnodecheckquery (in case of deploy jobs, the first type of execution takes place)
- Launched manually by the user
Checks Overview
legend
:-) | means |
---|---|
![]() | no test |
![]() | test |
![]() | test, but does not work on every cluster |
![]() | unknown whether it can be tested |
g5k-parts
g5k-parts is designed to run at both phases of g5k-checks (see above).
- In Phase 1, g5k-parts validates the partitioning of a Grid'5000 computational node against the G5K Node Storage convention: all partitions but /tmp are primary, and /tmp is a logical partition inside the only extended partition.
- It first compares /etc/fstab with its backup generated at deployment time. When errors are found at this level, /etc/fstab is reset and the machine reboots.
- Then, for every partition given on the command line, it first matches its geometry on the hard drive with the partition layout saved at deployment time. In the new g5kchecks, no formatting is performed after an error (this is left to charon).
Clock
G5kchecks ensures that the node clock is correct by performing three steps:
- stop the NTP client
- synchronize with the NTP server of the site
- start the client
If the OS clock differs from the hardware clock, g5kchecks writes the correct time to the hardware clock. This ensures that the hardware clock is right and was not changed by another user during a previous deployment.
Virtual Hardware
ref API | check ? | comment(s) |
---|---|---|
supported_job_types_virtual | | |
Architecture
ref API | check ? | comment(s) |
---|---|---|
architecture_platform_type | | platform type (x86_64 ...) |
architecture_nb_procs | | number of procs |
architecture_nb_cores | | number of cores |
architecture_nb_threads | | number of threads |
Bios
ref API | check ? | comment(s) |
---|---|---|
bios_version | | |
bios_vendor | | |
bios_release_date | | |
BMC
ref API | check ? | comment(s) |
---|---|---|
network_adapters_bmc_ip | | Possible, but ipmitool is not present in the standard environment |
network_adapters_bmc_mac | | Possible, but ipmitool is not present in the standard environment |
network_adapters_bmc_managment | | Possible, but ipmitool is not present in the standard environment |
Chassis
ref API | check ? | comment(s) |
---|---|---|
chassis_serial_number | | |
chassis_manufacturer | | |
chassis_product_name | | |
Disk
ref API | check ? | comment(s) |
---|---|---|
storage_devices_*_device | | |
storage_devices_*_size | | |
storage_devices_*_model | | |
storage_devices_*_rev | | |
storage_devices_*_driver | | |
storage_devices_*_interface | | |
storage_devices_*_by_id | | |
Memory
ref API | check ? | comment(s) |
---|---|---|
main_memory_ram_size | | |
Network
ref API | check ? | comment(s) |
---|---|---|
network_adapters_*_device | | |
network_adapters_*_interface | | |
network_adapters_*_ip4 | | |
network_adapters_*_ip6 | | |
network_adapters_*_switch | | |
network_adapters_*_switch_port | | |
network_adapters_*_bridged | | |
network_adapters_*_driver | | |
network_adapters_*_mac | | |
network_adapters_*_guid | | |
network_adapters_*_rate | | |
network_adapters_*_version | | |
network_adapters_*_vendor | | |
network_adapters_*_mounted | | |
network_adapters_*_management | | |
OS
ref API | check ? | comment(s) |
---|---|---|
operating_system_name | | |
operating_system_kernel | | |
operating_system_version | | |
Processor
ref API | check ? | comment(s) |
---|---|---|
processor_clock_speed | | |
processor_instruction_set | | |
processor_model | | |
processor_version | | |
processor_vendor | | |
processor_description | | |
processor_cache_l2 | | |
processor_cache_l3 | | |
processor_cache_l1 | | |
processor_cache_l1d | | |
turboboost_enabled | | |
Simple usage
Installation
G5kchecks has been tested on wheezy and jessie. It is available from the Grid'5000 Debian repository; just add the following line to /etc/apt/sources.list:
deb http://apt.grid5000.fr/debian sid main
Get grid5000 keyring (A5ED59A7AF7F6E3B):
Install it:
Get sources
git clone https://github.com/grid5000/g5k-checks.git
Run g5k-checks
If you want to check your node just run:
If an error occurs, g5k-checks writes a file in /var/lib/oar/checklogs/. For instance:
root@adonis-3:~# g5k-checks
root@adonis-3:~# ls /var/lib/oar/checklogs/
OAR_Architecture_should_have_the_correct_number_of_thread
root@adonis-3:~# cat /var/lib/oar/checklogs/OAR_Architecture_should_have_the_correct_number_of_thread
{"started_at":"2013-09-25 15:07:16 +0200","exception":"16, 8, architecture, nb_threads", "status":"failed","finished_at":"2013-09-25 15:07:16 +0200","run_time":0.000155442}
This means that adonis-3 does not have the expected number of threads (nb_threads is 16 instead of 8).
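A check log like the one above can be read programmatically. The sketch below is an assumption-laden illustration: `parse_check_log` is a hypothetical helper, and the order of the comma-separated fields in "exception" (value found on the node, reference value, category, key) is inferred from this single example rather than from a documented format.

```ruby
require "json"

# Hypothetical helper: splits the "exception" field of a check log into its
# four comma-separated parts. The naive split would break on values that
# themselves contain commas.
def parse_check_log(json_text)
  entry = JSON.parse(json_text)
  found, expected, category, key = entry.fetch("exception").split(",").map(&:strip)
  { status: entry["status"], found: found, expected: expected, category: category, key: key }
end

log = '{"started_at":"2013-09-25 15:07:16 +0200","exception":"16, 8, architecture, nb_threads", "status":"failed","finished_at":"2013-09-25 15:07:16 +0200","run_time":0.000155442}'
result = parse_check_log(log)
puts "#{result[:category]}/#{result[:key]}: found #{result[:found]}, expected #{result[:expected]}"
```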
Get node description
If you want an exact node description, you can run:
Then g5k-checks puts a JSON and a YAML file in /tmp/
root@adonis-3:~# g5k-checks -m api
root@adonis-3:~# ls /tmp/
adonis-3.grenoble.grid5000.fr.json adonis-3.grenoble.grid5000.fr.yaml lost+found
Write your own checks/description
G5k-checks internal
G5k-checks is written in Ruby on top of the RSpec test framework. It gathers information from the ohai program and compares it with the Grid'5000 reference API data. RSpec is simple to read and write, so you can easily copy other checks and adapt them to your needs.
On Debian, installed files are stored in /usr/lib/ruby/vendor_ruby/g5kchecks. The tree is:
├── ohai   # Adds information to ohai; this information is then used by g5k-checks
├── rspec  # Adds RSpec formatters (store information in different ways)
├── spec   # Checks directory
└── utils  # Some useful classes
Play with ohai
Ohai is a small program that retrieves information from different files and other programs on the host. It produces easy-to-parse JSON output. Information can be added to this JSON simply by writing plugins. For instance, to add the version of bash to the description, you can create a small file /usr/lib/ruby/vendor_ruby/g5kchecks/ohai/package_version.rb with:
provides "packages"
packages Mash.new
packages[:bash] = `dpkg -l | grep bash | awk '{print $3}'`
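Note that `grep bash` also matches other packages whose name or description contains "bash", so the command above may return several versions. The following sketch shows one possible way to pick the version of exactly one package from `dpkg -l`-style output. The `package_version` helper is hypothetical and the sample text is made up for illustration; it is not taken from a real node.

```ruby
# Hypothetical helper: extracts the version of one exact package from
# `dpkg -l`-style output (columns: status, name, version, ...).
def package_version(dpkg_output, name)
  dpkg_output.each_line do |line|
    status, pkg, version = line.split
    # Installed packages start with "ii"; the name may carry an ":arch" suffix.
    return version if status == "ii" && pkg.to_s.sub(/:.*\z/, "") == name
  end
  nil
end

# Illustrative sample, not real output.
sample = <<~OUT
  ii  bash             4.2+dfsg-0.1   amd64  GNU Bourne Again SHell
  ii  bash-completion  1:2.0-1        all    programmable completion for the bash shell
OUT
puts package_version(sample, "bash")
```

In a plugin, the same function could be fed with the output of `dpkg -l` captured via backticks, as in the original snippet.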
Play with Rspec
RSpec is a framework for testing Ruby programs. G5k-checks uses RSpec not to test a Ruby program, but to test the host. RSpec is simple to read and write. For instance, to ensure that the bash version is the expected one, you can create a file /usr/lib/ruby/vendor_ruby/g5kchecks/spec/packages/packages_spec.rb with:
describe "Packages" do
  before(:all) do
    @system = RSpec.configuration.node.ohai_description
  end

  it "bash should have the correct version" do
    puts @system[:packages][:bash].to_yaml
    bash_version = @system[:packages][:bash].strip
    bash_version.should eql("4.2+dfsg-0.1"), "#{bash_version}, 4.2+dfsg-0.1, packages, bash"
  end
end
Add checks
Example: I want to check if flag "acpi" is available on the processor:
Add to /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:
it "should have acpi" do
  acpi_ohai = @system[:cpu][:'0'][:flags].include?('acpi')
  acpi_ohai.should_not be_false, "#{acpi_ohai}, is not acpi, processor, acpi"
end
Add information to the description
Example: I want to add the BogoMIPS of the node:
First, we should add the information to the ohai description. To do this, we add the following in /usr/lib/ruby/vendor_ruby/g5kchecks/ohai/cpu.rb at line 58:
if line =~ /^BogoMIPS/
  cpu[:Bogo] = line.chomp.split(": ").last.lstrip
end
Then we can retrieve information and add it to the description. To do this we add in /usr/lib/ruby/vendor_ruby/g5kchecks/spec/processor/processor_spec.rb:
it "should have BogoMIPS" do
  bogo_ohai = @system[:cpu][:Bogo]
  bogo_ohai.should be_nil, "#{bogo_ohai}, don't have information, processor, bogoMIPS"
end
Now you have the information in /tmp/mynode.mysite.grid5000.fr.yaml:
root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# g5k-checks -m api
root@graphene-100:/usr/lib/ruby/vendor_ruby/g5kchecks# grep -C 3 bogo /tmp/graphene-100.nancy.grid5000.fr.yaml
  ram_size: 16860348416
processor:
  clock_speed: 2530000000
  bogoMIPS: 5053.74
  instruction_set: x86-64
  model: Intel Xeon
  version: X3440
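The line-matching step added to cpu.rb can be tried in isolation before touching the plugin. The sketch below assumes input lines of the form "BogoMIPS: value"; the `bogomips` helper and the sample text are hypothetical and only mirror the snippet shown earlier.

```ruby
# Hypothetical helper: returns the BogoMIPS value from "Key: value" lines,
# using the same matching and splitting as the plugin snippet.
def bogomips(text)
  text.each_line do |line|
    return line.chomp.split(": ").last.lstrip if line =~ /^BogoMIPS/
  end
  nil
end

# Illustrative sample input.
sample = "Model name:          Intel(R) Xeon(R) CPU X3440\nBogoMIPS:            5053.74\n"
puts bogomips(sample)
```

The match is case-sensitive, so a source that spells the key differently (for example in lower case) would need an adjusted pattern.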
Releasing and testing
Tests
Before creating a new standard environment, g5k-checks can be tested on target environments using the jenkins tests: https://intranet.grid5000.fr/jenkins/job/test_g5kchecks
This test can reserve all or the maximum possible nodes (targets cluster-BEST) on each cluster of Grid5000.
It will checkout a (configurable) branch of g5k-checks and test it against a (configurable) branch of the reference-api.
The test will fail if a mandatory test fails (i.e. there are entries in /var/lib/oar/checklogs).
Also, the Yaml output of the "-m api" mode will be written to the $HOME/g5k-checks-output directory of the ajenkins user on the target site.
Note: it is possible to change the branches of both reference-repository and g5k-checks for the test by configuring the jenkins test:
cd /srv/jenkins-scripts && ruby -Ilib -rg5kcheckstester -e "G5KChecksTester::new.test('$site_cluster', 'dev_feature', 'dev_feature_refrepo')"
For example, this will take the 'dev_feature' branch of g5k-checks and test it against the data present in the 'dev_feature_refrepo' branch of the reference-api.
Release a new version
Once the modifications have been tested successfully, a new version must be released.
Rake tasks are provided to ease this process.
The first step is to increase the version number with those rake tasks: rake package:bump:*
Then the debian package can be created using this task:
rake package:build
And finally the debian package can be built and published on the Grid'5000 apt repository:
rake package:publish