Grid5000 Metadata Bundler
Latest revision as of 10:27, 22 July 2021
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
This page summarizes what you need to know about g5k-metadata-bundler.
Introduction
When running experiments on Grid'5000, users generate metadata across multiple services. This metadata is useful for reproducibility purposes or scientific dissemination.
The g5k-metadata-bundler is a tool designed to retrieve metadata from all the different services and bundle it in a single archive. The bundler only retrieves metadata generated by Grid'5000 services; collecting the data generated by the user's experiment is beyond the scope of this application.
Warning: This service is in alpha version and not yet feature complete. See the section on Planned Evolutions.
Usage
G5k-metadata-bundler is installed on every site frontend in Grid'5000. It can only be executed from the site frontends.
    g5k-metadata-bundler -s SITE -j JOBID [-o NAME]
        -v, --version         Print g5k-metadata-bundler version
        -s, --job-site SITE   [MANDATORY] Grid'5000 site from which to extract
        -j, --job-id JID      [MANDATORY] Job id of the OAR job to extract
        -o, --output NAME     Bundle name to use for the directory/archive
Users do not need to run the bundler on the frontend of the site where the job was executed.
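For scripted use, the invocation can also be assembled programmatically. The following is a minimal Python sketch, assuming it runs on a site frontend where g5k-metadata-bundler is available on the PATH. The helper names build_bundler_command and run_bundler are illustrative and not part of the tool; only the options documented above are used.

```python
import subprocess
from typing import List, Optional


def build_bundler_command(site: str, job_id: int,
                          output: Optional[str] = None) -> List[str]:
    """Return the argument list for one bundler call (hypothetical helper)."""
    cmd = ["g5k-metadata-bundler", "-s", site, "-j", str(job_id)]
    if output is not None:
        # -o is optional; without it the bundler picks the default name.
        cmd += ["-o", output]
    return cmd


def run_bundler(site: str, job_id: int, output: Optional[str] = None) -> None:
    """Run the bundler. Assumption: executed on a Grid'5000 site frontend."""
    subprocess.run(build_bundler_command(site, job_id, output), check=True)
```

For example, build_bundler_command("nancy", 3003030) yields the same command line as the one typed in the example session on this page.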
The bundler downloads all data pertaining to the queried job and bundles it in an archive named g5k-bundle-SITE-JID.tar.gz or, if an output name has been provided, NAME.tar.gz.
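The naming rule can be expressed as a small function, which may be handy when a script needs to locate the archive after the bundler has run. This is only a sketch of the rule as described on this page; expected_archive_name is a hypothetical helper, not something shipped with the bundler.

```python
from typing import Optional


def expected_archive_name(site: str, job_id: int,
                          name: Optional[str] = None) -> str:
    """Return the archive file name the bundler is documented to produce."""
    if name is not None:
        # An explicit output name replaces the default prefix entirely.
        return f"{name}.tar.gz"
    return f"g5k-bundle-{site}-{job_id}.tar.gz"
```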
The bundle is provided as a tar.gz archive, which can be manipulated using the following commands:
- Listing
  tar -tzf NAME.tar.gz lists all files contained within the bundle
- Extraction
  tar -xzf NAME.tar.gz extracts all files to a directory with the same name as the bundle
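The same two operations can be done without the tar command, for instance on a machine where it is not installed. The sketch below uses Python's standard tarfile module; list_bundle and extract_bundle are illustrative names, and the code assumes the archive contains a single top-level directory, as in the example listing on this page. As with any archive, only extract bundles from a source you trust.

```python
import tarfile
from pathlib import Path
from typing import List


def list_bundle(archive: str) -> List[str]:
    """Rough equivalent of `tar -tzf`: return the member names."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


def extract_bundle(archive: str, dest: str = ".") -> Path:
    """Rough equivalent of `tar -xzf`: unpack into dest.

    Returns the path of the top-level directory (assumption: the
    archive has exactly one, named after the bundle).
    """
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    top_level = names[0].split("/")[0]
    return Path(dest) / top_level
```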
Users on older versions of Windows might need third-party software (often 7-Zip) to unpack the bundle.
Example usage
    user@fsophia:~$ g5k-metadata-bundler -s nancy -j 3003030
    Running g5k-metadata-bundler for job 3003030 at nancy
    Downloading https://api.grid5000.fr/stable/sites/nancy/jobs/3003030
    Downloading https://api.grid5000.fr/stable/sites/nancy/clusters/graoully/nodes/graoully-1?version=7f6b81c2621c6ed3a4fac632f213436813495755
    Downloading https://api.grid5000.fr/stable/?version=7f6b81c2621c6ed3a4fac632f213436813495755&deep=true
    Downloading https://api.grid5000.fr/stable/sites/nancy/metrics?job_id=3003030&nodes=graoully-1
    Generating README
    Compressing bundle
    Bundle created at g5k-bundle-nancy-3003030.tar.gz
    user@fsophia:~$ ls -lh g5k-bundle-nancy-3003030.tar.gz
    -rw-r--r-- 1 user g5k-users 456K Jul 19 09:50 g5k-bundle-nancy-3003030.tar.gz
    user@fsophia:~$ tar -tzf g5k-bundle-nancy-3003030.tar.gz
    g5k-bundle-nancy-3003030/
    g5k-bundle-nancy-3003030/g5k-oarjob-nancy-3003030.json
    g5k-bundle-nancy-3003030/README
    g5k-bundle-nancy-3003030/g5k-resource-nancy-graoully-1-7f6b81c2621c6ed3a4fac632f213436813495755.json
    g5k-bundle-nancy-3003030/g5k-monitoring-nancy-graoully-1-3003030.json
    g5k-bundle-nancy-3003030/g5k-refapi-7f6b81c2621c6ed3a4fac632f213436813495755.json
Bundle contents
The bundle contains the following file types:
- g5k-oarjob-SITE-JID.json: Job files
  - Contains the information for a given OAR job JID at Grid'5000 site SITE, such as:
    - submission, start, and end dates
    - user and group (group granting access) of the job
    - job types and properties
    - command executed by the job
    - list of resources attributed to the job
    - OAR events for the job
  - This information is extracted from the jobs API.
- g5k-resource-SITE-NODE-VERSION.json: Resource files
  - Contains information about a single NODE, such as:
    - node architecture, BIOS, RAM, and CPU information
    - network, storage, and monitoring devices
    - base configuration information
  - The bundle will contain one such file for each of the nodes involved in a job.
  - This information is extracted from the reference API.
  - The VERSION of the information contained in this file matches what it was on the day the job was executed.
- g5k-refapi-VERSION.json: Reference API files
  - Contains a full copy of the reference API at VERSION.
  - This can be used to look up information about nodes not directly used by the bundled OAR jobs.
  - The resource files are a subset of this file.
- g5k-monitoring-SITE-NODE-JID.json: Monitoring files
  - Contains all the monitoring measurements made by Kwollect for a NODE during the OAR job JID.
  - The contents will vary depending on how much monitoring was enabled for a given job. See the default metrics in Monitoring_Using_Kwollect.
  - These are often the heaviest files in the bundle.
  - This information is extracted from Kwollect.
- README: The readme file
  - Contains information pertaining to the execution of the bundler, such as:
    - bundler version
    - execution date
    - list of warnings and errors that happened during bundling
    - list of files included in the bundle with a short description
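Because the file type and its parameters are encoded in the file name, a script can sort the members of a bundle without opening them. The sketch below is one possible way to do this with regular expressions. It rests on two assumptions that are not stated on this page: that site names contain only lowercase letters and digits, and that VERSION is always a 40-character hexadecimal identifier, as in the example listing. The function name classify is illustrative.

```python
import re
from typing import Dict, Optional

# Patterns follow the naming scheme described above.
# Assumptions: SITE has no hyphen, VERSION is 40 hex characters.
_SITE = r"(?P<site>[a-z0-9]+)"
_VERSION = r"(?P<version>[0-9a-f]{40})"
_PATTERNS = {
    "job": re.compile(rf"^g5k-oarjob-{_SITE}-(?P<jid>\d+)\.json$"),
    "resource": re.compile(rf"^g5k-resource-{_SITE}-(?P<node>.+)-{_VERSION}\.json$"),
    "refapi": re.compile(rf"^g5k-refapi-{_VERSION}\.json$"),
    "monitoring": re.compile(rf"^g5k-monitoring-{_SITE}-(?P<node>.+)-(?P<jid>\d+)\.json$"),
}


def classify(member: str) -> Optional[Dict[str, str]]:
    """Map an archive member name to its file type and parameters.

    Returns None for entries that match no known type, such as the
    top-level directory itself.
    """
    filename = member.rsplit("/", 1)[-1]  # drop the bundle directory prefix
    if filename == "README":
        return {"kind": "readme"}
    for kind, pattern in _PATTERNS.items():
        match = pattern.match(filename)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None
```

Node names may themselves contain hyphens (for example graoully-1), which is why the node group is matched greedily and the trailing job id or version anchors the end of the name.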
Users will also find a small bundler info segment at the end of every file. This segment contains the date at which the file was generated, any warnings raised during its generation, and a list of references indicating how the file relates to other files in the bundle.
Planned Evolutions
As the bundler is still in alpha version, we welcome comments and feature requests.
The following is a list of features we are already looking at:
- Concerning bundle contents
  - Bundling multiple jobs in a single archive
  - Bundling based on job names
  - Better management of monitoring information when it is too big for download
  - Adding information concerning the standard environment to the bundle
  - Adding image deployments to the bundle
  - Adding information concerning the deployed images
- Concerning bundler operation
  - No-compress mode where the bundle is left as a directory containing all files
  - Appending new files to an existing bundle
  - Reducing the memory footprint
  - Assessing the viability of parallel downloads
Open questions
Additionally, we would welcome feedback concerning the following open questions:
- What format should be used inside the bundle? Are JSON files sufficient, or should we convert files to structured XML?
- Should we keep bundler segments at the end of every file? Should we move all cross references to a separate file? Should we not have cross references at all?