Deployments API

The Grid'5000 platform provides an API for reserving and subsequently deploying different resources (nodes, storage, etc.). This API encapsulates and federates the different actions and functionalities that can be performed on these resources. In this tutorial, we present the new version (4.0) of the Grid'5000 API and explain how to use its new deployment functionalities based on kadeploy. The presence of kadeploy itself is transparent to the end user, who can access its features directly through the Grid'5000 API.

Note: For a detailed description of the Grid'5000 API and its complete set of features, please consult the API documentation.


Theoretical aspect

The idea is to redirect all kadeploy-related requests coming to the Grid'5000 API towards the local kadeploy-server at each site, and to let the kadeploy-server handle the request and send the response back to the user.
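
For illustration, such a deployment request can be submitted directly over HTTP to a site's deployments collection, for example with curl from a frontend. This is only a sketch: the node name is taken from the examples later on this page, while the environment name, the payload fields and the version path (stable vs. 4.0) are assumptions to be adapted from the API documentation:

Terminal.png frontend:
   $ curl -i -X POST -H "Content-Type: application/json" \
          -d '{"nodes": ["paravance-32.rennes.grid5000.fr"], "environment": "jessie-x64-base"}' \
          https://api.grid5000.fr/stable/sites/rennes/deployments     # Returns a deployment resource whose UID can be polled for status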

Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Key objectives of this tutorial

The principal objectives of this tutorial are the following:

  1. Install a Hadoop cluster with a Ceph backend, on dedicated nodes.
    1. Run some Hadoop jobs on datasets and generate result datasets.
  2. Back up the entire Hadoop installation onto managed Ceph.
  3. Restore the Hadoop cluster on a similar set of dedicated nodes.
    1. Make another reservation for a deployed Ceph on a similar set of dedicated nodes.
    2. Restore the Hadoop cluster with its Ceph backend on this second set of dedicated nodes.
  4. Run the same Hadoop jobs on the datasets and generate result datasets.
    1. Result reproducibility: if the backup-restoration was correct, the result datasets should be identical.
    2. Performance reproducibility (optional, to be addressed later): if the Hadoop backup-restore process was perfect, the performance of the second Hadoop cluster should be similar to that of the first installation, within acceptable limits of tolerance.

Installation of a Hadoop cluster (on dedicated nodes)

In this main step, we first deploy a dedicated Ceph cluster. We then deploy Ceph clients that connect to this dedicated cluster. Finally, we install Hadoop (primarily HDFS) on the Ceph clients.

Deployment of a Ceph cluster (on dedicated nodes)

First we need to prepare our dedicated storage using a Ceph backend. We will call it the deployed Ceph cluster.

Please follow the link in the sub-heading to prepare the deployed Ceph cluster. If you have already completed the preliminaries (installing the Ruby-Cute framework, etc.), you can use the basic commands below to prepare your deployed Ceph cluster.

At the CLI on a frontend, type or copy the following commands as-is:

Terminal.png frontend:
   $ gem install --user-install trollop
   $ gem install --user-install ruby-cute -v 0.3 # use version 0.3 until backward compatibility of Ruby-Cute is ascertained.
   $ ./ceph5k/cephDeploy     # Creates and deploys the Ceph cluster
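
Once cephDeploy has finished, you can sanity-check the new cluster before going further. A minimal sketch, assuming you can log in as root to the monitor node reported by the script (root@monitor-node is only an example host name):

Terminal.png root@monitor-node:
   # ceph status       # Overall cluster health, number of monitors and OSDs
   # ceph osd tree     # Lists the OSDs and the hosts they run on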

Deployment of Ceph clients (on dedicated nodes)

This step prepares the dedicated client nodes that will access the deployed Ceph cluster prepared in the previous step.

Once these clients are prepared, they will serve as the nodes (Master + Slaves) for a Hadoop cluster. Hence, choose these client nodes appropriately for your Big Data experiments.

Please follow the link in the sub-heading to prepare the client nodes. These tasks are automated on a Grid'5000 frontend using the following command:

Terminal.png frontend:
   $ ./ceph5k/cephClient      # Creates RBD and FS on 'n' clients of deployed Ceph
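
To verify that a client was prepared correctly, you can, for instance, log in to it and check that its RBD is mapped and that the filesystem is mounted. A small sketch, using the mount point /mnt/ceph-depl that the rest of this tutorial relies on (adapt if your configuration differs):

Terminal.png root@ceph-client:
   # rbd showmapped            # Shows the RBD image mapped on this client
   # df -h /mnt/ceph-depl      # Confirms the filesystem created on the RBD is mounted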

Deployment of a Hadoop cluster (on dedicated nodes)

This step installs a dedicated Hadoop cluster on the dedicated Ceph client nodes. The Hadoop cluster thus uses the deployed Ceph cluster as its storage backend. In particular, the temporary folder for all intermediate files (e.g. during MapReduce operations) is located on the Ceph backend.

Please follow the link in the sub-heading to install the Hadoop cluster. These tasks are automated on a Grid'5000 frontend using the following command:

Terminal.png frontend:
   $ ./ceph5k/cephHadoop      # Creates Hadoop cluster on dedicated Ceph clients & deployed Ceph cluster

Simple operations of the Hadoop cluster

In this main step, we run some simple operations on the Hadoop cluster: we first smoke-test the HDFS installation, then copy a dataset into the cluster, and finally run a simple WordCount job on it.

Smoke-testing the HDFS functionality

Once the Hadoop installation is complete, you will have a full-fledged Hadoop Distributed File System (HDFS) up and running on your Ceph clients, in a Master-Slave configuration. Each of these nodes accesses the Ceph backend for storage.

On the Master node of your Hadoop installation, log in as root@master-node and run the following command to check the Hadoop daemons that are running:

Terminal.png root@master-node:
   # jps

On your console at the Master node, you will see output listing the following daemons (process IDs will differ):

Terminal.png root@master-node:
   2834 Jps
   1888 NameNode
   2090 SecondaryNameNode
   2276 ResourceManager
   2571 JobHistoryServer
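
For comparison, running the same check on any slave node should show the worker daemons. The output below is only indicative, assuming a standard HDFS/YARN worker configuration (process IDs will differ):

Terminal.png root@slave-node:
   # jps
   1423 DataNode
   1587 NodeManager
   2031 Jps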

Copying external datasets to Hadoop cluster

The next step is to copy some standard Grid'5000 datasets into the dedicated Hadoop cluster. We do this in two sub-steps:

  1. Copy a Grid'5000 dataset to one of the client nodes (e.g. the Master node) of your dedicated Ceph cluster.
  2. Load this dataset into HDFS, so that it is visible to your entire Hadoop cluster (from the Master as well as the Slave nodes).

Note: If you have your own datasets stored on another mounted drive, you are free to load that dataset directly onto your Hadoop cluster (just step 2 above).

Copying Grid'5000 datasets to dedicated Ceph cluster

On the Master node of your Hadoop installation, log in as root@master-node and run the following command to copy a dataset:

Terminal.png root@master-node:
   # rsync -avzP userid@rennes:/home/abasu/public/ceph-data/google-3grams.csv /mnt/ceph-depl/    # copies a Grid'5000 dataset to your Ceph deployed cluster

This step does not make your dataset visible to all nodes of your dedicated Ceph-Hadoop cluster. For example, if you log in to any slave node as root@slave-node and look for the dataset in the same directory there (/mnt/ceph-depl/), you will NOT see it. To make it visible to ALL nodes in the dedicated cluster, we need to load the dataset into the Hadoop cluster (HDFS).

Loading the dataset to Hadoop cluster (HDFS)

On the Master node of your Hadoop installation, log in as root@master-node and run the following commands to create a Hadoop directory structure and load the dataset into your HDFS:

Terminal.png root@master-node:
   # hdfs dfs -mkdir /mnt/
   # hdfs dfs -mkdir /mnt/ceph-depl
   # hdfs dfs -copyFromLocal /mnt/ceph-depl/google-3grams.csv /mnt/ceph-depl/

Note: We use the same directory structure in the Hadoop cluster as in the dedicated Ceph cluster. This simply keeps the commands easy to type; you can choose your own directory structure in Hadoop.

Once the above commands have terminated, the dataset is visible on your entire Hadoop cluster. If you log in to any slave node as root@slave-node and look for the dataset in the Hadoop directory /mnt/ceph-depl/, you will see it. You can even download it from the Hadoop directory to the slave node's local drive, using the commands below:

Terminal.png root@slave-node:
   # hdfs dfs -ls /mnt/ceph-depl      # List dataset in HDFS
   Found 1 items
   -rw-r--r--   3 root supergroup 2178490368 2016-05-12 09:09 /mnt/ceph-depl/google-3grams.csv
   # hdfs dfs -copyToLocal /mnt/ceph-depl/google-3grams.csv /mnt/ceph-depl/    # Download dataset from HDFS to slave-node drive
   # ls -al /mnt/ceph-depl/     # List dataset in slave-node drive
   total 2127464
   drwxr-xr-x 4 root root       4096 May 12 11:24 .
   drwxr-xr-x 3 root root       4096 May 12 10:49 ..
   -rw-r--r-- 1 root root 2178490368 May 12 11:24 google-3grams.csv
   drwxr-xr-x 3 root root       4096 May 12 10:51 hadoop
   drwx------ 2 root root      16384 May 12 10:49 lost+found

Running a simple WordCount program on the Hadoop cluster

On the Master node of your Hadoop installation, log in as root@master-node and run the following command to launch a simple WordCount job on your Hadoop cluster:

Terminal.png root@master-node:
   # hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /mnt/ceph-depl/google-3grams.csv /mnt/ceph-depl/wordcount.output

You will get output similar to the following:

Terminal.png root@master-node:
   16/05/12 09:26:40 INFO client.RMProxy: Connecting to ResourceManager at paravance-32.rennes.grid5000.fr/172.16.96.32:8032
   16/05/12 09:26:40 INFO input.FileInputFormat: Total input paths to process : 1
   16/05/12 09:26:41 INFO mapreduce.JobSubmitter: number of splits:17
   16/05/12 09:26:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1463043071145_0001
   16/05/12 09:26:41 INFO impl.YarnClientImpl: Submitted application application_1463043071145_0001
   16/05/12 09:26:41 INFO mapreduce.Job: The url to track the job: http://paravance-32.rennes.grid5000.fr:8088/proxy/application_1463043071145_0001/
   16/05/12 09:26:41 INFO mapreduce.Job: Running job: job_1463043071145_0001
   16/05/12 09:26:48 INFO mapreduce.Job: Job job_1463043071145_0001 running in uber mode : false
   16/05/12 09:26:48 INFO mapreduce.Job:  map 0% reduce 0%
   16/05/12 09:26:59 INFO mapreduce.Job:  map 4% reduce 0%
   ...
   16/05/12 09:27:02 INFO mapreduce.Job:  map 19% reduce 0%
   16/05/12 09:27:13 INFO mapreduce.Job:  map 37% reduce 2%
   ...
   16/05/12 09:27:34 INFO mapreduce.Job:  map 96% reduce 29%
   16/05/12 09:27:38 INFO mapreduce.Job:  map 100% reduce 31%
   16/05/12 09:27:40 INFO mapreduce.Job:  map 100% reduce 100%
   16/05/12 09:27:41 INFO mapreduce.Job: Job job_1463043071145_0001 completed successfully
   16/05/12 09:27:41 INFO mapreduce.Job: Counters: 50
      File System Counters
               FILE: Number of bytes read=30256705
               FILE: Number of bytes written=42641064
               FILE: Number of read operations=0
               FILE: Number of large read operations=0
               FILE: Number of write operations=0
               HDFS: Number of bytes read=2178558284
               HDFS: Number of bytes written=3248232
               HDFS: Number of read operations=54
               HDFS: Number of large read operations=0
               HDFS: Number of write operations=2
      Job Counters 
               Killed map tasks=2
               Launched map tasks=19
               Launched reduce tasks=1
               Data-local map tasks=19
               Total time spent by all maps in occupied slots (ms)=711159
               Total time spent by all reduces in occupied slots (ms)=35832
               Total time spent by all map tasks (ms)=711159
               Total time spent by all reduce tasks (ms)=35832
               Total vcore-milliseconds taken by all map tasks=711159
               Total vcore-milliseconds taken by all reduce tasks=35832
               Total megabyte-milliseconds taken by all map tasks=728226816
               Total megabyte-milliseconds taken by all reduce tasks=36691968
      Map-Reduce Framework
               Map input records=79537570
               Map output records=534334266
               Map output bytes=4306267666
               Map output materialized bytes=10258589
               Input split bytes=2380
               Combine input records=535817569
               Combine output records=2213258
               Reduce input groups=272792
               Reduce shuffle bytes=10258589
               Reduce input records=729955
               Reduce output records=272792
               Spilled Records=2967740
               Shuffled Maps =17
               Failed Shuffles=0
               Merged Map outputs=17
               GC time elapsed (ms)=8927
               CPU time spent (ms)=817080
               Physical memory (bytes) snapshot=4880969728
               Virtual memory (bytes) snapshot=15618744320
               Total committed heap usage (bytes)=3740794880
      Shuffle Errors
               BAD_ID=0
               CONNECTION=0
               IO_ERROR=0
               WRONG_LENGTH=0
               WRONG_MAP=0
               WRONG_REDUCE=0
      File Input Format Counters 
               Bytes Read=2178555904
      File Output Format Counters 
               Bytes Written=3248232
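
Once the job has completed successfully, you can inspect the result dataset that WordCount wrote to HDFS. Since the job above ran with a single reducer, the results land in one part-r-00000 file (these file names are Hadoop defaults and may differ in your setup):

Terminal.png root@master-node:
   # hdfs dfs -ls /mnt/ceph-depl/wordcount.output                         # Lists the _SUCCESS marker and the part-r-* result files
   # hdfs dfs -cat /mnt/ceph-depl/wordcount.output/part-r-00000 | head    # Shows the first few (word, count) pairs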

Backing up the Hadoop cluster

Before backing up a Hadoop cluster from the deployed Ceph to managed Ceph, it is necessary that the Hadoop daemons (HDFS, YARN, MapReduce, ...) are stopped cleanly, as sketched after the list below. This is for two reasons:

  1. To ensure the latest data (including intermediate data files and directories) are backed up
  2. To ensure that there are no locks on files or directories imposed by the Hadoop daemons
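
A minimal sketch of stopping the daemons from the Master node, assuming the standard Hadoop 2.7 scripts under /opt/hadoop (the installation path used by the WordCount example above); adapt if your layout differs:

Terminal.png root@master-node:
   # /opt/hadoop/sbin/stop-yarn.sh                                   # Stops the ResourceManager and NodeManagers
   # /opt/hadoop/sbin/stop-dfs.sh                                    # Stops the NameNode, SecondaryNameNode and DataNodes
   # /opt/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver     # Stops the JobHistoryServer
   # jps                                                             # Only Jps should remain in the list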

Copying the Hadoop installation (to managed Ceph)

This is done at the level of the Ceph backend installed on the dedicated nodes. Note that each Hadoop node (Master or Slave) is a Ceph client accessing the deployed Ceph cluster, so each client has its own Ceph pool and block device (RBD, RADOS Block Device). Hence, the concept of backing up the Hadoop cluster is the following:

  1. Each Ceph client (Hadoop node) makes a backup of its pool and RBD.
  2. The Ceph RBDs are copied to the user's managed Ceph account (in another pool).
    1. As a result, all the HDFS datasets stored in the relevant RBDs are also copied to managed Ceph.
  3. The Hadoop installation itself must also be copied. It includes:
    1. all Hadoop binaries,
    2. all Hadoop configuration files.
  4. These files can be bundled, put into another Ceph RBD and copied to managed Ceph, as sketched below.
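
As an illustration of point 4, one simple way to capture the Hadoop binaries and configuration files is to archive them onto the RBD-backed filesystem before exporting the RBD. This is only a sketch, assuming the installation lives under /opt/hadoop and the RBD is mounted on /mnt/ceph-depl as earlier in this tutorial:

Terminal.png root@master-node:
   # tar czf /mnt/ceph-depl/hadoop-install-backup.tar.gz -C / opt/hadoop    # The archive sits on the RBD, so the export below also carries the Hadoop installation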

On each client node of the deployed Ceph cluster, log in as root@ceph-client and run the following commands to copy the data to managed Ceph (these steps will be automated using a ceph5k script):

Terminal.png root@ceph-client:
   # rbd snap create user_pool/user_image@user_snap    # Creates a snapshot of RBD user_image
   # rbd export user_pool/user_image@user_snap - | ssh userid@ceph.site.grid5000.fr rbd import - user_pool/user_image        # Exports the snapshot to the managed Ceph pool
   # rbd snap create user_pool/user_image@user_snap2    # Creates a new snapshot of RBD user_image
   # rbd export-diff --from-snap user_snap user_pool/user_image@user_snap2 - | ssh userid@ceph.site.grid5000.fr rbd import-diff - user_pool/user_image    # Updates the image on managed Ceph with the changes since user_snap
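
To check that the image has actually reached your managed Ceph account, you can list your pool there from a frontend. A sketch, reusing the userid, pool and image names from the commands above:

Terminal.png frontend:
   $ ssh userid@ceph.site.grid5000.fr rbd ls user_pool                 # The backed-up image user_image should appear in the list
   $ ssh userid@ceph.site.grid5000.fr rbd info user_pool/user_image    # Shows the size and properties of the imported image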

Restoring a Hadoop cluster (on dedicated nodes)