Ceph

Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability. See more at: http://ceph.com/

This page gives information about the managed Ceph service provided by Grid'5000.

About Ceph

The following documentation describes the architecture: https://docs.ceph.com/en/latest/architecture/

Grid'5000 Deployment

Sites    Size    Configuration        Rados      RBD        CephFS    RadosGW
Rennes   ~ 9 TB  16 OSDs on 4 nodes   Check.png  Check.png  Fail.png  Fail.png
Nantes   ~ 7 TB  12 OSDs on 4 nodes   Check.png  Check.png  Fail.png  Fail.png

Configuration

Generate your key

In order to access the object store, you will need a Cephx key. See: https://api.grid5000.fr/sid/sites/rennes/storage/ceph/ui/

Click on Generate a Cephx Key. A new key is generated.

Your key is now available from the frontends :

[client.jdoe]
  key = AQBwknVUwAPAIRAACddyuVTuP37M55s2aVtPrg==

Note: Replace jdoe with your login, and rennes with nantes if your experiment is at Nantes.

Create/Update/Delete Ceph pools

Requirement : Generate your key

Manage your Ceph pools from the Grid'5000 Ceph frontend :

Currently, each user is limited to 4 pools.

Ceph g5k frontend.png

Replication size

  • 1: no replication (not secure; most efficient for write operations; not recommended: a single disk crash and all data is lost)
  • n: one primary object + n-1 replicas (more safety, less efficient for write operations)

You can edit the replication size by clicking on the replication size value. If the new value is greater than the old one, existing objects will automatically be replicated. Depending on the number of objects, this operation will stress the Ceph cluster and may take some time.
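
As a rough illustration of the storage cost: with a replication size of n, every object consumes n times its size in raw cluster capacity. A short Python sketch (the ~9 TB figure is the raw size of the Rennes deployment listed above):

# Rough illustration: usable capacity vs. pool replication size.
# raw_tb is the approximate raw capacity of the Rennes deployment (see the table above).
raw_tb = 9.0
for size in (1, 2, 3):
    usable_tb = raw_tb / size  # each object is stored `size` times
    print("replication size %d: about %.1f TB of user data in %.0f TB raw" % (size, usable_tb, raw_tb))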

Quota

The quota is currently limited to 2 TB per pool. You can change it at any time by clicking on the quota value, so start with a small, realistic quota.

Expiration date

The initial expiration date is limited to 2 months. You can extend it at any time by clicking on the expiration date value.

Configure Ceph on clients

On the frontend or a node with the standard environment

Create a Ceph configuration file ~/.ceph/config:

  • For Nantes :
[global]
mon host = ceph0,ceph1,ceph2
keyring = /home/<login>/.ceph/ceph.client.<login>.keyring
  • For Rennes :
[global]
mon host = ceph0,ceph1,ceph2
keyring = /home/<login>/.ceph/ceph.client.<login>.keyring

Create a Ceph keyring file ~/.ceph/ceph.client.$USER.keyring containing your Cephx key:

Terminal.png frontend:
curl -k https://api.grid5000.fr/sid/sites/rennes/storage/ceph/auths/$USER.keyring > ~/.ceph/ceph.client.$USER.keyring

Note: Replace rennes with nantes if you are at Nantes.

Then export the following environment variables so that the Ceph tools use this configuration file and your Cephx identity:

export CEPH_CONF=~/.ceph/config
export CEPH_ARGS="--id $USER"

Note: it may be a good idea to export CEPH_CONF and CEPH_ARGS in your ~/.bashrc or ~/.zshrc file.

Test

$ ceph health
HEALTH_OK
$ ceph osd tree
# id	weight	type name	up/down	reweight
-1	8.608	root default
-6	8.608		datacenter rennes
-2	2.152			host ceph0
0	0.538				osd.0	up	1
1	0.538				osd.1	up	1
2	0.538				osd.2	up	1
3	0.538				osd.3	up	1
-3	2.152			host ceph1
4	0.538				osd.4	up	1
5	0.538				osd.5	up	1
6	0.538				osd.6	up	1
7	0.538				osd.7	up	1
-4	2.152			host ceph2
8	0.538				osd.8	up	1
9	0.538				osd.9	up	1
10	0.538				osd.10	up	1
11	0.538				osd.11	up	1
-5	2.152			host ceph3
12	0.538				osd.12	up	1
14	0.538				osd.14	up	1
15	0.538				osd.15	up	1
13	0.538				osd.13	up	1

Usage

Rados Object Store access

Requirements: Create a Ceph pool, Configure Ceph on clients

From command line

Put an object into a pool

Terminal.png node:
rados --pool jdoe_objects put <objectname> <infile>

List objects of a pool

Terminal.png node:
rados --pool jdoe_objects ls

Get object from a pool

Terminal.png node:
rados --pool jdoe_objects get <objectname> <outfile>

Remove an object

Terminal.png node:
rados --pool jdoe_objects rm <objectname>

Usage information

Terminal.png node:
rados --pool jdoe_objects df
pool name       category                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
jdoe_objects -                    1563027            2            0            0           0            0            0          628      2558455
  total used       960300628       295991
  total avail     7800655596
  total space     9229804032

From your application (C/C++, Python, Java, Ruby, PHP...)

See: https://docs.ceph.com/en/latest/rados/api/librados-intro/?highlight=librados
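
For instance, a minimal Python sketch using the rados bindings (package python3-rados on Debian), assuming the ~/.ceph/config file and keyring created in the configuration section above and a pool named jdoe_objects:

#!/usr/bin/env python3
# Minimal librados example (python3-rados), reusing the client configuration
# from "Configure Ceph on clients". The pool name jdoe_objects is an example.
import os
import rados

cluster = rados.Rados(conffile=os.path.expanduser('~/.ceph/config'),
                      rados_id=os.environ['USER'])
cluster.connect()
try:
    ioctx = cluster.open_ioctx('jdoe_objects')              # open an I/O context on your pool
    try:
        ioctx.write_full('hello', b'Hello from librados')   # put an object
        print(ioctx.read('hello'))                           # get it back
        for obj in ioctx.list_objects():                     # list the objects of the pool
            print(obj.key)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

The same operations are available from the other language bindings (C/C++, Java, Ruby, PHP...) described in the librados introduction linked above.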

Rados benchmarks

On a 1G ethernet client

Terminal.png frontend:
oarsub -I -p "cluster='parapluie'"
Write (16 concurrent operations)
Terminal.png node:
rados --pool pmorillo_bench bench 60 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
Object prefix: benchmark_data_parapluie-31.rennes.grid5000._25776
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23     91.98        92  0.553582  0.434956
    2      16        66        50   99.9807       108  0.471024  0.486303
    3      16        95        79   105.314       116  0.654954  0.510875
    4      16       124       108   107.981       116  0.522723  0.535851
    5      16       151       135   107.981       108  0.244908  0.547624
    6      16       180       164   109.314       116  0.239384   0.54942
...
Total time run:         60.478634
Total writes made:      1694
Write size:             4194304
Bandwidth (MB/sec):     112.040

Stddev Bandwidth:       15.9269
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 0
Average Latency:        0.571046
Stddev Latency:         0.354297
Max latency:            2.62603
Min latency:            0.165878
Read
Terminal.png node:
rados --pool pmorillo_bench bench 60 seq
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9669        92   0.91312  0.429879
    2      16        70        54   107.973       124  0.679348  0.458774
    3      16        96        80   106.644       104  0.117995   0.48914
    4      16       126       110    109.98       120  0.207923  0.520003
    5      16       154       138    110.38       112  0.370721  0.521431
    6      16       180       164   109.314       104  0.293496  0.533345
...
Total time run:        60.321552
Total reads made:     1691
Read size:            4194304
Bandwidth (MB/sec):    112.132

Average Latency:       0.569955
Max latency:           4.06499
Min latency:           0.055024

On a 10G ethernet client

Terminal.png frontend:
oarsub -I -p "cluster='paranoia'"
Write (16 concurrent operations)
  • Pool replication size : 1
Terminal.png node:
rados --pool pmorillo_bench bench 60 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
Object prefix: benchmark_data_paranoia-3.rennes.grid5000.fr_5626
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16       146       130   519.915       520  0.488794  0.103537
    2      16       247       231   461.933       404  0.164705  0.126076
    3      16       330       314   418.612       332  0.036563  0.148763
    4      16       460       444   443.941       520  0.177378  0.141696
...
Total time run:         60.579488
Total writes made:      5965
Write size:             4194304
Bandwidth (MB/sec):     393.863 

Stddev Bandwidth:       83.789
Max bandwidth (MB/sec): 520
Min bandwidth (MB/sec): 0
Average Latency:        0.162479
Stddev Latency:         0.195071
Max latency:            1.7959
Min latency:            0.033313
Read
Terminal.png node:
rados --pool pmorillo_bench bench 60 seq
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16       288       272   1087.78      1088  0.028679 0.0556937
    2      16       569       553   1105.82      1124  0.049299 0.0569641
    3      16       850       834   1111.85      1124  0.024369 0.0566906
    4      16      1127      1111   1110.86      1108  0.024855 0.0570628
    5      16      1406      1390   1111.86      1116  0.182567 0.0568681
...
Total time run:        21.377795
Total reads made:     5965 
Read size:            4194304
Bandwidth (MB/sec):    1116.111 

Average Latency:       0.0573023
Max latency:           0.270447
Min latency:           0.011937

RBD (Rados Block Device)

Requirements: Create a Ceph pool, Configure Ceph on clients

Create a Rados Block Device

Terminal.png frontend:
rbd --pool jdoe_rbd create <rbd_name> --size <MB> --image-format=1
Warning.png Warning

The Ceph server release is older than the client on the frontend. For the moment we have to force "--image-format=1", which produces the warning "rbd: image format 1 is deprecated".

Terminal.png frontend:
rbd --pool jdoe_rbd ls
Terminal.png frontend:
rbd --pool jdoe_rbd info <rbd_name>

Example:

$ rbd -p pmorillo_rbd create datas --size 4096 --image-format=1
rbd: image format 1 is deprecated
$ rbd -p pmorillo_rbd ls
datas
$ rbd -p pmorillo_rbd info datas
rbd image 'datas':
	size 4096 MB in 1024 objects
	order 22 (4096 kB objects)
	block_name_prefix: rb.0.604e.238e1f29
	format: 1

Note: rbd -p pmorillo_rbd info datas is equivalent to rbd info pmorillo_rbd/datas
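
The RBD operations above are also available from the Python rbd bindings (package python3-rbd); a short sketch, assuming the client configuration from this page and a pool named jdoe_rbd:

#!/usr/bin/env python3
# Sketch: create and inspect an RBD image with the Python bindings (python3-rbd).
# The pool name jdoe_rbd and the image name datas are examples; old_format=True
# mirrors the --image-format=1 workaround above and can be dropped on newer clusters.
import os
import rados, rbd

cluster = rados.Rados(conffile=os.path.expanduser('~/.ceph/config'),
                      rados_id=os.environ['USER'])
cluster.connect()
ioctx = cluster.open_ioctx('jdoe_rbd')

rbd.RBD().create(ioctx, 'datas', 4 * 1024**3, old_format=True)  # 4 GiB image
print(rbd.RBD().list(ioctx))                                    # e.g. ['datas']

image = rbd.Image(ioctx, 'datas')
print(image.size())                                             # size in bytes
image.close()

ioctx.close()
cluster.shutdown()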

Create filesystem and mount RBD

Terminal.png frontend:
oarsub -I -t deploy
Terminal.png frontend:
kadeploy3 -e debian11-x64-base -f $OAR_NODEFILE -k
Terminal.png frontend:
ssh root@deployed_node
Terminal.png node:
apt install ceph
Terminal.png node:
modprobe rbd
Terminal.png node:
modinfo rbd

Create the Ceph configuration file /etc/ceph/ceph.conf:

[global]
mon host = ceph0,ceph1,ceph2

Get your Cephx key from the frontend:

frennes $ curl -k https://api.grid5000.fr/sid/sites/rennes/storage/ceph/auths/$USER.keyring | ssh root@deployed_node "cat - > /etc/ceph/ceph.client.$USER.keyring"
Terminal.png node:
rbd --id jdoe --pool jdoe_pool map <rbd_name>

Example:

rbd --id pmorillo map pmorillo_rbd/datas
Terminal.png node:
rbd --id jdoe --pool jdoe_pool showmapped

Example:

rbd --id pmorillo showmapped
id pool         image snap device
0  pmorillo_rbd datas -    /dev/rbd0
Terminal.png node:
mkfs.ext4 /dev/rbd0
Terminal.png node:
mkdir /mnt/rbd
Terminal.png node:
mount /dev/rbd0 /mnt/rbd
Terminal.png node:
df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        15G  1.6G   13G  11% /
...
/dev/sda5       525G   70M  498G   1% /tmp
/dev/rbd0       3.9G  8.0M  3.6G   1% /mnt/rbd

Note: To map the RBD device at boot time, use the /etc/ceph/rbdmap file:

# RbdDevice		Parameters 
#poolname/imagename	id=client,keyring=/etc/ceph/ceph.client.keyring

Resize, snapshots, copy, etc...

See the Ceph RBD documentation: https://docs.ceph.com/en/latest/rbd/
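
These operations are also exposed by the Python rbd bindings; a short sketch of resize and snapshots, reusing the jdoe_rbd pool and datas image from the examples above (after growing the image, the filesystem on /dev/rbd0 still has to be grown separately, e.g. with resize2fs):

#!/usr/bin/env python3
# Sketch: resize and snapshot an RBD image with python3-rbd.
# Assumes the jdoe_rbd/datas image created earlier on this page.
import os
import rados, rbd

cluster = rados.Rados(conffile=os.path.expanduser('~/.ceph/config'),
                      rados_id=os.environ['USER'])
cluster.connect()
ioctx = cluster.open_ioctx('jdoe_rbd')

image = rbd.Image(ioctx, 'datas')
image.resize(8 * 1024**3)               # grow the image to 8 GiB
image.create_snap('before-experiment')  # take a snapshot
print([snap['name'] for snap in image.list_snaps()])
image.remove_snap('before-experiment')  # drop the snapshot again
image.close()

ioctx.close()
cluster.shutdown()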

QEMU/RBD

Requirements: Create a Ceph pool, Configure Ceph on clients

Convert a qcow2 file into RBD

Terminal.png frennes:
oarsub -I
Terminal.png node:
qemu-img convert -f qcow2 -O raw /grid5000/virt-images/debian10-x64-base.qcow2 rbd:pool_name/debian10:id=jdoe
Terminal.png node:
rbd --id jdoe --pool pool_name ls
debian10

Start KVM virtual machine from a Rados Block Device

Terminal.png node:
screen kvm -m 1024 -drive format=raw,file=rbd:pool_name/debian10:id=jdoe -nographic