Ceph
Introduction
Ceph is a storage project sponsored by The Linux Foundation. It has developed a storage system that uses the Reliable Autonomic Distributed Object Store (RADOS) to provide scalable, fast, and reliable software-defined storage by storing files as objects and calculating their location on the fly. Failover happens automatically so no data is lost. By default, three replicas of each object are stored across the OSDs.
Vocabulary:
Object Storage Device (OSD) = The device that stores data.
OSD Daemon = Handles storing all user data as objects.
Ceph Block Device (RBD) = Provides a block device over the network, similar in concept to iSCSI.
Ceph Object Gateway = A RESTful API which works with Amazon S3 and OpenStack Swift.
Ceph Monitors (MONs) = Store and provide a map of data locations.
Ceph Metadata Server (MDS) = Provides metadata about file system hierarchy for CephFS. This is not required for RBD or RGW.
Ceph File System (CephFS) = A POSIX-compliant distributed file system with unlimited size.
Controlled Replication Under Scalable Hashing (CRUSH) = The algorithm used to calculate an object’s storage location instead of looking it up in a central metadata table.
Placement Groups (PGs) = Logical groupings of objects. Each object is mapped to a PG, and each PG is mapped to a set of OSDs.
Ceph monitor nodes have a master copy of a cluster map. This contains 5 separate maps that have information about data location and the cluster’s status. If an OSD fails, the monitor daemon will automatically reorganize everything and provide end-users with an updated cluster map.
Cluster map:
Monitor map = The cluster fsid (uuid), position, name, address and port of each monitor server.
$ sudo ceph mon dump
OSD map = The cluster fsid, available pools, PG numbers, and each OSD’s current status.
$ sudo ceph osd dump
PG map = PG version, PG ID, ratios, and data usage statistics.
$ sudo ceph pg dump
CRUSH map = Storage devices, physical locations, and rules for storing objects. It is recommended to tweak this for production clusters.
MDS map = The MDS map epoch, the pool used for storing metadata, and the list of metadata servers and their states.
$ sudo ceph fs dump
When the end-user asks for a file, the object name is hashed to determine its PG ID, and then CRUSH calculates the exact location of that PG on the relevant OSDs. The primary OSD for that object serves the content. [1]
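For example, the calculated location of any object can be displayed without reading it (the pool and object names below are placeholders):
$ sudo ceph osd map <POOL> <OBJECT_NAME>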
For OSD nodes, it is recommended that the operating system be on two disks in a RAID 1. All of the other disks can be used for OSD or journal/metadata services.
As of the Luminous release, the new mgr (manager) monitoring service is required. It helps to collect metrics about the cluster and should be running on all of the monitor nodes. https://docs.ceph.com/docs/luminous/release-notes/
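A minimal sketch of deploying and checking the manager daemons, assuming ceph-deploy is being used on the monitor nodes:
$ sudo ceph-deploy mgr create <SERVER1> <SERVER2> <SERVER3>
$ sudo ceph mgr module ls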
The original back-end for handling data storage is FileStore. When data is written to a Ceph OSD, it is first fully written to the OSD journal. This is a separate partition that can be on the same drive or a different drive. It is faster to have the journal on an SSD if the OSD drive is a regular spinning-disk drive.
The new BlueStore back-end was released as a technology preview in the Ceph Jewel release. In the Luminous release, it became the default data storage handler. It helps to overcome the double-write penalty of FileStore by writing the data to the block device first and then updating the metadata of the data’s location. That means that in some cases, BlueStore is twice as fast as FileStore. All of the metadata is also stored in the fast RocksDB key-value store. File systems are no longer required for OSDs because BlueStore writes data directly to the block device of the hard drive. [2] It is recommended to have a 3:1 ratio of OSDs to BlueStore journal/metadata devices. The metadata drives should be a fast storage medium such as an SSD or NVMe.
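For example, the back-end in use by a running OSD can be checked from its metadata (this assumes a release new enough to report the “osd_objectstore” field):
$ sudo ceph osd metadata <OSD_ID> | grep osd_objectstore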
ceph-volume is a tool for automagically figuring out which disks to use for journals/metadata or OSDs. It replaces ceph-disk and supports BlueStore. It does not support loopback devices. The logic it normally follows is:
1 OSD per HDD
2 OSDs per SSD
HDD + SSD = HDD OSDs and SSD metadata
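A minimal sketch of letting ceph-volume apply this logic to a set of empty drives (the device names are placeholders):
$ sudo ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/nvme0n1
$ sudo ceph-volume lvm list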
The optimal number of PGs is found by using this equation (replacing the number of OSD daemons, the replica count, and the number of pools). The result should be rounded up to the next power of 2. PGCalc is an online utility/calculator to help automatically determine this value.
Total PGs = (<NUMBER_OF_OSDS> * 100) / <REPLICA_COUNT> / <NUMBER_OF_POOLS>
Example:
OSD count = 30, replica count = 3, pool count = 1
Run the calculations: 1000 = (30 * 100) / 3 / 1
Find the next highest power of 2: 2^10 = 1024
1000 <= 1024
Total PGs = 1024
With Ceph’s configuration, the Placement Group for Placement purpose (PGP) count should be set to the same value as the PG count. The PG count determines how many placement groups a pool’s objects are split into. The cluster only rebalances data when the PGP count is increased.
New pools:
File: /etc/ceph/ceph.conf
[global]
osd pool default pg num = <OPTIMAL_PG_NUMBER>
osd pool default pgp num = <OPTIMAL_PG_NUMBER>
Existing pools:
$ sudo ceph osd pool set <POOL> pg_num <OPTIMAL_PG_NUMBER>
$ sudo ceph osd pool set <POOL> pgp_num <OPTIMAL_PG_NUMBER>
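Verify the values on an existing pool:
$ sudo ceph osd pool get <POOL> pg_num
$ sudo ceph osd pool get <POOL> pgp_num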
Cache pools can be configured to cache files onto faster drives. When a file is continually being read, it will be copied to the faster drives. When a file is first written, it will go to the faster drives. After a period of lesser use, those files will be moved to the slower drives. [3]
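A minimal sketch of creating a cache tier, assuming a fast pool <CACHE_POOL> and a slow pool <STORAGE_POOL> already exist:
$ sudo ceph osd tier add <STORAGE_POOL> <CACHE_POOL>
$ sudo ceph osd tier cache-mode <CACHE_POOL> writeback
$ sudo ceph osd tier set-overlay <STORAGE_POOL> <CACHE_POOL>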
For testing, the “cephx” authentication protocols can temporarily be disabled. This will require a restart of all of the Ceph services. Re-enable cephx by setting these values from “none” back to “cephx.” [4]
File: /etc/ceph/ceph.conf
[global]
auth cluster required = none
auth service required = none
auth client required = none
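After changing these values, restart all of the Ceph services (a sketch, assuming a systemd-based deployment):
$ sudo systemctl restart ceph.target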
Releases
Starting with the Luminous 12 release, all versions are supported for two years. A new release comes out every year. [13]
<RELEASE_NAME> <RELEASE_NUMBER> = <RELEASE_DATE>
Luminous 12 = 2017-08
Mimic 13 = 2018-06
Nautilus 14 = 2019-03
Octopus 15 = 2020-03
Installation
Ceph Requirements:
Fast CPU for OSD and metadata nodes.
1GB RAM per 1TB of Ceph OSD storage, per OSD daemon.
1GB RAM per monitor daemon.
1GB RAM per metadata daemon.
An odd number of monitor nodes (starting at least 3 for high availability and quorum). [5]
Quick
This example demonstrates how to deploy a 3 node Ceph cluster with both the monitor and OSD services. In production, monitor servers should be separated from the OSD storage nodes.
Create a new Ceph cluster configuration. By default, the cluster is named “ceph.”
$ sudo ceph-deploy new <SERVER1>
Install the latest LTS release for production environments on the specified servers. SSH access is required.
$ sudo ceph-deploy install --release jewel <SERVER1> <SERVER2> <SERVER3>
Initialize the first monitor.
$ sudo ceph-deploy mon create-initial <SERVER1>
Install the monitor service on the other nodes.
$ sudo ceph-deploy mon create <SERVER2> <SERVER3>
List the available hard drives from all of the servers. It is recommended to have a fully dedicated drive, not a partition, for each Ceph OSD.
$ sudo ceph-deploy disk list <SERVER1> <SERVER2> <SERVER3>
Carefully select the drives to use. Then use the “disk zap” arguments to zero out the drive before use.
$ sudo ceph-deploy disk zap <SERVER1>:<DRIVE> <SERVER2>:<DRIVE> <SERVER3>:<DRIVE>
Prepare and deploy the OSD service for the specified drives. The default file system is XFS, but Btrfs is more feature-rich, with technologies such as copy-on-write (CoW) support.
$ sudo ceph-deploy osd create --fs-type btrfs <SERVER1>:<DRIVE> <SERVER2>:<DRIVE> <SERVER3>:<DRIVE>
Verify it’s working.
$ sudo ceph status
[6]
ceph-ansible (<= Octopus)
The ceph-ansible project is used to deploy and update Ceph clusters using Ansible. It is deprecated and replaced by cephadm.
$ sudo git clone https://github.com/ceph/ceph-ansible/
$ cd ceph-ansible/
Configure the Ansible inventory hosts file. This should contain the SSH connection details to access the relevant servers.
Inventory hosts:
[mons] = Monitors for tracking and locating object storage data.
[osds] = Object storage device nodes for storing the user data.
[mdss] = Metadata servers for CephFS. (Optional)
[rgws] = RADOS Gateways for Amazon S3 or OpenStack Swift object storage API support. (Optional)
Example inventory:
ceph_monitor_01 ansible_host=192.168.20.11
ceph_monitor_02 ansible_host=192.168.20.12
ceph_monitor_03 ansible_host=192.168.20.13
ceph_osd_01 ansible_host=192.168.20.101 ansible_port=2222
ceph_osd_02 ansible_host=192.168.20.102 ansible_port=2222
ceph_osd_03 ansible_host=192.168.20.103 ansible_port=2222
[mons]
ceph_monitor_01
ceph_monitor_02
ceph_monitor_03
[osds]
ceph_osd_01
ceph_osd_02
ceph_osd_03
Copy the sample configurations and modify the variables.
$ sudo cp site.yml.sample site.yml
$ cd group_vars/
$ sudo cp all.yml.sample all.yml
$ sudo cp mons.yml.sample mons.yml
$ sudo cp osds.yml.sample osds.yml
Common variables:
group_vars/all.yml = Global variables. An example configuration is shown after this group of variables.
ceph_origin = Specify how to install the Ceph software.
upstream = Use the official repositories.
Upstream related variables:
ceph_dev = Boolean value. Use a development branch of Ceph from GitHub.
ceph_dev_branch = The exact branch or commit of Ceph from GitHub to use.
ceph_stable = Boolean value. Use a stable release of Ceph.
ceph_stable_release = The release name to use. The LTS “jewel” release is recommended.
distro = Use repositories already present on the system. ceph-ansible will not install Ceph repositories with this method; they must already be installed.
ceph_release_num = If “ceph_stable” is not defined, use any specific major release number.
9 = infernalis
10 = jewel
11 = kraken
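For example, a minimal group_vars/all.yml for installing the stable Jewel release from the upstream repositories might look like this (a sketch using only the variables described above; a real deployment also needs network and interface settings, and exact variable names and values vary between ceph-ansible versions):
File: group_vars/all.yml
ceph_origin: upstream
ceph_stable: true
ceph_stable_release: jewel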
group_vars/osds.yml = Object storage daemon variables. An example configuration is shown after this group of variables.
devices = A list of drives to use for each OSD daemon.
osd_auto_discovery = Boolean value. Default: false. Instead of manually specifying devices to use, automatically use any drive that does not have a partition table.
OSD option #1:
journal_collocation = Boolean value. Default: false. Use the same drive for journal and data storage.
OSD option #2:
raw_multi_journal = Boolean value. Default: false. Store journals on different hard drives.
raw_journal_devices = A list of devices to use for journaling.
OSD option #3:
osd_directory = Boolean value. Default: false. Use a specified directory for OSDs. This assumes that the end-user has already partitioned the drive and mounted it to /var/lib/ceph/osd/<OSD_NAME> or a custom directory.
osd_directories = The directories to use for OSD storage.
OSD option #4:
bluestore = Boolean value. Default: false. Use the new and experimental BlueStore back-end, which can provide twice the performance for drives that have both a journal and OSD for Ceph.
OSD option #5:
dmcrypt_journal_collocation = Use Linux’s “dm-crypt” to encrypt objects when both the journal and data are stored on the same drive.
OSD option #6:
dmcrypt_dedicated_journal = Use Linux’s “dm-crypt” to encrypt objects when the journal and data are stored on different drives.
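For example, a group_vars/osds.yml that stores data on two drives and their journals on a shared SSD might look like this (a sketch using the variables described above; the drive names are placeholders and the journal device is listed once per data device):
File: group_vars/osds.yml
devices:
  - /dev/sdb
  - /dev/sdc
raw_multi_journal: true
raw_journal_devices:
  - /dev/sdd
  - /dev/sdd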
Finally, run the Playbook to deploy the Ceph cluster.
$ sudo ansible-playbook -i production site.yml
[7]
CRUSH Map
CRUSH maps are used to keep track of OSDs and the physical locations of servers, and they define how to replicate objects.
These maps are divided into four main parts:
Devices = The list of each OSD daemon in the cluster.
Bucket Types = Definitions that group OSDs, each group with its own location and weight, based on servers, rows, racks, datacenters, etc.
Bucket Instances = A bucket instance is created by specifying a bucket type and one or more OSDs.
Rules = Rules can be defined to configure which bucket instances will be used for reading, writing, and/or replicating data.
The compiled (binary) CRUSH map must be exported and decompiled before changes can be made. The file must then be recompiled and loaded back into the cluster for the updates to take effect.
$ sudo ceph osd getcrushmap -o <NEW_COMPILED_FILE>
$ sudo crushtool -d <NEW_COMPILED_FILE> -o <NEW_DECOMPILED_FILE>
$ sudo vim <NEW_DECOMPILED_FILE>
$ sudo crushtool -c <NEW_DECOMPILED_FILE> -o <UPDATED_COMPILED_FILE>
$ sudo ceph osd setcrushmap -i <UPDATED_COMPILED_FILE>
Devices
Devices must follow the format of device <COUNT> <OSD_NAME>. These are automatically generated, but they can be adjusted and new nodes can be manually added here.
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
Bucket Types
Bucket types follow a similar format of type <COUNT> <TYPE_NAME>. The name of the type can be anything. A higher numbered type always inherits the lower numbered types. The default types include:
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
Bucket Instances
Bucket instances are used to group OSD configurations together. Typically these should define physical locations of the OSDs.
<CUSTOM_BUCKET_TYPE> <UNIQUE_BUCKET_NAME> {
id <UNIQUE_NEGATIVE_NUMBER>
weight <FLOATING_NUMBER>
alg <BUCKET_TYPE>
hash 0
item <OSD_NAME> weight <FLOATING_NUMBER>
}
<CUSTOM_BUCKET_TYPE> = Required. This should be one of the user-defined bucket types.
<UNIQUE_BUCKET_NAME> = Required. A unique name that describes the bucket.
id = Required. A unique negative number to identify the bucket.
weight = Optional. A floating/decimal number for the total weight of all of the OSDs in this bucket.
alg = Required. Choose which Ceph bucket type/method that is used to read and write objects. This should not be confused with the user-defined bucket types.
Uniform = Assumes that all hardware in the bucket instance is exactly the same so all OSDs receive the same weight.
List = Lists use the RUSH algorithm to read and write objects in sequential order from the first OSD to the last. This is best suited for data that does not need to be deleted (to avoid rebalancing).
Tree = The binary search tree uses the RUSH algorithm to efficiently handle larger amounts of data.
Straw = Allows all items in the bucket to fairly compete for placement through a process similar to a draw of straws. Placement calculations are slower than with “list” or “tree,” but adding or removing items causes minimal rebalancing.
hash = Required. The hashing algorithm used by CRUSH to lookup and store files. As of the Jewel release, only option “0” for “rjenkins1” is supported.
item = Optional. The OSD name and weight for individual OSDs. This is useful if a bucket instance has hard drives of different speeds.
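For example, a bucket instance for a single host with two OSDs of equal weight might look like this (the host name, ID, and weights are only placeholders):
host ceph_osd_01 {
    id -2
    weight 2.000
    alg straw
    hash 0
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}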
Rules
By modifying the CRUSH map, replication can be configured to go to a different drive, server, chassis, row, rack, datacenter, etc.
rule <RULE_NAME> {
ruleset <RULESET>
type <RULE_TYPE>
min_size <MINIMUM_SIZE>
max_size <MAXIMUM_SIZE>
step take <BUCKET_INSTANCE_NAME>
step <CHOOSE_OPTION>
step emit
}
<RULE_NAME> = Required. A unique name that describes the rule.
ruleset = Required. An integer that can be used to reference this ruleset by a pool.
type = Required. Default is “replicated.” How to handle data replication.
replicated = Data is replicated to different hard drives.
erasure = A similar concept to RAID 5. Data is split into chunks with additional parity chunks so it can be rebuilt if a drive fails. This option helps to save space.
min_size = Required. If a pool uses fewer replicas than this number, CRUSH will not select this rule.
max_size = Required. If a pool uses more replicas than this number, CRUSH will not select this rule.
step take = Required. The name of the bucket instance to start from.
step <CHOOSE_OPTION> = Required. How to select buckets or OSDs of a given type below the starting point. For example, “chooseleaf firstn 0 type host” selects one OSD from as many hosts as there are replicas.
step emit = Required. This signifies the end of the rule block.
[8]
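For example, a simple replicated rule that places each replica on a different host, starting from a root bucket instance named “default,” might look like this (the names are placeholders):
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}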
Repair
Ceph automatically runs through a data integrity check called “scrubbing.” This checks the health of each placement group. Sometimes scrubs can fail due to inconsistencies, commonly a mismatch in time on the OSD servers.
In this example, the placement group “1.28” failed to be scrubbed. This placement group exists on OSDs 8, 11, and 20.
Check the health information.
Example:
$ sudo ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 1.28 is active+clean+inconsistent, acting [8,11,20]
1 scrub errors
Manually run a repair.
Syntax:
$ sudo ceph pg repair <PLACEMENT_GROUP>
Example:
$ sudo ceph pg repair 1.28
Find the error:
Syntax:
$ sudo grep ERR /var/log/ceph/ceph-osd.<OSD_NUMBER>.log
Example:
$ sudo grep ERR /var/log/ceph/ceph-osd.11.log
2017-01-12 22:27:52.626252 7f5b511e8700 -1 log_channel(cluster) log [ERR] : 1.28 shard 12: soid 1:e4c200f7:::rbd_data.a1e002238e1f29.000000000000136d:head candidate had a read error
Find the bad file.
Syntax:
$ sudo find /var/lib/ceph/osd/ceph-<OSD_NUMBER>/current/<PLACEMENT_GROUP>_head/ -name '*<OBJECT_ID>*' -ls
Example:
$ sudo find /var/lib/ceph/osd/ceph-11/current/1.28_head/ -name "*a1e002238e1f29.000000000000136d*"
/var/lib/ceph/osd/ceph-11/current/1.28_head/DIR_7/DIR_2/DIR_3/rbd\udata.a1e002238e1f29.000000000000136d__head_EF004327__1
Stop the OSD.
Syntax:
$ sudo systemctl stop ceph-osd@<OSD_NUMBER>.service
Example:
$ sudo systemctl stop ceph-osd@11.service
Flush the journal to save the current files cached in memory.
Syntax:
$ sudo ceph-osd -i <OSD_NUMBER> --flush-journal
Example:
$ sudo ceph-osd -i 11 --flush-journal
Move the bad object out of its current directory in the OSD.
Example:
$ sudo mv /var/lib/ceph/osd/ceph-11/current/1.28_head/DIR_7/DIR_2/DIR_3/rbd\\udata.a1e002238e1f29.000000000000136d__head_EF004327__1 /root/ceph_osd_backups/
Restart the OSD.
Syntax:
$ sudo systemctl restart ceph-osd@<OSD_NUMBER>.service
Example:
$ sudo systemctl restart ceph-osd@11.service
Run another placement group repair.
Syntax:
$ sudo ceph pg repair <PLACEMENT_GROUP>
Example:
$ sudo ceph pg repair 1.28
[9]
libvirt
Virtual machines that are run via the libvirt front-end can utilize Ceph’s RADOS block devices (RBDs) as their main disk.
Add the network disk to the available devices in the Virsh configuration.
<devices>
    <disk type='network' device='disk'>
        <source protocol='rbd' name='<POOL>/<IMAGE>'>
            <host name='<MONITOR_IP>' port='6789'/>
        </source>
        <target dev='vda' bus='virtio'/>
    </disk>
    ...
</devices>
Authentication is required so the Ceph client credentials must be encrypted by libvirt. This encrypted hash is called a “secret.”
Create a Virsh template that has a secret of type “ceph” with a description for the end user. Optionally, specify a UUID to associate with this secret; otherwise, one will be generated. Example file: ceph-secret.xml
<secret ephemeral='no' private='no'>
    <uuid>51757078-7d63-476f-8524-5d46119cfc8a</uuid>
    <usage type='ceph'>
        <name>The Ceph client key</name>
    </usage>
</secret>
Define a blank secret from this template.
$ sudo virsh secret-define --file ceph-secret.xml
Verify that the secret was created.
$ sudo virsh secret-list
Set the secret to the Ceph client’s key. [10]
$ sudo virsh secret-set-value --secret <GENERATED_UUID> --base64 $(ceph auth get-key client.<USER>)
Finally, the secret needs to be referenced with type “ceph” and either the “usage” (description) or the “uuid” of the secret that was created. [11]
<devices>
    <disk type='network' device='disk'>
        ...
        <auth username='<CLIENT>'>
            <secret type='ceph' usage='The Ceph client key'/>
        </auth>
        ...
    </disk>
    ...
</devices>
CephFS
CephFS has been stable since the Ceph Jewel 10.2.0 release. This now includes repair utilities, including fsck. For clients, it is recommended to use a Linux kernel in the 4 series, or newer, to have the latest features and bug fixes for the file system. [12]
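For example, CephFS can be mounted with the kernel client (a sketch, assuming the default “admin” client and that its key has been saved to /etc/ceph/admin.secret):
$ sudo mkdir -p /mnt/cephfs
$ sudo mount -t ceph <MONITOR_IP>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret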
Bibliography
Karan Singh. Learning Ceph. Birmingham, UK: Packt Publishing, 2015.
“Ceph Jewel Preview: a new store is coming, BlueStore.” Sebastien Han. March 21, 2016. Accessed December 5, 2018. https://www.sebastien-han.fr/blog/2016/03/21/ceph-a-new-store-is-coming/
“CACHE POOL.” Ceph Documentation. Accessed January 19, 2017. http://docs.ceph.com/docs/jewel/dev/cache-pool/
“CEPHX CONFIG REFERENCE.” Ceph Documentation. Accessed January 28, 2017. http://docs.ceph.com/docs/master/rados/configuration/auth-config-ref/
“INTRO TO CEPH.” Ceph Documentation. Accessed January 15, 2017. http://docs.ceph.com/docs/jewel/start/intro/
“Ceph Deployment.” Ceph Jewel Documentation. Accessed January 14, 2017. http://docs.ceph.com/docs/jewel/rados/deployment/
“ceph-ansible Wiki.” ceph-ansible GitHub. February 29, 2016. Accessed January 15, 2017. https://github.com/ceph/ceph-ansible/wiki
“CRUSH MAPS.” Ceph Documentation. Accessed January 29, 2017. http://docs.ceph.com/docs/master/rados/operations/crush-map/
“Ceph: manually repair object.” April 27, 2015. Accessed January 15, 2017. http://ceph.com/planet/ceph-manually-repair-object/
“USING LIBVIRT WITH CEPH RBD.” Ceph Documentation. Accessed January 27, 2017. http://docs.ceph.com/docs/master/rbd/libvirt/
“Secret XML.” libvirt. Accessed January 27, 2017. https://libvirt.org/formatsecret.html
“USING CEPHFS.” Ceph Documentation. Accessed January 15, 2017. http://docs.ceph.com/docs/master/cephfs/
“Ceph Releases (general)”. Ceph Documentation. July 27, 2020. Accessed August 13, 2020. https://docs.ceph.com/docs/master/releases/general/