From 8e166fa1b70a75b62dd89f156e4653fb150dccaf Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Wed, 2 Aug 2017 15:48:36 -0400
Subject: [PATCH] doc/rados/configuration: document bluestore

Initial pass here. Not yet complete.

Signed-off-by: Sage Weil
---
 .../filesystem-recommendations.rst          | 62 --------------
 doc/rados/configuration/index.rst           |  2 +-
 doc/rados/configuration/storage-devices.rst | 83 +++++++++++++++++++
 3 files changed, 84 insertions(+), 63 deletions(-)
 delete mode 100644 doc/rados/configuration/filesystem-recommendations.rst
 create mode 100644 doc/rados/configuration/storage-devices.rst

diff --git a/doc/rados/configuration/filesystem-recommendations.rst b/doc/rados/configuration/filesystem-recommendations.rst
deleted file mode 100644
index c967d60ce07..00000000000
--- a/doc/rados/configuration/filesystem-recommendations.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-===========================================
- Hard Disk and File System Recommendations
-===========================================
-
-.. index:: hard drive preparation
-
-Hard Drive Prep
-===============
-
-Ceph aims for data safety, which means that when the :term:`Ceph Client`
-receives notice that data was written to a storage drive, that data was actually
-written to the storage drive. For old kernels (<2.6.33), disable the write cache
-if the journal is on a raw drive. Newer kernels should work fine.
-
-Use ``hdparm`` to disable write caching on the hard disk::
-
-  sudo hdparm -W 0 /dev/hda 0
-
-In production environments, we recommend running a :term:`Ceph OSD Daemon` with
-separate drives for the operating system and the data. If you run data and an
-operating system on a single disk, we recommend creating a separate partition
-for your data.
-
-.. index:: filesystems
-
-Filesystems
-===========
-
-Ceph OSD Daemons rely heavily upon the stability and performance of the
-underlying filesystem.
-
-Recommended
------------
-
-We currently recommend ``XFS`` for production deployments.
-
-Not recommended
----------------
-
-We recommand *against* using ``btrfs`` due to the lack of a stable
-version to test against and frequent bugs in the ENOSPC handling.
-
-We recommend *against* using ``ext4`` due to limitations in the size
-of xattrs it can store, and the problems this causes with the way Ceph
-handles long RADOS object names. Although these issues will generally
-not surface with Ceph clusters using only short object names (e.g., an
-RBD workload that does not include long RBD image names), other users
-like RGW make extensive use of long object names and can break.
-
-Starting with the Jewel release, the ``ceph-osd`` daemon will refuse
-to start if the configured max object name cannot be safely stored on
-``ext4``. If the cluster is only being used with short object names
-(e.g., RBD only), you can continue using ``ext4`` by setting the
-following configuration option::
-
-  osd max object name len = 256
-  osd max object namespace len = 64
-
-.. note:: This may result in difficult-to-diagnose errors if you try
-          to use RGW or other librados clients that do not properly
-          handle or politely surface any resulting ENAMETOOLONG
-          errors.
diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst
index 264141c1047..48b58efb707 100644
--- a/doc/rados/configuration/index.rst
+++ b/doc/rados/configuration/index.rst
@@ -32,7 +32,7 @@ For general object store configuration, refer to the following:
 .. toctree::
    :maxdepth: 1
 
-   Disks and Filesystems <filesystem-recommendations>
+   Storage devices <storage-devices>
    ceph-conf
diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 00000000000..83c0c9b9fad
--- /dev/null
+++ b/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,83 @@
+=================
+ Storage Devices
+=================
+
+There are two Ceph daemons that store data on disk:
+
+* **Ceph OSDs** (or Object Storage Daemons) are where most of the
+  data is stored in Ceph. Generally speaking, each OSD is backed by
+  a single storage device, like a traditional hard disk (HDD) or
+  solid state disk (SSD). OSDs can also be backed by a combination
+  of devices, like an HDD for most data and an SSD (or partition of
+  an SSD) for some metadata. The number of OSDs in a cluster is
+  generally a function of how much data will be stored, how big each
+  storage device will be, and the level and type of redundancy
+  (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state like cluster
+  membership and authentication information. For smaller clusters a
+  few gigabytes is all that is needed, although for larger clusters
+  the monitor database can reach tens or possibly hundreds of
+  gigabytes.
+
+
+OSD Backends
+============
+
+There are two ways that OSDs can manage the data they store. Starting
+with the Luminous 12.2.z release, the new default (and recommended)
+backend is *BlueStore*. Prior to Luminous, the default (and only
+option) was *FileStore*.
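+
+For example (a sketch only, assuming a Luminous-era release where the
+``osd objectstore`` option controls which backend newly created OSDs
+use), the backend can be selected in ``ceph.conf``, and the backend of
+a running OSD can be read back from its metadata::
+
+    # ceph.conf: applies to OSDs created after this option is set
+    [osd]
+    osd objectstore = bluestore
+
+    # report the backend in use by an existing OSD (osd.0 here)
+    ceph osd metadata 0 | grep osd_objectstore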
+
+BlueStore
+---------
+
+BlueStore is a special-purpose storage backend designed specifically
+for managing data on disk for Ceph OSD workloads. It is motivated by
+experience supporting and managing OSDs using FileStore over the
+last ten years. Key BlueStore features include:
+
+* Direct management of storage devices. BlueStore consumes raw block
+  devices or partitions. This avoids any intervening layers of
+  abstraction (such as local file systems like XFS) that may limit
+  performance or add complexity.
+* Metadata management with RocksDB. We embed RocksDB's key/value
+  database in order to manage internal metadata, such as the mapping
+  from object names to block locations on disk.
+* Full data and metadata checksumming. By default all data and
+  metadata written to BlueStore is protected by one or more
+  checksums. No data or metadata will be read from disk or returned
+  to the user without being verified.
+* Inline compression. Data written may be optionally compressed
+  before being written to disk.
+* Multi-device metadata tiering. BlueStore allows its internal
+  journal (write-ahead log) to be written to a separate, high-speed
+  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
+  a significant amount of faster storage is available, internal
+  metadata can also be stored on the faster device, as sketched in
+  the example below.
+* Efficient copy-on-write. RBD and CephFS snapshots rely on a
+  copy-on-write *clone* mechanism that is implemented efficiently in
+  BlueStore. This results in efficient IO both for regular snapshots
+  and for erasure coded pools (which rely on cloning to implement
+  efficient two-phase commits).
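+
+For example, the multi-device layout described above can be set up
+when an OSD is provisioned. The following is only a sketch: the exact
+tool and flags depend on the release (``ceph-disk`` is shown here;
+later releases use ``ceph-volume``), and the device names are
+placeholders::
+
+    # object data on an HDD, RocksDB metadata (block.db) on a faster SSD
+    ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1p1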
+
+For more information, see :doc:`bluestore-config-ref`.
+
+FileStore
+---------
+
+FileStore is the legacy approach to storing objects in Ceph. It
+relies on a standard file system (normally XFS) in combination with a
+key/value database (traditionally LevelDB, now RocksDB) for some
+metadata.
+
+FileStore is well-tested and widely used in production but suffers
+from many performance deficiencies due to its overall design and
+reliance on a traditional file system for storing object data.
+
+Although FileStore is generally capable of functioning on most
+POSIX-compatible file systems (including btrfs and ext4), we only
+recommend that XFS be used. Both btrfs and ext4 have known bugs and
+deficiencies and their use may lead to data loss. By default all Ceph
+provisioning tools will use XFS.
+
+For more information, see :doc:`filestore-config-ref`.
-- 
2.39.5