From: John Wilkins Date: Tue, 18 Sep 2012 18:08:23 +0000 (-0700) Subject: :doc: Rewrote architecture paper. Still needs some work. X-Git-Tag: v0.53~120 X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=3d86194f9eae38e5c4990c32430817788b5eb9f2;p=ceph.git :doc: Rewrote architecture paper. Still needs some work. Signed-off-by: John Wilkins --- diff --git a/doc/architecture.rst b/doc/architecture.rst index 59c02ebe8ee5a..faff2b71935ed 100644 --- a/doc/architecture.rst +++ b/doc/architecture.rst @@ -1,189 +1,348 @@ -====================== - Architecture of Ceph -====================== - -Ceph is a distributed network storage and file system with distributed -metadata management and POSIX semantics. - -RADOS is a reliable object store, used by Ceph, but also directly -accessible. - -``radosgw`` is an S3-compatible RESTful HTTP service for object -storage, using RADOS storage. - -RBD is a Linux kernel feature that exposes RADOS storage as a block -device. Qemu/KVM also has a direct RBD client, that avoids the kernel -overhead. - - -.. index:: monitor, ceph-mon -.. _monitor: - -Monitor cluster +============== + Architecture +============== + +Ceph provides an infinitely scalable object storage system. It is based +upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which +you can read about in +`RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters`_. +Its high-level features include a native interface to the +object storage system via ``librados``, and a number of service interfaces +built on top of ``librados``. These include: + +- **Block Devices:** The RADOS Block Device (RBD) service provides + resizable, thin-provisioned block devices with snapshotting, + cloning, and striping of a block device across the cluster for high + performance. Ceph supports both kernel objects (KO) and a + QEMU hypervisor that uses ``librbd`` directly--avoiding the + kernel object overhead for virtualized systems. + +- **RESTful Gateway:** The RADOS Gateway (RGW) service provides + RESTful APIs with interfaces that are compatible with Amazon S3 + and OpenStack Swift. + +- **Ceph FS**: The Ceph Filesystem (CephFS) service provides + a POSIX-compliant filesystem. + +Ceph OSDs store all data--whether it comes through RBD, RGW, or +CephFS--as objects in the object storage system. Ceph can run +additional instances of OSDs, MDSs, and monitors for scalability +and high availability. The following diagram depicts the +high-level architecture. + +.. ditaa:: +--------+ +----------+ +-------+ +--------+ +------+ + | RBD KO | | QEMU RBD | | RGW | | CephFS | | FUSE | + +--------+ +----------+ +-------+ +--------+ +------+ + +---------------------+ +-----------------+ + | librbd | | libcephfs | + +---------------------+ +-----------------+ + +---------------------------------------------------+ + | librados (C, C++, Java, Python, PHP, etc.) | + +---------------------------------------------------+ + +---------------+ +---------------+ +---------------+ + | OSDs | | MDSs | | Monitors | + +---------------+ +---------------+ +---------------+
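Every interface in the diagram above ultimately reads and writes objects through ``librados``. As a rough, hedged sketch (not part of the original commit), the following shows the Python ``rados`` binding storing and retrieving one object; it assumes a running cluster, a readable ``/etc/ceph/ceph.conf``, and an existing pool named ``data``.

.. code-block:: python

   import rados

   # Connect to the cluster using the monitors listed in ceph.conf.
   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()

   # Open an I/O context on an existing pool (the pool name here is an assumption).
   ioctx = cluster.open_ioctx('data')

   # Write an object into the flat object namespace, then read it back.
   ioctx.write_full('hello-object', b'hello from librados')
   print(ioctx.read('hello-object'))

   ioctx.close()
   cluster.shutdown()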
.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: http://ceph.com/papers/weil-rados-pdsw07.pdf + + +Removing Limitations +==================== + +Today's storage systems have demonstrated an ability to scale out, but with some +significant limitations: interfaces, session managers, and stateful sessions +with a centralized point of access often limit the scalability of today's +storage architectures. Furthermore, a centralized interface that dispatches +requests from clients to server nodes within a cluster and subsequently routes +responses from those server nodes back to clients will hit a scalability and/or +performance limitation. + +Another problem for storage systems is the need to manually rebalance data when +increasing or decreasing the size of a data cluster. Manual rebalancing works +fine on small scales, but it is a nightmare at larger scales because hardware +failure becomes an expectation rather than an exception. + +The operational challenges of managing legacy technologies with the burgeoning +growth in the demand for unstructured storage make legacy technologies +inadequate for scaling into petabytes and beyond. Some legacy technologies +(e.g., SAN) can be considerably more expensive, and more challenging to +maintain when compared to using commodity hardware. Ceph uses commodity +hardware, because it is substantially less expensive to purchase (or to +replace), and it requires only standard system administration skills to use. + + +How Ceph Scales =============== -``ceph-mon`` is a lightweight daemon that provides a consensus for -distributed decisionmaking in a Ceph/RADOS cluster. - -It also is the initial point of contact for new clients, and will hand -out information about the topology of the cluster, such as the -``osdmap``. - -You normally run 3 ``ceph-mon`` daemons, on 3 separate physical machines, -isolated from each other; for example, in different racks or rows. - -You could run just 1 instance, but that means giving up on high -availability. - -You may use the same hosts for ``ceph-mon`` and other purposes. - -``ceph-mon`` processes talk to each other using a Paxos_\-style -protocol. They discover each other via the ``[mon.X] mon addr`` fields -in ``ceph.conf``. - -.. todo:: What about ``monmap``? Fact check. - -Any decision requires the majority of the ``ceph-mon`` processes to be -healthy and communicating with each other. For this reason, you never -want an even number of ``ceph-mon``\s; there is no unambiguous majority -subgroup for an even number. - -.. _Paxos: http://en.wikipedia.org/wiki/Paxos_algorithm - -.. todo:: explain monmap - - -.. index:: RADOS, OSD, ceph-osd, object -.. _rados: - -RADOS -===== - -``ceph-osd`` is the storage daemon that provides the RADOS service. It -uses ``ceph-mon`` for cluster membership, services object read/write/etc -request from clients, and peers with other ``ceph-osd``\s for data -replication. - -The data model is fairly simple on this level. There are multiple -named pools, and within each pool there are named objects, in a flat -namespace (no directories). Each object has both data and metadata. - -The data for an object is a single, potentially big, series of -bytes. Additionally, the series may be sparse, it may have holes that -contain binary zeros, and take up no actual storage. - -The metadata is an unordered set of key-value pairs. It's semantics -are completely up to the client; for example, the Ceph filesystem uses -metadata to store file owner etc. -
-.. todo:: Verify that metadata is unordered. - -Underneath, ``ceph-osd`` stores the data on a local filesystem. We -recommend using Btrfs_, but any POSIX filesystem that has extended -attributes should work. - -.. _Btrfs: http://en.wikipedia.org/wiki/Btrfs - -.. todo:: write about access control - -.. todo:: explain osdmap - -.. todo:: explain plugins ("classes") - - -.. index:: Ceph filesystem, Ceph Distributed File System, MDS, ceph-mds -.. _cephfs: - -Ceph filesystem +In traditional architectures, clients talk to a centralized component (e.g., a gateway, +broker, API, facade, etc.), which acts as a single point of entry to a complex subsystem. +This imposes a limit on both performance and scalability, while introducing a single +point of failure (i.e., if the centralized component goes down, the whole system goes +down too). + +Ceph uses a new and innovative approach. Ceph clients contact a Ceph monitor +and retrieve a copy of the cluster map. The :abbr:`CRUSH (Controlled Replication +Under Scalable Hashing)` algorithm allows a client to compute where data +*should* be stored, and enables the client to contact the primary OSD to store +or retrieve the data. The OSD also uses the CRUSH algorithm, but the OSD uses it +to compute where replicas of data should be stored (and for rebalancing). +For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_. + +The Ceph storage system supports the notion of 'Pools', which are logical +partitions for storing object data. Pools set ownership/access, the number of +object replicas, the number of placement groups, and the CRUSH rule set to use. +Each pool has a number of placement groups that are mapped dynamically to OSDs. +When clients store data, Ceph maps the object data to placement groups. +The following diagram depicts how CRUSH maps objects to placement groups, and +placement groups to OSDs. + +.. ditaa:: + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + | +-----------------------+---+ + +------+------+-------------+ | + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + +Mapping objects to placement groups instead of directly to OSDs creates a layer +of indirection between the OSD and the client. The cluster must be able to grow +(or shrink) and rebalance data dynamically. If the client "knew" which OSD had +the data, that would create a tight coupling between the client and the OSD. +Instead, the CRUSH algorithm maps the data to a placement group and then maps +the placement group to one or more OSDs. This layer of indirection allows Ceph +to rebalance dynamically when new OSDs come online. + +With a copy of the cluster map and the CRUSH algorithm, the client can compute +exactly which OSD to use when reading or writing a particular piece of data.
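To make that indirection concrete, here is a deliberately simplified sketch of the two-step mapping (object name to placement group, placement group to an ordered set of OSDs). It is an illustration only: the pool size, OSD names, and the hash/modulo placement below are invented for the example, and Ceph's real placement uses a stable object hash plus the CRUSH algorithm over a weighted device hierarchy.

.. code-block:: python

   import hashlib

   # Hypothetical pool and cluster-map parameters for illustration.
   pg_num = 128
   osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3']
   replicas = 3

   def object_to_pg(object_name):
       """Hash the object name into one of the pool's placement groups."""
       digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
       return digest % pg_num

   def pg_to_osds(pg_id):
       """Return an ordered set of OSDs for the PG: primary first, then replicas."""
       start = pg_id % len(osds)
       return [osds[(start + i) % len(osds)] for i in range(replicas)]

   pg = object_to_pg('my-object')
   acting = pg_to_osds(pg)
   print('object -> pg %d -> primary %s, replicas %s' % (pg, acting[0], acting[1:]))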
In a typical write scenario, a client uses the CRUSH algorithm to compute where +to store data, maps the data to a placement group, then looks at the CRUSH map +to identify the primary OSD for the placement group. Clients write data +to the identified placement group in the primary OSD. Then, the primary OSD with +its own copy of the CRUSH map identifies the secondary and tertiary OSDs for +replication purposes, and replicates the data to the appropriate placement +groups in the secondary and tertiary OSDs (as many OSDs as additional +replicas), and responds to the client once it has confirmed the data was +stored successfully. + +.. ditaa:: +--------+ Write +--------------+ Replica 1 +----------------+ + | Client |*-------------->| Primary OSD |*---------------->| Secondary OSD | + | |<--------------*| |<----------------*| | + +--------+ Write Ack +--------------+ Replica 1 Ack +----------------+ + ^ * + | | Replica 2 +----------------+ + | +----------------------->| Tertiary OSD | + +--------------------------*| | + Replica 2 Ack +----------------+ + + +Since any network device has a limit on the number of concurrent connections it +can support, a centralized system has a low physical limit at high scales. By +enabling clients to contact nodes directly, Ceph increases both performance and +total system capacity simultaneously, while removing a single point of failure. +Ceph clients can maintain sessions when they need to, with a particular +OSD rather than with a centralized server. + +.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: http://ceph.com/papers/weil-crush-sc06.pdf + + +Peer-Aware Nodes +================ + +Ceph's cluster map determines whether a node in a network is ``in`` the +Ceph cluster or ``out`` of the Ceph cluster. + +.. ditaa:: +----------------+ + | | + | Node ID In | + | | + +----------------+ + ^ + | + | + v + +----------------+ + | | + | Node ID Out | + | | + +----------------+ + +In most clustered architectures, the primary purpose of cluster membership +is so that a centralized interface knows which hosts it can access. Ceph +takes it a step further: Ceph's nodes are cluster aware. Each node knows +about other nodes in the cluster. This enables Ceph's monitor, OSD, and +metadata server daemons to interact directly with each other. One major +benefit of this approach is that Ceph can utilize the CPU and RAM of its +nodes to easily perform tasks that would bog down a centralized server. + +.. todo:: Explain OSD maps, Monitor Maps, MDS maps + + +Smart OSDs +========== + +Ceph OSDs join a cluster and report on their status. At the lowest level, +the OSD status is ``up`` or ``down``, reflecting whether or not it is +running and able to service requests. If an OSD is ``down`` and ``in`` +the cluster, it may indicate the failure of the OSD. + +With peer awareness, OSDs can communicate with other OSDs and monitors +to perform tasks. OSDs take client requests to read data from or write +data to pools, which have placement groups. When a client makes a request +to write data to a primary OSD, the primary OSD knows how to determine +which OSDs have the placement groups for the replica copies, and then +update those OSDs. This means that OSDs can also take requests from +other OSDs. With multiple replicas of data across OSDs, OSDs can also +"peer" to ensure that the placement groups are in sync. See +`Placement Group States`_ and `Placement Group Concepts`_ for details. + +If an OSD is not running (e.g., it crashes), the OSD cannot notify the monitor +that it is ``down``. The monitor can ping an OSD periodically to ensure that it +is running. However, Ceph also empowers OSDs to determine if a neighboring OSD +is ``down``, to update the cluster map and to report it to the monitor(s). When +an OSD is ``down``, the data in the placement group is said to be ``degraded``. +If the OSD is ``down`` and ``in``, but subsequently taken ``out`` of the +cluster, the OSDs receive an update to the cluster map and rebalance the +placement groups within the cluster automatically. + +OSDs store all data as objects in a flat namespace (e.g., no hierarchy of +directories). An object has an identifier, binary data, and metadata consisting +of a set of name/value pairs. The semantics are completely up to the client. For +example, CephFS uses metadata to store file attributes such as the file owner, +created date, last modified date, and so forth. + + +.. ditaa:: /------+------------------------------+----------------\ + | ID | Binary Data | Metadata | + +------+------------------------------+----------------+ + | 1234 | 0101010101010100110101010010 | name1 = value1 | + | | 0101100001010100110101010010 | name2 = value2 | + | | 0101100001010100110101010010 | nameN = valueN | + \------+------------------------------+----------------/
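As a hedged illustration of that object model, per-object metadata is exposed to clients as extended attributes. The sketch below uses the Python ``rados`` binding and the same hypothetical ``data`` pool as the earlier example; the object ID and attribute names simply mirror the table above.

.. code-block:: python

   import rados

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   ioctx = cluster.open_ioctx('data')   # assumed pool name

   # An object is just an identifier, binary data, and name/value metadata.
   ioctx.write_full('1234', b'0101010101010100110101010010')
   ioctx.set_xattr('1234', 'name1', b'value1')   # semantics are up to the client
   ioctx.set_xattr('1234', 'name2', b'value2')

   # List the object's metadata back out.
   for name, value in ioctx.get_xattrs('1234'):
       print(name, value)

   ioctx.close()
   cluster.shutdown()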
As part of maintaining data consistency and cleanliness, Ceph OSDs +can also scrub the data. That is, Ceph OSDs can compare object metadata +across replicas to catch OSD bugs or filesystem errors (daily). OSDs can +also do deeper scrubbing by comparing data in objects bit-for-bit to find +bad sectors on a disk that weren't apparent in a light scrub (weekly). + +.. todo:: explain "classes" + +.. _Placement Group States: ../cluster-ops/pg-states +.. _Placement Group Concepts: ../cluster-ops/pg-concepts + +Monitor Quorums =============== -The Ceph filesystem service is provided by a daemon called -``ceph-mds``. It uses RADOS to store all the filesystem metadata -(directories, file ownership, access modes, etc), and directs clients -to access RADOS directly for the file contents. +Ceph's monitors maintain a master copy of the cluster map, so Ceph daemons and +clients need only contact a monitor periodically to ensure they have the most +recent copy of the cluster map. Ceph monitors are lightweight processes, but +for added reliability and fault tolerance, Ceph supports distributed monitors. +Ceph must have agreement among the various monitor instances regarding the state of +the cluster. To establish a consensus, Ceph always uses an odd number of +monitors (e.g., 1, 3, 5, 7, etc.) and the `Paxos`_ algorithm. + +.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) + +MDS +=== + +The Ceph filesystem service is provided by a daemon called ``ceph-mds``. It uses +RADOS to store all the filesystem metadata (directories, file ownership, access +modes, etc), and directs clients to access RADOS directly for the file contents. +The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a +single process, or it can be distributed out to multiple physical machines, +either for high availability or for scalability. + +- **High Availability**: The extra ``ceph-mds`` instances can be `standby`, + ready to take over the duties of any failed ``ceph-mds`` that was + `active`. This is easy because all the data, including the journal, is + stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
+ +- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they + will split the directory tree into subtrees (and shards of a single + busy directory), effectively balancing the load amongst all `active` + servers. -The Ceph filesystem aims for POSIX compatibility, except for a few -chosen differences. See :doc:`/appendix/differences-from-posix`. +Combinations of `standby` and `active` etc. are possible, for example +running 3 `active` ``ceph-mds`` instances for scaling, and one `standby` +instance for high availability. -``ceph-mds`` can run as a single process, or it can be distributed out to -multiple physical machines, either for high availability or for -scalability. -For high availability, the extra ``ceph-mds`` instances can be `standby`, -ready to take over the duties of any failed ``ceph-mds`` that was -`active`. This is easy because all the data, including the journal, is -stored on RADOS. The transition is triggered automatically by -``ceph-mon``. +Client Interfaces +================= -For scalability, multiple ``ceph-mds`` instances can be `active`, and they -will split the directory tree into subtrees (and shards of a single -busy directory), effectively balancing the load amongst all `active` -servers. +librados +-------- -Combinations of `standby` and `active` etc are possible, for example -running 3 `active` ``ceph-mds`` instances for scaling, and one `standby`. +.. todo:: Cephx. Summarize how much Ceph trusts the client, for what parts (security vs reliability). +.. todo:: Access control +.. todo:: Snapshotting, Import/Export, Backup +.. todo:: native APIs -To control the number of `active` ``ceph-mds``\es, see -:doc:`/ops/manage/grow/mds`. +RBD +--- -.. topic:: Status as of 2011-09: +RBD stripes a block device image over multiple objects in the cluster, where +each object gets mapped to a placement group and distributed, and the placement +groups are spread across separate ``ceph-osd`` daemons throughout the cluster. - Multiple `active` ``ceph-mds`` operation is stable under normal - circumstances, but some failure scenarios may still cause - operational issues. +.. important:: Striping allows RBD block devices to perform better than a single server could! -.. todo:: document `standby-replay` +RBD's thin-provisioned snapshottable block devices are an attractive option for +virtualization and cloud computing. In virtual machine scenarios, people +typically deploy RBD with the ``rbd`` network storage driver in Qemu/KVM, where +the host machine uses ``librbd`` to provide a block device service to the guest. +Many cloud computing stacks use ``libvirt`` to integrate with hypervisors. You +can use RBD thin-provisioned block devices with Qemu and libvirt to support +OpenStack and CloudStack among other solutions. -.. todo:: mds.0 vs mds.alpha etc details +While we do not provide ``librbd`` support with other hypervisors at this time, you may +also use RBD kernel objects to provide a block device to a client. Other virtualization +technologies such as Xen can access the RBD kernel object(s). This is done with the +command-line tool ``rbd``.
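For programmatic access (as opposed to the ``rbd`` command-line tool), ``librbd`` also has a Python binding. The following is a minimal, hedged sketch that creates and writes a thin-provisioned image; it assumes the ``rbd`` Python module is installed and that a pool named ``rbd`` exists.

.. code-block:: python

   import rados
   import rbd

   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
   cluster.connect()
   ioctx = cluster.open_ioctx('rbd')   # assumed pool name

   # Create a 4 GiB image; being thin-provisioned, it consumes space only as data is written.
   rbd.RBD().create(ioctx, 'myimage', 4 * 1024 ** 3)

   # Write through librbd; the image is striped over many RADOS objects behind the scenes.
   image = rbd.Image(ioctx, 'myimage')
   image.write(b'hello rbd', 0)
   image.close()

   ioctx.close()
   cluster.shutdown()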
-.. index:: RADOS Gateway, radosgw -.. _radosgw: +RGW +--- -``radosgw`` =========== +The RADOS Gateway daemon, ``radosgw``, is a FastCGI service that provides a +RESTful_ HTTP API to store objects and metadata. It layers on top of RADOS with +its own data formats, and maintains its own user database, authentication, and +access control. The RADOS Gateway uses a unified namespace, which means you can +use either the OpenStack Swift-compatible API or the Amazon S3-compatible API. +For example, you can write data using the S3-compatible API with one application +and then read data using the Swift-compatible API with another application. -``radosgw`` is a FastCGI service that provides a RESTful_ HTTP API to -store objects and metadata. It layers on top of RADOS with its own -data formats, and maintains it's own user database, authentication, -access control, and so on. +See `RADOS Gateway`_ for details. +.. _RADOS Gateway: ../radosgw/ .. _RESTful: http://en.wikipedia.org/wiki/RESTful .. index:: RBD, Rados Block Device -.. _rbd: - -Rados Block Device (RBD) -======================== - -In virtual machine scenarios, RBD is typically used via the ``rbd`` -network storage driver in Qemu/KVM, where the host machine uses -``librbd`` to provide a block device service to the guest. - -Alternatively, as no direct ``librbd`` support is available in Xen, -the Linux kernel can act as the RBD client and provide a real block -device on the host machine, that can then be accessed by the -virtualization. This is done with the command-line tool ``rbd`` (see -:doc:`/ops/rbd`). - -The latter is also useful in non-virtualized scenarios. - -Internally, RBD stripes the device image over multiple RADOS objects, -each typically located on a separate ``ceph-osd``, allowing it to perform -better than a single server could. - - -Client -====== - -.. todo:: cephfs, ceph-fuse, librados, libcephfs, librbd - -.. todo:: Summarize how much Ceph trusts the client, for what parts (security vs reliability). -TODO -==== +CephFS ------ -.. todo:: Example scenarios Ceph projects are/not suitable for +.. todo:: cephfs, ceph-fuse \ No newline at end of file