-======================
- Architecture of Ceph
-======================
-
-Ceph is a distributed network storage and file system with distributed
-metadata management and POSIX semantics.
-
-RADOS is a reliable object store, used by Ceph, but also directly
-accessible.
-
-``radosgw`` is an S3-compatible RESTful HTTP service for object
-storage, using RADOS storage.
-
-RBD is a Linux kernel feature that exposes RADOS storage as a block
-device. Qemu/KVM also has a direct RBD client, that avoids the kernel
-overhead.
-
-
-.. index:: monitor, ceph-mon
-.. _monitor:
-
-Monitor cluster
+==============
+ Architecture
+==============
+
+Ceph provides an infinitely scalable object storage system. It is based
+upon on :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which
+you can read about in
+`RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters`_.
+Its high-level features include providing a native interface to the
+object storage system via ``librados``, and a number of service interfaces
+built on top of ``librados``. These include:
+
+- **Block Devices:** The RADOS Block Device (RBD) service provides
+ resizable, thin-provisioned block devices with snapshotting,
+ cloning, striping a block device across the cluster for high
+ performance. Ceph supports both kernel objects (KO) and a
+ QEMU hypervisor that uses ``librbd`` directly--avoiding the
+ kernel object overhead for virtualized systems.
+
+- **RESTful Gateway:** The RADOS Gateway (RGW) service provides
+ RESTful APIs with interfaces that are compatible with Amazon S3
+ and OpenStack Swift.
+
+- **Ceph FS**: The Ceph Filesystem (CephFS) service provides
+ a POSIX compliant filesystem.
+
+Ceph OSDs store all data--whether it comes through RBD, RGW, or
+CephFS--as objects in the object storage system. Ceph can run
+additional instances of OSDs, MDSs, and monitors for scalability
+and high availability. The following diagram depicts the
+high-level architecture.
+
+.. ditaa:: +--------+ +----------+ +-------+ +--------+ +------+
+ | RBD KO | | QeMu RBD | | RGW | | CephFS | | FUSE |
+ +--------+ +----------+ +-------+ +--------+ +------+
+ +---------------------+ +-----------------+
+ | librbd | | libcephfs |
+ +---------------------+ +-----------------+
+ +---------------------------------------------------+
+ | librados (C, C++, Java, Python, PHP, etc.) |
+ +---------------------------------------------------+
+ +---------------+ +---------------+ +---------------+
+ | OSDs | | MDSs | | Monitors |
+ +---------------+ +---------------+ +---------------+
+
+
+.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: http://ceph.com/papers/weil-rados-pdsw07.pdf
+
+
+Removing Limitations
+====================
+
+Today's storage systems have demonstrated an ability to scale out, but with some
+significant limitations: interfaces, session managers, and stateful sessions
+with a centralized point of access often limit the scalability of today's
+storage architectures. Furthermore, a centralized interface that dispatches
+requests from clients to server nodes within a cluster and subsequently routes
+responses from those server nodes back to clients will hit a scalability and/or
+performance limitation.
+
+Another problem for storage systems is the need to manually rebalance data when
+increasing or decreasing the size of a data cluster. Manual rebalancing works
+fine on small scales, but it is a nightmare at larger scales because hardware
+failure becomes an expectation rather than an exception.
+
+The operational challenges of managing legacy technologies with the burgeoning
+growth in the demand for unstructured storage makes legacy technologies
+inadequate for scaling into petabytes and beyond. Some legacy technologies
+(e.g., SAN) can be considerably more expensive, and more challenging to
+maintain when compared to using commodity hardware. Ceph uses commodity
+hardware, becaues it is substantially less expensive to purchase (or to
+replace), and it only requires standard system administration skills to use it.
+
+
+How Ceph Scales
===============
-``ceph-mon`` is a lightweight daemon that provides a consensus for
-distributed decisionmaking in a Ceph/RADOS cluster.
-
-It also is the initial point of contact for new clients, and will hand
-out information about the topology of the cluster, such as the
-``osdmap``.
-
-You normally run 3 ``ceph-mon`` daemons, on 3 separate physical machines,
-isolated from each other; for example, in different racks or rows.
-
-You could run just 1 instance, but that means giving up on high
-availability.
-
-You may use the same hosts for ``ceph-mon`` and other purposes.
-
-``ceph-mon`` processes talk to each other using a Paxos_\-style
-protocol. They discover each other via the ``[mon.X] mon addr`` fields
-in ``ceph.conf``.
-
-.. todo:: What about ``monmap``? Fact check.
-
-Any decision requires the majority of the ``ceph-mon`` processes to be
-healthy and communicating with each other. For this reason, you never
-want an even number of ``ceph-mon``\s; there is no unambiguous majority
-subgroup for an even number.
-
-.. _Paxos: http://en.wikipedia.org/wiki/Paxos_algorithm
-
-.. todo:: explain monmap
-
-
-.. index:: RADOS, OSD, ceph-osd, object
-.. _rados:
-
-RADOS
-=====
-
-``ceph-osd`` is the storage daemon that provides the RADOS service. It
-uses ``ceph-mon`` for cluster membership, services object read/write/etc
-request from clients, and peers with other ``ceph-osd``\s for data
-replication.
-
-The data model is fairly simple on this level. There are multiple
-named pools, and within each pool there are named objects, in a flat
-namespace (no directories). Each object has both data and metadata.
-
-The data for an object is a single, potentially big, series of
-bytes. Additionally, the series may be sparse, it may have holes that
-contain binary zeros, and take up no actual storage.
-
-The metadata is an unordered set of key-value pairs. It's semantics
-are completely up to the client; for example, the Ceph filesystem uses
-metadata to store file owner etc.
-
-.. todo:: Verify that metadata is unordered.
-
-Underneath, ``ceph-osd`` stores the data on a local filesystem. We
-recommend using Btrfs_, but any POSIX filesystem that has extended
-attributes should work.
-
-.. _Btrfs: http://en.wikipedia.org/wiki/Btrfs
-
-.. todo:: write about access control
-
-.. todo:: explain osdmap
-
-.. todo:: explain plugins ("classes")
-
-
-.. index:: Ceph filesystem, Ceph Distributed File System, MDS, ceph-mds
-.. _cephfs:
-
-Ceph filesystem
+In traditional architectures, clients talk to a centralized component (e.g., a gateway,
+broker, API, facade, etc.), which acts as a single point of entry to a complex subsystem.
+This imposes a limit to both performance and scalability, while introducing a single
+point of failure (i.e., if the centralized component goes down, the whole system goes
+down too).
+
+Ceph uses a new and innovative approach. Ceph clients contact a Ceph monitor
+and retrieve a copy of the cluster map. The :abbr:`CRUSH (Controlled Replication
+Under Scalable Hashing)` algorithm allows a client to compute where data
+*should* be stored, and enables the client to contact the primary OSD to store
+or retrieve the data. The OSD also uses the CRUSH algorithm, but the OSD uses it
+to compute where replicas of data should be stored (and for rebalancing).
+For a detailed discussion of CRUSH, see
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
+
+The Ceph storage system supports the notion of 'Pools', which are logical
+partitions for storing object data. Pools set ownership/access, the number of
+object replicas, the number of placement groups, and the CRUSH rule set to use.
+Each pool has a number of placement groups that are mapped dynamically to OSDs.
+When clients store data, Ceph maps the object data to placement groups.
+The following diagram depicts how CRUSH maps objects to placement groups, and
+placement groups to OSDs.
+
+.. ditaa::
+ /-----\ /-----\ /-----\ /-----\ /-----\
+ | obj | | obj | | obj | | obj | | obj |
+ \-----/ \-----/ \-----/ \-----/ \-----/
+ | | | | |
+ +--------+--------+ +---+----+
+ | |
+ v v
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | |
+ | +-----------------------+---+
+ +------+------+-------------+ |
+ | | | |
+ v v v v
+ /----------\ /----------\ /----------\ /----------\
+ | | | | | | | |
+ | OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 |
+ | | | | | | | |
+ \----------/ \----------/ \----------/ \----------/
+
+Mapping objects to placement groups instead of directly to OSDs creates a layer
+of indirection between the OSD and the client. The cluster must be able to grow
+(or shrink) and rebalance data dynamically. If the client "knew" which OSD had
+the data, that would create a tight coupling between the client and the OSD.
+Instead, the CRUSH algorithm maps the data to a placement group and then maps
+the placement group to one or more OSDs. This layer of indirection allows Ceph
+to rebalance dynamically when new OSDs come online.
+
+With a copy of the cluster map and the CRUSH algorithm, the client can compute
+exactly which OSD to use when reading or writing a particular piece of data.
+
+In a typical write scenario, a client uses the CRUSH algorithm to compute where
+to store data, maps the data to a placement group, then looks at the CRUSH map
+to identify the primary primary OSD for the placement group. Clients write data
+to the identified placement group in the primary OSD. Then, the primary OSD with
+its own copy of the CRUSH map identifies the secondary and tertiary OSDs for
+replication purposes, and replicates the data to the appropriate placement
+groups in the secondary and tertiary OSDs (as many OSDs as additional
+replicas), and responds to the client once it has confirmed the data was
+stored successfully.
+
+.. ditaa:: +--------+ Write +--------------+ Replica 1 +----------------+
+ | Client |*-------------->| Primary OSD |*---------------->| Secondary OSD |
+ | |<--------------*| |<----------------*| |
+ +--------+ Write Ack +--------------+ Replica 1 Ack +----------------+
+ ^ *
+ | | Replica 2 +----------------+
+ | +----------------------->| Tertiary OSD |
+ +--------------------------*| |
+ Replica 2 Ack +----------------+
+
+
+Since any network device has a limit to the number of concurrent connections it
+can support, a centralized system has a low physical limit at high scales. By
+enabling clients to contact nodes directly, Ceph increases both performance and
+total system capacity simultaneously, while removing a single point of failure.
+Ceph clients can maintain a session when they need to, and with a particular
+OSD instead of a centralized server.
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: http://ceph.com/papers/weil-crush-sc06.pdf
+
+
+Peer-Aware Nodes
+================
+
+Ceph's cluster map determines whether a node in a network is ``in`` the
+Ceph cluster or ``out`` of the Ceph cluster.
+
+.. ditaa:: +----------------+
+ | |
+ | Node ID In |
+ | |
+ +----------------+
+ ^
+ |
+ |
+ v
+ +----------------+
+ | |
+ | Node ID Out |
+ | |
+ +----------------+
+
+In must clustered architectures the primary purpose of cluster membership
+is so that a centralized interface knows which hosts it can access. Ceph
+takes it a step further: Cephs nodes are cluster aware. Each node knows
+about other nodes in the cluster. This enables Ceph's monitor, OSD, and
+metadata server daemons to interact directly with each other. One major
+benefit of this approach is that Ceph can utilize the CPU and RAM of its
+nodes to easily perform tasks that would bog down a centralized server.
+
+.. todo:: Explain OSD maps, Monitor Maps, MDS maps
+
+
+Smart OSDs
+==========
+
+Ceph OSDs join a cluster and report on their status. At the lowest level,
+the OSD status is ``up`` or ``down`` reflecting whether or not it is
+running and able service requests. If an OSD is ``down`` and ``in``
+the cluster, it may indicate the failure of the OSD.
+
+With peer awareness, OSDs can communicate with other OSDs and monitors
+to perform tasks. OSDs take client requests to read data from or write
+data to pools, which have placement groups. When a client makes a request
+to write data to a primary OSD, the primary OSD knows how to determine
+which OSDs have the placement groups for the replica copies, and then
+update those OSDs. This means that OSDs can also take requests from
+other OSDs. With multiple replicas of data across OSDs, OSDs can also
+"peer" to ensure that the placement groups are in sync. See
+`Placement Group States`_ and `Placement Group Concepts`_ for details.
+
+If an OSD is not running (e.g., it crashes), the OSD cannot notify the monitor
+that it is ``down``. The monitor can ping an OSD periodically to ensure that it
+is running. However, Ceph also empowers OSDs to determine if a neighboring OSD
+is ``down``, to update the cluster map and to report it to the monitor(s). When
+an OSD is ``down``, the data in the placement group is said to be ``degraded``.
+If the OSD is ``down`` and ``in``, but subsequently taken ``out`` of the
+cluster, the OSDs receive an update to the cluster map and rebalance the
+placement groups within the cluster automatically.
+
+OSDs store all data as objects in a flat namespace (e.g., no hieararchy of
+directories). An object has an identifier, binary data, and metadata consisting
+of a set of name/value pairs. The semantics are completely up to the client. For
+example, CephFS uses metadata to store file attributes such as the file owner,
+created date, last modified date, and so forth.
+
+
+.. ditaa:: /------+------------------------------+----------------\
+ | ID | Binary Data | Metadata |
+ +------+------------------------------+----------------+
+ | 1234 | 0101010101010100110101010010 | name1 = value1 |
+ | | 0101100001010100110101010010 | name2 = value2 |
+ | | 0101100001010100110101010010 | nameN = valueN |
+ \------+------------------------------+----------------/
+
+As part of maintaining data consistency and cleanliness, Ceph OSDs
+can also scrub the data. That is, Ceph OSDs can compare object metadata
+across replicas to catch OSD bugs or filesystem errors (daily). OSDs can
+also do deeper scrubbing by comparing data in objects bit-for-bit to find
+bad sectors on a disk that weren't apparent in a light scrub (weekly).
+
+.. todo:: explain "classes"
+
+.. _Placement Group States: ../cluster-ops/pg-states
+.. _Placement Group Concepts: ../cluster-ops/pg-concepts
+
+Monitor Quorums
===============
-The Ceph filesystem service is provided by a daemon called
-``ceph-mds``. It uses RADOS to store all the filesystem metadata
-(directories, file ownership, access modes, etc), and directs clients
-to access RADOS directly for the file contents.
+Ceph's monitors maintain a master copy of the cluster map. So Ceph daemons and
+clients merely contact the monitor periodically to ensure they have the most
+recent copy of the cluster map. Ceph monitors are light-weight processes, but
+for added reliability and fault tolerance, Ceph supports distributed monitors.
+Ceph must have agreement among various monitor instances regarding the state of
+the cluster. To establish a consensus, Ceph always uses an odd number of
+monitors (e.g., 1, 3, 5, 7, etc) and the `Paxos`_ algorithm in order to
+establish a consensus.
+
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
+
+MDS
+===
+
+The Ceph filesystem service is provided by a daemon called ``ceph-mds``. It uses
+RADOS to store all the filesystem metadata (directories, file ownership, access
+modes, etc), and directs clients to access RADOS directly for the file contents.
+The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
+single process, or it can be distributed out to multiple physical machines,
+either for high availability or for scalability.
+
+- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
+ ready to take over the duties of any failed ``ceph-mds`` that was
+ `active`. This is easy because all the data, including the journal, is
+ stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
+
+- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
+ will split the directory tree into subtrees (and shards of a single
+ busy directory), effectively balancing the load amongst all `active`
+ servers.
-The Ceph filesystem aims for POSIX compatibility, except for a few
-chosen differences. See :doc:`/appendix/differences-from-posix`.
+Combinations of `standby` and `active` etc are possible, for example
+running 3 `active` ``ceph-mds`` instances for scaling, and one `standby`
+intance for high availability.
-``ceph-mds`` can run as a single process, or it can be distributed out to
-multiple physical machines, either for high availability or for
-scalability.
-For high availability, the extra ``ceph-mds`` instances can be `standby`,
-ready to take over the duties of any failed ``ceph-mds`` that was
-`active`. This is easy because all the data, including the journal, is
-stored on RADOS. The transition is triggered automatically by
-``ceph-mon``.
+Client Interfaces
+=================
-For scalability, multiple ``ceph-mds`` instances can be `active`, and they
-will split the directory tree into subtrees (and shards of a single
-busy directory), effectively balancing the load amongst all `active`
-servers.
+librados
+--------
-Combinations of `standby` and `active` etc are possible, for example
-running 3 `active` ``ceph-mds`` instances for scaling, and one `standby`.
+.. todo:: Cephx. Summarize how much Ceph trusts the client, for what parts (security vs reliability).
+.. todo:: Access control
+.. todo:: Snapshotting, Import/Export, Backup
+.. todo:: native APIs
-To control the number of `active` ``ceph-mds``\es, see
-:doc:`/ops/manage/grow/mds`.
+RBD
+---
-.. topic:: Status as of 2011-09:
+RBD stripes a block device image over multiple objects in the cluster, where
+each object gets mapped to a placement group and distributed, and the placement
+groups are spread across separate ``ceph-osd`` daemons throughout the cluster.
- Multiple `active` ``ceph-mds`` operation is stable under normal
- circumstances, but some failure scenarios may still cause
- operational issues.
+.. important:: Striping allows RBD block devices to perform better than a single server could!
-.. todo:: document `standby-replay`
+RBD's thin-provisioned snapshottable block devices are an attractive option for
+virtualization and cloud computing. In virtual machine scenarios, people
+typically deploy RBD with the ``rbd`` network storage driver in Qemu/KVM, where
+the host machine uses ``librbd`` to provide a block device service to the guest.
+Many cloud computing stacks use ``libvirt`` to integrate with hypervisors. You
+can use RBD thin-provisioned block devices with Qemu and libvirt to support
+OpenStack and CloudStack among other solutions.
-.. todo:: mds.0 vs mds.alpha etc details
+While we do not provide ``librbd`` support with other hypervisors at this time, you may
+also use RBD kernel objects to provide a block device to a client. Other virtualization
+technologies such as Xen can access the RBD kernel object(s). This is done with the
+command-line tool ``rbd``.
-.. index:: RADOS Gateway, radosgw
-.. _radosgw:
+RGW
+---
-``radosgw``
-===========
+The RADOS Gateway daemon, ``radosgw``, is a FastCGI service that provides a
+RESTful_ HTTP API to store objects and metadata. It layers on top of RADOS with
+its own data formats, and maintains it's own user database, authentication, and
+access control. The RADOS Gateway uses a unified namespace, which means you can
+use either the OpenStack Swift-compatible API or the Amazon S3-compatible API.
+For example, you can write data using the S3-comptable API with one application
+and then read data using the Swift-compatible API with another application.
-``radosgw`` is a FastCGI service that provides a RESTful_ HTTP API to
-store objects and metadata. It layers on top of RADOS with its own
-data formats, and maintains it's own user database, authentication,
-access control, and so on.
+See `RADOS Gateway`_ for details.
+.. _RADOS Gateway: ../radosgw/
.. _RESTful: http://en.wikipedia.org/wiki/RESTful
.. index:: RBD, Rados Block Device
-.. _rbd:
-
-Rados Block Device (RBD)
-========================
-
-In virtual machine scenarios, RBD is typically used via the ``rbd``
-network storage driver in Qemu/KVM, where the host machine uses
-``librbd`` to provide a block device service to the guest.
-
-Alternatively, as no direct ``librbd`` support is available in Xen,
-the Linux kernel can act as the RBD client and provide a real block
-device on the host machine, that can then be accessed by the
-virtualization. This is done with the command-line tool ``rbd`` (see
-:doc:`/ops/rbd`).
-
-The latter is also useful in non-virtualized scenarios.
-
-Internally, RBD stripes the device image over multiple RADOS objects,
-each typically located on a separate ``ceph-osd``, allowing it to perform
-better than a single server could.
-
-
-Client
-======
-
-.. todo:: cephfs, ceph-fuse, librados, libcephfs, librbd
-
-.. todo:: Summarize how much Ceph trusts the client, for what parts (security vs reliability).
-TODO
-====
+CephFS
+------
-.. todo:: Example scenarios Ceph projects are/not suitable for
+.. todo:: cephfs, ceph-fuse
\ No newline at end of file