doc: Architecture, placeholder in install, and first appendix.

author Tommi Virtanen <tommi.virtanen@dreamhost.com>

Thu, 1 Sep 2011 20:24:06 +0000 (13:24 -0700)

committer Tommi Virtanen <tommi.virtanen@dreamhost.com>

Thu, 1 Sep 2011 20:28:12 +0000 (13:28 -0700)
diff --git a/doc/appendix/differences-from-posix.rst b/doc/appendix/differences-from-posix.rst
new file mode 100644
index 0000000..f327e78
--- /dev/null
+++ b/doc/appendix/differences-from-posix.rst
@@ -0,0 +1,17 @@
+========================
+ Differences from POSIX
+========================
+
+.. todo:: delete http://ceph.newdream.net/wiki/Differences_from_POSIX
+
+Ceph does have a few places where it diverges from strict POSIX semantics for various reasons:
+
+- Sparse files are reported incorrectly by tools like df. They only
+  use up the required space, but df will increase the "used" space
+  by the full file size. We do this because actually keeping track of
+  the space a large, sparse file uses is very expensive.
+- In shared simultaneous writer situations, a write that crosses
+  object boundaries is not necessarily atomic. This means that you
+  could have writer A write "aa|aa" and writer B write "bb|bb"
+  simultaneously (where | is the object boundary), and end up with
+  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
diff --git a/doc/appendix/index.rst b/doc/appendix/index.rst
new file mode 100644
index 0000000..a98bf89
--- /dev/null
+++ b/doc/appendix/index.rst
@@ -0,0 +1,10 @@
+============
+ Appendices
+============
+
+.. toctree::
+   :glob:
+   :numbered:
+   :titlesonly:
+
+   *
diff --git a/doc/architecture.rst b/doc/architecture.rst
index cf67f6be5051f6ac93db01bcc35beeb9341c3195..3afbe6bcc8b885a0de502168ed2c9bf51ad79c85 100644
--- a/doc/architecture.rst
+++ b/doc/architecture.rst
@@ -2,26 +2,173 @@
   Architecture of Ceph
  ======================
  
-- Introduction to Ceph Project
+Ceph is a distributed network storage and file system with distributed
+metadata management and POSIX semantics.
  
-  - High-level overview of project benefits for users (few paragraphs, mention each subproject)
-  - Introduction to sub-projects (few paragraphs to a page each)
+RADOS is a reliable object store, used by Ceph, but also directly
+accessible.
  
-    - RADOS
-    - RGW
-    - RBD
-    - Ceph
+``radosgw`` is an S3-compatible RESTful HTTP service for object
+storage, using RADOS storage.
  
-  - Example scenarios Ceph projects are/not suitable for
-  - (Very) High-Level overview of Ceph
+RBD is a Linux kernel feature that exposes RADOS storage as a block
+device. Qemu/KVM also has a direct RBD client that avoids the kernel
+overhead.
  
-    This would include an introduction to basic project terminology,
-    the concept of OSDs, MDSes, and Monitors, and things like
-    that. What they do, some of why they're awesome, but not how they
-    work.
  
-- Discussion of MDS terminology, daemon types (active, standby,
-  standby-replay)
+Monitor cluster
+===============
  
+``cmon`` is a lightweight daemon that provides consensus for
+distributed decision-making in a Ceph/RADOS cluster.
  
-.. todo:: write me
+It also is the initial point of contact for new clients, and will hand
+out information about the topology of the cluster, such as the
+``osdmap``.
+
+You normally run 3 ``cmon`` daemons, on 3 separate physical machines,
+isolated from each other; for example, in different racks or rows.
+
+You could run just 1 instance, but that means giving up on high
+availability.
+
+You may use the same hosts for ``cmon`` and other purposes.
+
+``cmon`` processes talk to each other using a Paxos_\-style
+protocol. They discover each other via the ``[mon.X] mon addr`` fields
+in ``ceph.conf``.
+
+.. todo:: What about ``monmap``? Fact check.
+
+Any decision requires the majority of the ``cmon`` processes to be
+healthy and communicating with each other. For this reason, you never
+want an even number of ``cmon``\s; an even count tolerates no more
+failures than the next lower odd count.
+
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_algorithm
+
+.. todo:: explain monmap
+
+
+RADOS
+=====
+
+``cosd`` is the storage daemon that provides the RADOS service. It
+uses ``cmon`` for cluster membership, services object read/write/etc
+requests from clients, and peers with other ``cosd``\s for data
+replication.
+
+The data model is fairly simple on this level. There are multiple
+named pools, and within each pool there are named objects, in a flat
+namespace (no directories). Each object has both data and metadata.
+
+The data for an object is a single, potentially big, series of
+bytes. Additionally, the series may be sparse: it may have holes that
+contain binary zeros and take up no actual storage.
+
+The metadata is an unordered set of key-value pairs. Its semantics
+are completely up to the client; for example, the Ceph filesystem uses
+metadata to store file owner etc.
+
+.. todo:: Verify that metadata is unordered.
+
+Underneath, ``cosd`` stores the data on a local filesystem. We
+recommend using Btrfs_, but any POSIX filesystem that has extended
+attributes should work (see :ref:`xattr`).
+
+.. _Btrfs: http://en.wikipedia.org/wiki/Btrfs
+
+.. todo:: write about access control
+
+.. todo:: explain osdmap
+
+.. todo:: explain plugins ("classes")
+
+
+Ceph filesystem
+===============
+
+The Ceph filesystem service is provided by a daemon called
+``cmds``. It uses RADOS to store all the filesystem metadata
+(directories, file ownership, access modes, etc), and directs clients
+to access RADOS directly for the file contents.
+
+The Ceph filesystem aims for POSIX compatibility, except for a few
+chosen differences. See :doc:`/appendix/differences-from-posix`.
+
+``cmds`` can run as a single process, or it can be distributed out to
+multiple physical machines, either for high availability or for
+scalability.
+
+For high availability, the extra ``cmds`` instances can be `standby`,
+ready to take over the duties of any failed ``cmds`` that was
+`active`. This is easy because all the data, including the journal, is
+stored on RADOS. The transition is triggered automatically by
+``cmon``.
+
+For scalability, multiple ``cmds`` instances can be `active`, and they
+will split the directory tree into subtrees (and shards of a single
+busy directory), effectively balancing the load amongst all `active`
+servers.
+
+Combinations of `standby` and `active` etc are possible, for example
+running 3 `active` ``cmds`` instances for scaling, and one `standby`.
+
+To control the number of `active` ``cmds``\es, see :doc:`/ops/grow/mds`.
+
+.. topic:: Status as of 2011-09:
+
+   Multiple `active` ``cmds`` operation is stable under normal
+   circumstances, but some failure scenarios may still cause
+   operational issues.
+
+.. todo:: document `standby-replay`
+
+.. todo:: mds.0 vs mds.alpha etc details
+
+
+
+``radosgw``
+===========
+
+``radosgw`` is a FastCGI service that provides a RESTful_ HTTP API to
+store objects and metadata. It layers on top of RADOS with its own
+data formats, and maintains its own user database, authentication,
+access control, and so on.
+
+.. _RESTful: http://en.wikipedia.org/wiki/RESTful
+
+
+Rados Block Device (RBD)
+========================
+
+In virtual machine scenarios, RBD is typically used via the ``rbd``
+network storage driver in Qemu/KVM, where the host machine uses
+``librbd`` to provide a block device service to the guest.
+
+Alternatively, as no direct ``librbd`` support is available in Xen,
+the Linux kernel can act as the RBD client and provide a real block
+device on the host machine, that can then be accessed by the
+virtualization. This is done with the command-line tool ``rbd`` (see
+:doc:`/ops/rbd`).
+
+The latter is also useful in non-virtualized scenarios.
+
+Internally, RBD stripes the device image over multiple RADOS objects,
+each typically located on a separate ``cosd``, allowing it to perform
+better than a single server could.
+
+
+Client
+======
+
+.. todo:: cephfs, cfuse, librados, libceph, librbd
+
+
+.. todo:: Summarize how much Ceph trusts the client, for what parts (security vs reliability).
+
+
+TODO
+====
+
+.. todo:: Example scenarios Ceph projects are/not suitable for
diff --git a/doc/index.rst b/doc/index.rst
index 195c74993bb542be2e5e6250a41bd48c38356b1d..d3ad0c669ec9566ca87917d991d46fc3bbffe527 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -94,6 +94,7 @@ Table of Contents
     man/index
     papers
     glossary
+   appendix/index
  
  
  Indices and tables
diff --git a/doc/ops/install.rst b/doc/ops/install.rst
index 692ac926bec3b3923da15b4b6ae60d685d555b55..1ffca6f4417af52855618ef22c4bece535882680 100644
--- a/doc/ops/install.rst
+++ b/doc/ops/install.rst
@@ -12,3 +12,22 @@ mentioning all the design tradeoffs and options like journaling
  locations or filesystems
  
  At this point, either use 1 or 3 mons, point to :doc:`grow/mon`
+
+OSD installation
+================
+
+btrfs
+-----
+
+what does btrfs give you (the journaling thing)
+
+
+ext4/ext3
+---------
+
+.. _xattr:
+
+Enabling extended attributes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+how to enable xattr on ext4/3
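
Note on the "Monitor cluster" section of doc/architecture.rst: the majority requirement for ``cmon`` can be illustrated with a little arithmetic. The sketch below is not Ceph code; the function names are hypothetical and it only assumes that a decision needs a strict majority of the configured monitors, as the text states.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors may fail while a strict majority still remains."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print the majority size and failure tolerance for small clusters.
    for n in range(1, 6):
        print(f"{n} monitors: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under that assumption, going from 3 to 4 monitors raises the quorum size without raising the number of failures the cluster survives, which is one way to read the advice to use 1 or 3 monitors.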
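
Note on doc/appendix/differences-from-posix.rst: the "aa|bb" outcome for writes that cross an object boundary can be reproduced with a toy model. This is only a sketch, not how Ceph is implemented: ``OBJECT_SIZE``, ``split_write`` and ``apply_piece`` are hypothetical names, the 2-byte object size is chosen purely to match the example in the text, and the model assumes each per-object piece is applied atomically while the pieces of one write are not applied together.

```python
from typing import Dict, List, Tuple

# Hypothetical, tiny object size so that "aaaa" splits as "aa|aa".
OBJECT_SIZE = 2

Piece = Tuple[int, int, bytes]  # (object index, offset within object, data)


def split_write(offset: int, data: bytes) -> List[Piece]:
    """Split one logical write into per-object pieces."""
    pieces: List[Piece] = []
    pos = 0
    while pos < len(data):
        absolute = offset + pos
        index = absolute // OBJECT_SIZE
        within = absolute % OBJECT_SIZE
        take = min(OBJECT_SIZE - within, len(data) - pos)
        pieces.append((index, within, data[pos:pos + take]))
        pos += take
    return pieces


def apply_piece(store: Dict[int, bytes], piece: Piece) -> None:
    """Apply a single piece to its object; the last writer wins."""
    index, within, chunk = piece
    buf = bytearray(store.get(index, b"\0" * OBJECT_SIZE))
    buf[within:within + len(chunk)] = chunk
    store[index] = bytes(buf)


# Writers A and B both write four bytes at offset 0.
a = split_write(0, b"aaaa")
b = split_write(0, b"bbbb")

# One possible interleaving: A lands last on object 0, B lands last on object 1.
store: Dict[int, bytes] = {}
for piece in (b[0], a[0], a[1], b[1]):
    apply_piece(store, piece)

result = "|".join(store[i].decode() for i in sorted(store))
print(result)
```

In this model each object ends up internally consistent, but the file as a whole mixes the two writers, which is the non-atomic result the appendix describes.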