From 81feae12e14999a7d62f4f06b4d56d86915bbe30 Mon Sep 17 00:00:00 2001
From: Tommi Virtanen
Date: Wed, 14 Dec 2011 10:47:38 -0800
Subject: [PATCH] doc: Add misc explanations of Ceph internals from email.

Signed-off-by: Tommi Virtanen
---
 doc/dev/delayed-delete.rst              | 12 ++++
 doc/dev/documenting.rst                 | 62 +++++++++++++++++
 doc/dev/filestore-filesystem-compat.rst | 77 +++++++++++++++++++++
 doc/dev/osd-class-path.rst              | 16 +++++
 doc/dev/placement-group.rst             | 90 +++++++++++++++++++++++++
 5 files changed, 257 insertions(+)
 create mode 100644 doc/dev/delayed-delete.rst
 create mode 100644 doc/dev/documenting.rst
 create mode 100644 doc/dev/filestore-filesystem-compat.rst
 create mode 100644 doc/dev/osd-class-path.rst
 create mode 100644 doc/dev/placement-group.rst

diff --git a/doc/dev/delayed-delete.rst b/doc/dev/delayed-delete.rst
new file mode 100644
index 0000000000000..bf5f65a460893
--- /dev/null
+++ b/doc/dev/delayed-delete.rst
@@ -0,0 +1,12 @@
+=========================
+ CephFS delayed deletion
+=========================
+
+When you delete a file, the data is not immediately removed. Each
+object in the file needs to be removed independently, and sending
+``size_of_file / stripe_size * replication_count`` messages would slow
+the client down too much, and use too much of the client's
+bandwidth. Additionally, snapshots may mean some objects should not be
+deleted.
+
+Instead, the file is marked as deleted on the MDS, and deleted lazily.
diff --git a/doc/dev/documenting.rst b/doc/dev/documenting.rst
new file mode 100644
index 0000000000000..870703f787fb3
--- /dev/null
+++ b/doc/dev/documenting.rst
@@ -0,0 +1,62 @@
+==================
+ Documenting Ceph
+==================
+
+Drawing diagrams
+================
+
+Graphviz
+--------
+
+You can use Graphviz_, as explained in the `Graphviz extension documentation`_.
+
+.. _Graphviz: http://graphviz.org/
+.. _`Graphviz extension documentation`: http://sphinx.pocoo.org/ext/graphviz.html
+
+.. graphviz::
+
+   digraph "example" {
+     foo -> bar;
+     bar -> baz;
+     bar -> thud;
+   }
+
+Most of the time, you'll want to put the actual DOT source in a
+separate file, like this::
+
+  .. graphviz:: myfile.dot
+
+
+Ditaa
+-----
+
+You can use Ditaa_:
+
+.. _Ditaa: http://ditaa.sourceforge.net/
+
+.. ditaa::
+
+   +--------------+   /=----\
+   | hello, world |-->| hi! |
+   +--------------+   \-----/
+
+
+Blockdiag
+---------
+
+If a use arises, we can integrate Blockdiag_. It is a Graphviz-style
+declarative language for drawing things, and includes:
+
+- `block diagrams`_: boxes and arrows (automatic layout, as opposed to
+  Ditaa_)
+- `sequence diagrams`_: timelines and messages between them
+- `activity diagrams`_: subsystems and activities in them
+- `network diagrams`_: hosts, LANs, IP addresses etc. (with `Cisco
+  icons`_ if wanted)
+
+.. _Blockdiag: http://blockdiag.com/
+.. _`Cisco icons`: http://pypi.python.org/pypi/blockdiagcontrib-cisco/
+.. _`block diagrams`: http://blockdiag.com/en/blockdiag/
+.. _`sequence diagrams`: http://blockdiag.com/en/seqdiag/index.html
+.. _`activity diagrams`: http://blockdiag.com/en/actdiag/index.html
+.. _`network diagrams`: http://blockdiag.com/en/nwdiag/
diff --git a/doc/dev/filestore-filesystem-compat.rst b/doc/dev/filestore-filesystem-compat.rst
new file mode 100644
index 0000000000000..622569245f103
--- /dev/null
+++ b/doc/dev/filestore-filesystem-compat.rst
@@ -0,0 +1,77 @@
+====================================
+ Filestore filesystem compatibility
+====================================
+
+http://marc.info/?l=ceph-devel&m=131942130322957&w=2
+
+Although running on ext4, xfs, or whatever other non-btrfs you want mostly
+works, there are a few important remaining issues:
+
+ext4 limits total xattrs to 4KB
+===============================
+
+This can cause problems in some cases, as Ceph uses xattrs
+extensively. Most of the time we don't hit this.
We do hit the limit with radosgw pretty easily, though, and may also
+hit it in exceptional cases where the OSD cluster is very unhealthy.
+
+There is a large xattr patch for ext4 from the Lustre folks that has
+been floating around for (I think) years. Maybe as interest grows in
+running Ceph on ext4 this can move upstream.
+
+Previously we were being forgiving about large setxattr failures on
+ext3, but we found that was leading to corruption in certain cases
+(because we couldn't set our internal metadata), so the next release
+will assert/crash in that case (fail-stop instead of
+fail-maybe-eventually-corrupt).
+
+XFS does not have an xattr size limit and thus does not have this
+problem.
+
+
+OSD journal replay of non-idempotent transactions
+=================================================
+
+**Resolved** with full sync but not ideal.
+See http://tracker.newdream.net/issues/213
+
+On non-btrfs backends, the Ceph OSDs use a write-ahead journal. After
+restart, the OSD does not know exactly which transactions in the
+journal may have already been committed to disk, and may reapply a
+transaction during replay. For most operations (write, delete,
+truncate) this is fine.
+
+Some operations, though, are non-idempotent. The simplest example is
+CLONE, which copies (efficiently, on btrfs) data from one object to
+another. If the source object is modified, the osd restarts, and then
+the clone is replayed, the target will get incorrect (newer) data. For
+example,
+
+- clone A -> B
+- modify A
+- osd restarts; journal replay re-applies clone A -> B
+
+B will get new instead of old contents.
+
+(This doesn't happen on btrfs because the snapshots allow us to replay
+from a known consistent point in time.)
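The clone/modify/replay sequence above can be sketched as a toy model. This is illustrative Python only, not OSD code: the object store and journal are plain Python containers, and the operation names are made up for the example.

```python
# Toy model of non-idempotent journal replay. The OSD journals each
# operation and applies it; after a crash it cannot tell which
# journaled operations already reached disk, so it replays all of them.

store = {"A": "old"}   # object name -> contents
journal = []           # write-ahead journal of operations

def apply_op(op):
    """Apply one journaled operation to the object store."""
    if op[0] == "clone":
        _, src, dst = op
        store[dst] = store[src]   # copy current contents of src
    elif op[0] == "write":
        _, name, data = op
        store[name] = data

# Normal execution: journal, then apply.
for op in [("clone", "A", "B"), ("write", "A", "new")]:
    journal.append(op)
    apply_op(op)

assert store["B"] == "old"   # correct: B was cloned before A changed

# Crash + restart: conservatively replay the whole journal.
for op in journal:
    apply_op(op)

# The replayed clone copies the already-modified A, so B now holds
# the newer, incorrect contents.
print(store["B"])   # prints "new"
```

Replaying the ``write`` a second time is harmless (idempotent); replaying the ``clone`` is what corrupts B, which is exactly why the options below single out non-idempotent operations.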
+
+Possibilities:
+
+- full sync after any non-idempotent operation
+- re-evaluate the lower level interface based on needs from higher
+  levels, construct only safe operations, be very careful; brittle
+- use xattrs to add sequence numbers to objects:
+
+  - on non-btrfs, we set an xattr on every modified object with the
+    op_seq, the unique sequence number for the transaction.
+  - for any (potentially) non-idempotent operation, we fsync() before
+    continuing to the next transaction, to ensure that xattr hits disk.
+  - on replay, we skip a transaction if the xattr indicates we already
+    performed this transaction.
+
+  Because every 'transaction' modifies only a single object (file),
+  this ought to work. It'll make things like clone slow, but let's
+  face it: they're already slow on non-btrfs file systems because they
+  actually copy the data (instead of duplicating the extent refs in
+  btrfs). And it should make the full ObjectStore interface safe,
+  without upper layers having to worry about the kinds and orders of
+  transactions they perform.
diff --git a/doc/dev/osd-class-path.rst b/doc/dev/osd-class-path.rst
new file mode 100644
index 0000000000000..10d8f73f856b1
--- /dev/null
+++ b/doc/dev/osd-class-path.rst
@@ -0,0 +1,16 @@
+=======================
+ OSD class path issues
+=======================
+
+::
+
+  2011-12-05 17:41:00.994075 7ffe8b5c3760 librbd: failed to assign a block name for image
+  create error: error 5: Input/output error
+
+This usually happens because your osds can't find ``cls_rbd.so``. They
+search for it in ``osd_class_dir``, which may not be set correctly by
+default (http://tracker.newdream.net/issues/1722).
+
+Most likely it's looking in ``/usr/lib/rados-classes`` instead of
+``/usr/lib64/rados-classes`` - change ``osd_class_dir`` in your
+``ceph.conf`` and restart the osds to fix it.
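Concretely, the fix amounts to one setting in the ``[osd]`` section of ``ceph.conf``; the path shown is the one from the example above, and should be whatever directory actually holds ``cls_rbd.so`` on your system::

  [osd]
          osd class dir = /usr/lib64/rados-classes

After editing, restart the osds for the setting to take effect.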
diff --git a/doc/dev/placement-group.rst b/doc/dev/placement-group.rst
new file mode 100644
index 0000000000000..5755277bcc7a3
--- /dev/null
+++ b/doc/dev/placement-group.rst
@@ -0,0 +1,90 @@
+============================
+ PG (Placement Group) notes
+============================
+
+Miscellaneous copy-pastes from emails; when this gets cleaned up it
+should move out of /dev.
+
+Overview
+========
+
+PG = "placement group". When placing data in the cluster, objects are
+mapped into PGs, and those PGs are mapped onto OSDs. We use this
+indirection so that we can group objects, which reduces the amount of
+per-object metadata we need to keep track of and processes we need to
+run (it would be prohibitively expensive to track e.g. the placement
+history on a per-object basis). Increasing the number of PGs can
+reduce the variance in per-OSD load across your cluster, but each PG
+requires a bit more CPU and memory on the OSDs that are storing it. We
+try and ballpark it at 100 PGs/OSD, although it can vary widely
+without ill effects depending on your cluster. (The original emailer
+hit a bug in how we calculate the initial PG number from a cluster
+description.)
+
+There are a couple of different categories of PGs; the 6 that exist
+(in the original emailer's ``ceph -s`` output) are "local" PGs which
+are tied to a specific OSD. However, those aren't actually used in a
+standard Ceph configuration.
+
+
+Mapping algorithm (simplified)
+==============================
+
+| > How does the Object->PG mapping look like, do you map more than one object on
+| > one PG, or do you sometimes map an object to more than one PG? How about the
+| > mapping of PGs to OSDs, does one PG belong to exactly one OSD?
+| >
+| > Does one PG represent a fixed amount of storage space?
+
+Many objects map to one PG.
+
+Each object maps to exactly one PG.
+
+One PG maps to a single list of OSDs, where the first one in the list
+is the primary and the rest are replicas.
+
+Many PGs can map to one OSD.
+
+A PG represents nothing but a grouping of objects; you configure the
+number of PGs you want (see
+http://ceph.newdream.net/wiki/Changing_the_number_of_PGs ), number of
+OSDs * 100 is a good starting point, and all of your stored objects
+are pseudo-randomly evenly distributed to the PGs. So a PG explicitly
+does NOT represent a fixed amount of storage; it represents
+1/pg_num'th of the storage you happen to have on your OSDs.
+
+Ignoring the finer points of CRUSH and custom placement, it goes
+something like this in pseudocode::
+
+    locator = object_name
+    obj_hash = hash(locator)
+    pg = obj_hash % num_pg
+    osds_for_pg = crush(pg)  # returns a list of osds
+    primary = osds_for_pg[0]
+    replicas = osds_for_pg[1:]
+
+If you want to understand the crush() part in the above, imagine a
+perfectly spherical datacenter in a vacuum ;) that is, if all OSDs
+have weight 1.0, and there is no topology to the data center (all OSDs
+are on the top level), and you use defaults, etc, it simplifies to
+consistent hashing; you can think of it as::
+
+    def crush(pg):
+        all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
+        result = []
+        # size is the number of copies; primary+replicas
+        attempt = 0
+        while len(result) < size:
+            # hash the pg together with the attempt number, so a
+            # collision doesn't retry the same osd forever
+            r = hash((pg, attempt))
+            attempt += 1
+            chosen = all_osds[r % len(all_osds)]
+            if chosen in result:
+                # osd can be picked only once
+                continue
+            result.append(chosen)
+        return result
+
+
+PG status refreshes only when the PG mapping changes
+====================================================
+
+The PG status currently doesn't get refreshed when the actual PG
+mapping doesn't change; e.g. a pool size change of 2->1 doesn't change
+the mapping, so it won't trigger a refresh. Restarting the OSDs will,
+though.
-- 
2.39.5