From 81feae12e14999a7d62f4f06b4d56d86915bbe30 Mon Sep 17 00:00:00 2001
From: Tommi Virtanen
Date: Wed, 14 Dec 2011 10:47:38 -0800
Subject: [PATCH] doc: Add misc explanations of Ceph internals from email.

Signed-off-by: Tommi Virtanen
---
 doc/dev/delayed-delete.rst              | 12 ++++
 doc/dev/documenting.rst                 | 62 +++++++++++++++++
 doc/dev/filestore-filesystem-compat.rst | 77 +++++++++++++++++++++
 doc/dev/osd-class-path.rst              | 16 +++++
 doc/dev/placement-group.rst             | 90 +++++++++++++++++++++++++
 5 files changed, 257 insertions(+)
 create mode 100644 doc/dev/delayed-delete.rst
 create mode 100644 doc/dev/documenting.rst
 create mode 100644 doc/dev/filestore-filesystem-compat.rst
 create mode 100644 doc/dev/osd-class-path.rst
 create mode 100644 doc/dev/placement-group.rst

diff --git a/doc/dev/delayed-delete.rst b/doc/dev/delayed-delete.rst
new file mode 100644
index 0000000000000..bf5f65a460893
--- /dev/null
+++ b/doc/dev/delayed-delete.rst
@@ -0,0 +1,12 @@
+=========================
+ CephFS delayed deletion
+=========================
+
+When you delete a file, the data is not immediately removed. Each
+object in the file needs to be removed independently, and sending
+``size_of_file / stripe_size * replication_count`` messages would slow
+the client down too much, and use too much of the client's
+bandwidth. Additionally, snapshots may mean some objects should not be
+deleted.
+
+Instead, the file is marked as deleted on the MDS, and deleted lazily.
diff --git a/doc/dev/documenting.rst b/doc/dev/documenting.rst
new file mode 100644
index 0000000000000..870703f787fb3
--- /dev/null
+++ b/doc/dev/documenting.rst
@@ -0,0 +1,62 @@
+==================
+ Documenting Ceph
+==================
+
+Drawing diagrams
+================
+
+Graphviz
+--------
+
+You can use Graphviz_, as explained in the `Graphviz extension documentation`_.
+
+.. _Graphviz: http://graphviz.org/
+.. _`Graphviz extension documentation`: http://sphinx.pocoo.org/ext/graphviz.html
+
+.. graphviz::
+
+   digraph "example" {
+     foo -> bar;
+     bar -> baz;
+     bar -> thud;
+   }
+
+Most of the time, you'll want to put the actual DOT source in a
+separate file, like this::
+
+  .. graphviz:: myfile.dot
+
+
+Ditaa
+-----
+
+You can use Ditaa_:
+
+.. _Ditaa: http://ditaa.sourceforge.net/
+
+.. ditaa::
+
+   +--------------+   /=----\
+   | hello, world |-->| hi! |
+   +--------------+   \-----/
+
+
+Blockdiag
+---------
+
+If a use arises, we can integrate Blockdiag_. It is a Graphviz-style
+declarative language for drawing things, and includes:
+
+- `block diagrams`_: boxes and arrows (automatic layout, as opposed to
+  Ditaa_)
+- `sequence diagrams`_: timelines and messages between them
+- `activity diagrams`_: subsystems and activities in them
+- `network diagrams`_: hosts, LANs, IP addresses etc. (with `Cisco
+  icons`_ if wanted)
+
+.. _Blockdiag: http://blockdiag.com/
+.. _`Cisco icons`: http://pypi.python.org/pypi/blockdiagcontrib-cisco/
+.. _`block diagrams`: http://blockdiag.com/en/blockdiag/
+.. _`sequence diagrams`: http://blockdiag.com/en/seqdiag/index.html
+.. _`activity diagrams`: http://blockdiag.com/en/actdiag/index.html
+.. _`network diagrams`: http://blockdiag.com/en/nwdiag/
diff --git a/doc/dev/filestore-filesystem-compat.rst b/doc/dev/filestore-filesystem-compat.rst
new file mode 100644
index 0000000000000..622569245f103
--- /dev/null
+++ b/doc/dev/filestore-filesystem-compat.rst
@@ -0,0 +1,77 @@
+====================================
+ Filestore filesystem compatibility
+====================================
+
+http://marc.info/?l=ceph-devel&m=131942130322957&w=2
+
+Although running on ext4, xfs, or whatever other non-btrfs you want mostly
+works, there are a few important remaining issues:
+
+ext4 limits total xattrs to 4KB
+===============================
+
+This can cause problems in some cases, as Ceph uses xattrs
+extensively. Most of the time we don't hit this.
We do hit the limit with radosgw pretty easily, though, and may also
+hit it in exceptional cases where the OSD cluster is very unhealthy.
+
+There is a large xattr patch for ext4 from the Lustre folks that has
+been floating around for (I think) years. Maybe as interest grows in
+running Ceph on ext4 this can move upstream.
+
+Previously we were being forgiving about large setxattr failures on
+ext3, but we found that was leading to corruption in certain cases
+(because we couldn't set our internal metadata), so the next release
+will assert/crash in that case (fail-stop instead of
+fail-maybe-eventually-corrupt).
+
+XFS does not have an xattr size limit and thus does not have this
+problem.
+
+
+OSD journal replay of non-idempotent transactions
+=================================================
+
+**Resolved** with full sync but not ideal.
+See http://tracker.newdream.net/issues/213
+
+On non-btrfs backends, the Ceph OSDs use a write-ahead journal. After
+restart, the OSD does not know exactly which transactions in the
+journal may have already been committed to disk, and may reapply a
+transaction during replay. For most operations (write, delete,
+truncate) this is fine.
+
+Some operations, though, are non-idempotent. The simplest example is
+CLONE, which copies (efficiently, on btrfs) data from one object to
+another. If the source object is modified, the osd restarts, and then
+the clone is replayed, the target will get incorrect (newer) data. For
+example,
+
+- clone A -> B
+- modify A
+- osd restarts; journal replay re-applies clone A -> B
+
+B will get new instead of old contents.
+
+(This doesn't happen on btrfs because the snapshots allow us to replay
+from a known consistent point in time.)
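The clone/modify/replay sequence above can be sketched as a toy model. This is illustrative Python only, not OSD code: the object store and journal are plain Python containers, and the operation names are made up for the example.

```python
# Toy model of non-idempotent journal replay. The OSD journals each
# operation and applies it; after a crash it cannot tell which
# journaled operations already reached disk, so it replays all of them.

store = {"A": "old"}   # object name -> contents
journal = []           # write-ahead journal of operations

def apply_op(op):
    """Apply one journaled operation to the object store."""
    if op[0] == "clone":
        _, src, dst = op
        store[dst] = store[src]   # copy current contents of src
    elif op[0] == "write":
        _, name, data = op
        store[name] = data

# Normal execution: journal, then apply.
for op in [("clone", "A", "B"), ("write", "A", "new")]:
    journal.append(op)
    apply_op(op)

assert store["B"] == "old"   # correct: B was cloned before A changed

# Crash + restart: conservatively replay the whole journal.
for op in journal:
    apply_op(op)

# The replayed clone copies the already-modified A, so B now holds
# the newer, incorrect contents.
print(store["B"])   # prints "new"
```

Replaying the ``write`` a second time is harmless (idempotent); replaying the ``clone`` is what corrupts B, which is exactly why the options below single out non-idempotent operations.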
+
+Possibilities:
+
+- full sync after any non-idempotent operation
+- re-evaluate the lower level interface based on needs from higher
+  levels, construct only safe operations, be very careful; brittle
+- use xattrs to add sequence numbers to objects:
+
+  - on non-btrfs, we set an xattr on every modified object with the
+    op_seq, the unique sequence number for the transaction.
+  - for any (potentially) non-idempotent operation, we fsync() before
+    continuing to the next transaction, to ensure that xattr hits disk.
+  - on replay, we skip a transaction if the xattr indicates we already
+    performed this transaction.
+
+  Because every 'transaction' modifies only a single object (file),
+  this ought to work. It'll make things like clone slow, but let's
+  face it: they're already slow on non-btrfs file systems because they
+  actually copy the data (instead of duplicating the extent refs in
+  btrfs). And it should make the full ObjectStore interface safe,
+  without upper layers having to worry about the kinds and orders of
+  transactions they perform.
diff --git a/doc/dev/osd-class-path.rst b/doc/dev/osd-class-path.rst
new file mode 100644
index 0000000000000..10d8f73f856b1
--- /dev/null
+++ b/doc/dev/osd-class-path.rst
@@ -0,0 +1,16 @@
+=======================
+ OSD class path issues
+=======================
+
+::
+
+  2011-12-05 17:41:00.994075 7ffe8b5c3760 librbd: failed to assign a block name for image
+  create error: error 5: Input/output error
+
+This usually happens because your osds can't find ``cls_rbd.so``. They
+search for it in ``osd_class_dir``, which may not be set correctly by
+default (http://tracker.newdream.net/issues/1722).
+
+Most likely it's looking in ``/usr/lib/rados-classes`` instead of
+``/usr/lib64/rados-classes`` - change ``osd_class_dir`` in your
+``ceph.conf`` and restart the osds to fix it.
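Concretely, the fix amounts to one setting in the ``[osd]`` section of ``ceph.conf``; the path shown is the one from the example above, and should be whatever directory actually holds ``cls_rbd.so`` on your system::

  [osd]
          osd class dir = /usr/lib64/rados-classes

After editing, restart the osds for the setting to take effect.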
diff --git a/doc/dev/placement-group.rst b/doc/dev/placement-group.rst
new file mode 100644
index 0000000000000..5755277bcc7a3
--- /dev/null
+++ b/doc/dev/placement-group.rst
@@ -0,0 +1,90 @@
+============================
+ PG (Placement Group) notes
+============================
+
+Miscellaneous copy-pastes from emails; when this gets cleaned up it
+should move out of /dev.
+
+Overview
+========
+
+PG = "placement group". When placing data in the cluster, objects are
+mapped into PGs, and those PGs are mapped onto OSDs. We use this
+indirection so that we can group objects, which reduces the amount of
+per-object metadata we need to keep track of and processes we need to
+run (it would be prohibitively expensive to track e.g. the placement
+history on a per-object basis). Increasing the number of PGs can
+reduce the variance in per-OSD load across your cluster, but each PG
+requires a bit more CPU and memory on the OSDs that are storing it. We
+try and ballpark it at 100 PGs/OSD, although it can vary widely
+without ill effects depending on your cluster. (The original emailer
+hit a bug in how we calculate the initial PG number from a cluster
+description.)
+
+There are a couple of different categories of PGs; the 6 that exist
+(in the original emailer's ``ceph -s`` output) are "local" PGs which
+are tied to a specific OSD. However, those aren't actually used in a
+standard Ceph configuration.
+
+
+Mapping algorithm (simplified)
+==============================
+
+| > How does the Object->PG mapping look like, do you map more than one object on
+| > one PG, or do you sometimes map an object to more than one PG? How about the
+| > mapping of PGs to OSDs, does one PG belong to exactly one OSD?
+| >
+| > Does one PG represent a fixed amount of storage space?
+
+Many objects map to one PG.
+
+Each object maps to exactly one PG.
+
+One PG maps to a single list of OSDs, where the first one in the list
+is the primary and the rest are replicas.
+
+Many PGs can map to one OSD.
+
+A PG represents nothing but a grouping of objects; you configure the
+number of PGs you want (see
+http://ceph.newdream.net/wiki/Changing_the_number_of_PGs ), number of
+OSDs * 100 is a good starting point, and all of your stored objects
+are pseudo-randomly evenly distributed to the PGs. So a PG explicitly
+does NOT represent a fixed amount of storage; it represents
+1/pg_num'th of the storage you happen to have on your OSDs.
+
+Ignoring the finer points of CRUSH and custom placement, it goes
+something like this in pseudocode::
+
+    locator = object_name
+    obj_hash = hash(locator)
+    pg = obj_hash % num_pg
+    osds_for_pg = crush(pg)  # returns a list of osds
+    primary = osds_for_pg[0]
+    replicas = osds_for_pg[1:]
+
+If you want to understand the crush() part in the above, imagine a
+perfectly spherical datacenter in a vacuum ;) that is, if all OSDs
+have weight 1.0, and there is no topology to the data center (all OSDs
+are on the top level), and you use defaults, etc, it simplifies to
+consistent hashing; you can think of it as::
+
+    def crush(pg):
+        all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
+        result = []
+        # size is the number of copies; primary+replicas
+        attempt = 0
+        while len(result) < size:
+            # hash the pg together with the attempt number, so a
+            # collision doesn't retry the same osd forever
+            r = hash((pg, attempt))
+            attempt += 1
+            chosen = all_osds[r % len(all_osds)]
+            if chosen in result:
+                # osd can be picked only once
+                continue
+            result.append(chosen)
+        return result
+
+
+PG status refreshes only when the PG mapping changes
+====================================================
+
+The PG status currently doesn't get refreshed when the actual PG
+mapping doesn't change; e.g. a pool size change of 2->1 doesn't change
+the mapping, so it won't trigger a refresh. Restarting the OSDs will,
+though.
-- 
2.39.5