From: Radosław Zarzyński Date: Mon, 5 Jun 2023 14:04:36 +0000 (+0200) Subject: doc: improve doc/dev/encoding.rst X-Git-Tag: v17.2.7~234^2 X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=313a996dad6a5382ea8fe6db476659af1d613382;p=ceph.git doc: improve doc/dev/encoding.rst Signed-off-by: Radosław Zarzyński (cherry picked from commit 83bef6668d319685923df927f32590b4e1795746) --- diff --git a/doc/dev/encoding.rst b/doc/dev/encoding.rst index 81b52500e63b4..f065d64bd1a90 100644 --- a/doc/dev/encoding.rst +++ b/doc/dev/encoding.rst @@ -3,9 +3,74 @@ Serialization (encode/decode) ============================= When a structure is sent over the network or written to disk, it is -encoded into a string of bytes. Serializable structures have -``encode`` and ``decode`` methods that write and read from ``bufferlist`` -objects representing byte strings. +encoded into a string of bytes. Usually (but not always -- multiple +serialization facilities coexist in Ceph) serializable structures +have ``encode`` and ``decode`` methods that write and read from +``bufferlist`` objects representing byte strings. + +Terminology +----------- +It is best to think not in the domain of daemons and clients but +encoders and decoders. An encoder serializes a structure into a bufferlist +while a decoder does the opposite. + +Encoders and decoders can be referred collectively as dencoders. + +Dencoders (both encoders and docoders) live within daemons and clients. +For instance, when an RBD client issues an IO operation, it prepares +an instance of the ``MOSDOp`` structure and encodes it into a bufferlist +that is put on the wire. +An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance. +Here encoder was used by the client while decoder by the OSD. However, +these roles can swing -- just imagine handling of the response: OSD encodes +the ``MOSDOpReply`` while RBD clients decode. + +Encoder and decoder operate accordingly to a format which is defined +by a programmer by implementing the ``encode`` and ``decode`` methods. + +Principles for format change +---------------------------- +It is not unusual that the format of serialization changes. This +process requires careful attention from during both development +and review. + +The general rule is that a decoder must understand what had been +encoded by an encoder. Most of the problems come from ensuring +that compatibility continues between old decoders and new encoders +as well as new decoders and old decoders. One should assume +that -- if not otherwise derogated -- any mix (old/new) is +possible in a cluster. There are 2 main reasons for that: + +1. Upgrades. Although there are recommendations related to the order + of entity types (mons/osds/clients), it is not mandatory and + no assumption should be made about it. +2. Huge variability of client versions. It was always the case + that kernel (and thus kernel clients) upgrades are decoupled + from Ceph upgrades. Moreover, proliferation of containerization + bring the variability even to e.g. ``librbd`` -- now user space + libraries live on the container own. + +With this being said, there are few rules limiting the degree +of interoperability between dencoders: + +* ``n-2`` for dencoding between daemons, +* ``n-3`` hard requirement for client-involved scenarios, +* ``n-3..`` soft requirements for clinet-involved scenarios. Ideally + every client should be able to talk any version of daemons. + +As the underlying reasons are the same, the rules dencoders +follow are virtually the same as for deprecations of our features +bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``. + +Frameworks +---------- +Currently multiple genres of dencoding helpers co-exist. + +* encoding.h (the most proliferated one), +* denc.h (performance optimized, seen mostly in ``BlueStore``), +* the `Message` hierarchy. + +Although details vary, the interoperability rules stay the same. Adding a field to a structure ----------------------------- @@ -93,3 +158,69 @@ because we might still be passed older-versioned messages that do not have the field. The ``struct_v`` variable is a local set by the ``DECODE_START`` macro. +# Into the weeeds + +The append-extendability of our dencoders is a result of the forward +compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring. + +They are implementing extendibility facilities. An encoder, when filling +the bufferlist, prepends three fields: version of the current format, +minimal version of a decoder compatible with it and the total size of +all encoded fields. + +.. code-block:: cpp + + /** + * start encoding block + * + * @param v current (code) version of the encoding + * @param compat oldest code version that can decode it + * @param bl bufferlist to encode to + * + */ + #define ENCODE_START(v, compat, bl) \ + __u8 struct_v = v; \ + __u8 struct_compat = compat; \ + ceph_le32 struct_len; \ + auto filler = (bl).append_hole(sizeof(struct_v) + \ + sizeof(struct_compat) + sizeof(struct_len)); \ + const auto starting_bl_len = (bl).length(); \ + using ::ceph::encode; \ + do { + +The ``struct_len`` field allows the decoder to eat all the bytes that were +left undecoded in the user-provided ``decode`` implementation. +Analogically, decoders tracks how much input has been decoded in the +user-provided ``decode`` methods. + +.. code-block:: cpp + + #define DECODE_START(bl) \ + unsigned struct_end = 0; \ + __u32 struct_len; \ + decode(struct_len, bl); \ + ... \ + struct_end = bl.get_off() + struct_len; \ + } \ + do { + + +Decoder uses this information to discard the extra bytes it does not +understand. Advancing bufferlist is critical as dencoders tend to be nested; +just leaving it intact would work only for the very last ``deocde`` call +in a nested structure. + +.. code-block:: cpp + + #define DECODE_FINISH(bl) \ + } while (false); \ + if (struct_end) { \ + ... \ + if (bl.get_off() < struct_end) \ + bl += struct_end - bl.get_off(); \ + } + + +This entire, cooperative mechanism allows encoder (its further revisions) +to generate more byte stream (due to e.g. adding a new field at the end) +and not worry that the residue will crash older decoder revisions.