From 5eea038b71420f6e84aa481b845376f18c2fe3bc Mon Sep 17 00:00:00 2001 From: Ilya Dryomov Date: Tue, 12 May 2020 11:45:30 +0200 Subject: [PATCH] doc/dev/msgr2: fix inconsistencies and update for msgr2.1 Signed-off-by: Ilya Dryomov --- doc/dev/msgr2.rst | 384 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 333 insertions(+), 51 deletions(-) diff --git a/doc/dev/msgr2.rst b/doc/dev/msgr2.rst index 8dfdbcf565a..585dc34d208 100644 --- a/doc/dev/msgr2.rst +++ b/doc/dev/msgr2.rst @@ -1,7 +1,7 @@ .. _msgr2-protocol: -msgr2 protocol -============== +msgr2 protocol (msgr2.0 and msgr2.1) +==================================== This is a revision of the legacy Ceph on-wire protocol that was implemented by the SimpleMessenger. It addresses performance and @@ -20,7 +20,7 @@ This protocol revision has several goals relative to the original protocol: (e.g., padding) that keep computation and memory copies out of the fast path where possible. * *Signing*. We will allow for traffic to be signed (but not - necessarily encrypted). This may not be implemented in the initial version. + necessarily encrypted). This is not implemented. Definitions ----------- @@ -56,10 +56,19 @@ Banner Both the client and server, upon connecting, send a banner:: - "ceph %x %x\n", protocol_features_suppored, protocol_features_required + "ceph v2\n" + __le16 banner payload length + banner payload -The protocol features are a new, distinct namespace. Initially no -features are defined or required, so this will be "ceph 0 0\n". +A banner payload has the form:: + + __le64 peer_supported_features + __le64 peer_required_features + +This is a new, distinct feature bit namespace (CEPH_MSGR2_*). +Currently, only CEPH_MSGR2_FEATURE_REVISION_1 is defined. It is +supported but not required, so that msgr2.0 and msgr2.1 peers +can talk to each other. If the remote party advertises required features we don't support, we can disconnect. @@ -81,27 +90,150 @@ can disconnect. Frame format ------------ -All further data sent or received is contained by a frame. Each frame has -the form:: +After the banners are exchanged, all further communication happens +in frames. The exact format of the frame depends on the connection +mode (msgr2.0-crc, msgr2.0-secure, msgr2.1-crc or msgr2.1-secure). +All connections start in crc mode (either msgr2.0-crc or msgr2.1-crc, +depending on peer_supported_features from the banner). + +Each frame has a 32-byte preamble:: + + __u8 tag + __u8 number of segments + { + __le32 segment length + __le16 segment alignment + } * 4 + reserved (2 bytes) + __le32 preamble crc + +An empty frame has one empty segment. A non-empty frame can have +between one and four segments, all segments except the last may be +empty. + +If there are less than four segments, unused (trailing) segment +length and segment alignment fields are zeroed. + +The reserved bytes are zeroed. + +The preamble checksum is CRC32-C. It covers everything up to +itself (28 bytes) and is calculated and verified irrespective of +the connection mode (i.e. even if the frame is encrypted). + +### msgr2.0-crc mode + +A msgr2.0-crc frame has the form:: + + preamble (32 bytes) + { + segment payload + } * number of segments + epilogue (17 bytes) + +where epilogue is:: + + __u8 late_flags + { + __le32 segment crc + } * 4 + +late_flags is used for frame abortion. After transmitting the +preamble and the first segment, the sender can fill the remaining +segments with zeros and set a flag to indicate that the receiver must +drop the frame. This allows the sender to avoid extra buffering +when a frame that is being put on the wire is revoked (i.e. yanked +out of the messenger): payload buffers can be unpinned and handed +back to the user immediately, without making a copy or blocking +until the whole frame is transmitted. Currently this is used only +by the kernel client, see ceph_msg_revoke(). + +The segment checksum is CRC32-C. For "used" empty segments, it is +set to (__le32)-1. For unused (trailing) segments, it is zeroed. + +The crcs are calculated just to protect against bit errors. +No authenticity guarantees are provided, unlike in msgr1 which +attempted to provide some authenticity guarantee by optionally +signing segment lengths and crcs with the session key. + +Issues: + +1. As part of introducing a structure for a generic frame with + variable number of segments suitable for both control and + message frames, msgr2.0 moved the crc of the first segment of + the message frame (ceph_msg_header2) into the epilogue. + + As a result, ceph_msg_header2 can no longer be safely + interpreted before the whole frame is read off the wire. + This is a regression from msgr1, because in order to scatter + the payload directly into user-provided buffers and thus avoid + extra buffering and copying when receiving message frames, + ceph_msg_header2 must be available in advance -- it stores + the transaction id which the user buffers are keyed on. + The implementation has to choose between forgoing this + optimization or acting on an unverified segment. + +2. late_flags is not covered by any crc. Since it stores the + abort flag, a single bit flip can result in a completed frame + being dropped (causing the sender to hang waiting for a reply) + or, worse, in an aborted frame with garbage segment payloads + being dispatched. + + This was the case with msgr1 and got carried over to msgr2.0. + +### msgr2.1-crc mode + +Differences from msgr2.0-crc: + +1. The crc of the first segment is stored at the end of the + first segment, not in the epilogue. The epilogue stores up to + three crcs, not up to four. + + If the first segment is empty, (__le32)-1 crc is not generated. + +2. The epilogue is generated only if the frame has more than one + segment (i.e. at least one of second to fourth segments is not + empty). Rationale: If the frame has only one segment, it cannot + be aborted and there are no crcs to store in the epilogue. - frame_len (le32) - tag (TAG_* le32) - frame_header_checksum (le32) - payload - [payload padding -- only present after stream auth phase] - [signature -- only present after stream auth phase] +3. Unchecksummed late_flags is replaced with late_status which + builds in bit error detection by using a 4-bit nibble per flag + and two code words that are Hamming Distance = 4 apart (and not + all zeros or ones). This comes at the expense of having only + one reserved flag, of course. +Some example frames: -* The frame_header_checksum is over just the frame_len and tag values (8 bytes). +* A 0+0+0+0 frame (empty, no epilogue):: -* frame_len includes everything after the frame_len le32 up to the end of the - frame (all payloads, signatures, and padding). + preamble (32 bytes) -* The payload format and length is determined by the tag. +* A 20+0+0+0 frame (no epilogue):: -* The signature portion is only present if the authentication phase - has completed (TAG_AUTH_DONE has been sent) and signatures are - enabled. + preamble (32 bytes) + segment1 payload (20 bytes) + __le32 segment1 crc + +* A 0+70+0+0 frame:: + + preamble (32 bytes) + segment2 payload (70 bytes) + epilogue (13 bytes) + +* A 20+70+0+350 frame:: + + preamble (32 bytes) + segment1 payload (20 bytes) + __le32 segment1 crc + segment2 payload (70 bytes) + segment4 payload (350 bytes) + epilogue (13 bytes) + +where epilogue is:: + + __u8 late_status + { + __le32 segment crc + } * 3 Hello ----- @@ -204,47 +336,197 @@ authentication method as the first attempt: Post-auth frame format ---------------------- -The frame format is fixed (see above), but can take three different -forms, depending on the AUTH_DONE flags: +Depending on the negotiated connection mode from TAG_AUTH_DONE, the +connection either stays in crc mode or switches to the corresponding +secure mode (msgr2.0-secure or msgr2.1-secure). + +### msgr2.0-secure mode + +A msgr2.0-secure frame has the form:: + + { + preamble (32 bytes) + { + segment payload + zero padding (out to 16 bytes) + } * number of segments + epilogue (16 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + +where epilogue is:: + + __u8 late_flags + zero padding (15 bytes) + +late_flags has the same meaning as in msgr2.0-crc mode. + +Each segment and the epilogue are zero padded out to 16 bytes. +Technically, GCM doesn't require any padding because Counter mode +(the C in GCM) essentially turns a block cipher into a stream cipher. +But, if the overall input length is not a multiple of 16 bytes, some +implicit zero padding would occur internally because GHASH function +used by GCM for generating auth tags only works on 16-byte blocks. + +Issues: + +1. The sender encrypts the whole frame using a single nonce + and generating a single auth tag. Because segment lengths are + stored in the preamble, the receiver has no choice but to decrypt + and interpret the preamble without verifying the auth tag -- it + can't even tell how much to read off the wire to get the auth tag + otherwise! This creates a decryption oracle, which, in conjunction + with Counter mode malleability, could lead to recovery of sensitive + information. + + This issue extends to the first segment of the message frame as + well. As in msgr2.0-crc mode, ceph_msg_header2 cannot be safely + interpreted before the whole frame is read off the wire. + +2. Deterministic nonce construction with a 4-byte counter field + followed by an 8-byte fixed field is used. The initial values are + taken from the connection secret -- a random byte string generated + during the authentication phase. Because the counter field is + only four bytes long, it can wrap and then repeat in under a day, + leading to GCM nonce reuse and therefore a potential complete + loss of both authenticity and confidentiality for the connection. + This was addressed by disconnecting before the counter repeats + (CVE-2020-1759). + +### msgr2.1-secure mode + +Differences from msgr2.0-secure: + +1. The preamble, the first segment and the rest of the frame are + encrypted separately, using separate nonces and generating + separate auth tags. This gets rid of unverified plaintext use + and keeps msgr2.1-secure mode close to msgr2.1-crc mode, allowing + the implementation to receive message frames in a similar fashion + (little to no buffering, same scatter/gather logic, etc). + + In order to reduce the number of en/decryption operations per + frame, the preamble is grown by a fixed size inline buffer (48 + bytes) that the first segment is inlined into, either fully or + partially. The preamble auth tag covers both the preamble and the + inline buffer, so if the first segment is small enough to be fully + inlined, it becomes available after a single decryption operation. + +2. As in msgr2.1-crc mode, the epilogue is generated only if the + frame has more than one segment. The rationale is even stronger, + as it would require an extra en/decryption operation. + +3. For consistency with msgr2.1-crc mode, late_flags is replaced + with late_status (the built-in bit error detection isn't really + needed in secure mode). + +4. In accordance with `NIST Recommendation for GCM`_, deterministic + nonce construction with a 4-byte fixed field followed by an 8-byte + counter field is used. An 8-byte counter field should never repeat + but the nonce reuse protection put in place for msgr2.0-secure mode + is still there. + + The initial values are the same as in msgr2.0-secure mode. + + .. _`NIST Recommendation for GCM`: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf + +As in msgr2.0-secure mode, each segment is zero padded out to +16 bytes. If the first segment is fully inlined, its padding goes +to the inline buffer. Otherwise, the padding is on the remainder. +The corollary to this is that the inline buffer is consumed in +16-byte chunks. + +The unused portion of the inline buffer is zeroed. + +Some example frames: + +* A 0+0+0+0 frame (empty, nothing to inline, no epilogue):: + + { + preamble (32 bytes) + zero padding (48 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + +* A 20+0+0+0 frame (first segment fully inlined, no epilogue):: + + { + preamble (32 bytes) + segment1 payload (20 bytes) + zero padding (28 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) -* If neither FLAG_SIGNED or FLAG_ENCRYPTED is specified, things are simple:: +* A 0+70+0+0 frame (nothing to inline):: - frame_len - tag - payload - payload_padding (out to auth block_size) + { + preamble (32 bytes) + zero padding (48 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + { + segment2 payload (70 bytes) + zero padding (10 bytes) + epilogue (16 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) - - The padding is some number of bytes < the auth block_size that - brings the total length of the payload + payload_padding to a - multiple of block_size. It does not include the frame_len or tag. Padding - content can be zeros or (better) random bytes. +* A 20+70+0+350 frame (first segment fully inlined):: -* If FLAG_SIGNED has been specified:: + { + preamble (32 bytes) + segment1 payload (20 bytes) + zero padding (28 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + { + segment2 payload (70 bytes) + zero padding (10 bytes) + segment4 payload (350 bytes) + zero padding (2 bytes) + epilogue (16 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) - frame_len - tag - payload - payload_padding (out to auth block_size) - signature (sig_size bytes) +* A 105+0+0+0 frame (first segment partially inlined, no epilogue):: - Here the padding just makes life easier for the signature. It can be - random data to add additional confounder. Note also that the - signature input must include some state from the session key and the - previous message. + { + preamble (32 bytes) + segment1 payload (48 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + { + segment1 payload remainder (57 bytes) + zero padding (7 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) -* If FLAG_ENCRYPTED has been specified:: +* A 105+70+0+350 frame (first segment partially inlined):: - frame_len - tag { - payload - payload_padding (out to auth block_size) - } ^ stream cipher + preamble (32 bytes) + segment1 payload (48 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + { + segment1 payload remainder (57 bytes) + zero padding (7 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + { + segment2 payload (70 bytes) + zero padding (10 bytes) + segment4 payload (350 bytes) + zero padding (2 bytes) + epilogue (16 bytes) + } ^ AES-128-GCM cipher + auth tag (16 bytes) + +where epilogue is:: - Note that the padding ensures that the total frame is a multiple of - the auth method's block_size so that the message can be sent out over - the wire without waiting for the next frame in the stream. + __u8 late_status + zero padding (15 bytes) +late_status has the same meaning as in msgr2.1-crc mode. Message flow handshake ---------------------- -- 2.39.5