Since then, we've gotten feedback from users in the field who tested
compression with extremely positive results. Clyso has also worked with
a customer whose large RGW deployment has seen extremely positive
results.
Advantages of using compression
===============================
1) Significantly lower write amplification and space amplification.
In the article above, we saw a 4X reduction in space usage in RocksDB when
writing very small (4KB) objects to RGW. On a real production cluster with
1.3 billion objects, Clyso observed a space usage reduction closer to 2.2X
which was still a substantial improvement. This win is important in
multiple cluster configurations:
1A) Pure HDD
Pure HDD clusters are often seek limited under load. This directly impacts
how quickly RocksDB can write data out, which can increase compaction times.
1B) Hybrid Clusters (HDD Block + Flash DB/WAL)
In this configuration, spillover to the HDD can become a concern when
the flash devices don't have enough space to hold all RocksDB SST files
for all of the associated OSDs. Compression has a dramatic effect on
being able to store all SST files in flash and avoid spillover.
1C) Pure Flash based clusters
A primary concern for pure flash based clusters is write-amplification
and eventual wear out of the flash under write-intensive scenarios.
RocksDB compression not only reduces space-amplification but also
write-amplification. That means lower wear on the flash cells and
longer flash life.
2) Reduced Compaction Times
The customer cluster that Clyso worked with utilized an HDD-only
configuration. Prior to utilizing RocksDB compression, this cluster
could take up to several days to complete a manual compaction of a given
OSD during live operation. Enabling LZ4 compression in RocksDB reduced
manual compaction time to closer to 25-30 minutes, with ~2 hours being
the longest manual compaction time observed.
Potential Disadvantages of RocksDB compression
==============================================
1) Increased CPU usage
While there is CPU usage overhead associated with utilizing compression,
the effect appeared to be negligible, even on an NVMe backed cluster.
Despite restricting NVMe OSDs to 2 cores so that they were extremely
CPU bound during PUT operations, enabling compression had no notable
effect on PUT performance.
2) Lower GET throughput on NVMe
We noticed a very slight performance hit for GET operations on NVMe
backed clusters, though the effect was primarily observed when using
Snappy compression and not LZ4 compression. LZ4 GET performance was very
close to performance with RocksDB uncompressed.
3) Other performance impact
Other potential concerns might include lower performance during
iteration or other actions, though I expect this to be unlikely.
RocksDB typically performs best when it can read data from SST files in
large chunks and then work from the block cache. Large readahead values
tend to be a win, either to read data into the block cache or so that
data can be read quickly from the kernel page cache. As far as I can
tell, compression is not having a negative impact here and in fact may be
helping in cases where the disk is already quite busy. In general, we
are already completely dependent on our own in-memory caches for things like
bluestore onodes to achieve high performance on NVMe backed OSDs.
More importantly, the goal on 16.2.13+ should be to reduce the overhead
of iterating over tombstones, and our primary method to do this right
now is to issue compactions on iteration when too many tombstones are
encountered. Reducing the impact of compaction directly benefits this
goal.
Why LZ4 Compression?
====================
Snappy and LZ4 compression are both potential default options. Ceph
previously had a bug related to LZ4 compression that could corrupt data,
so on the surface it might be tempting to default to using snappy
compression. However, there are several reasons why I believe we should
use LZ4 compression by default.
1) The LZ4 bug is fixed, and there have been no reports of issues since
the fix was put in place.
2) The Google developers have made changes to Snappy's build system that
impact Ceph. Many distributions are working around these changes, but
the Google developers have explicitly stated that they plan to only
support Google-specific use cases:
"We are unlikely to accept contributions to the build configuration
files, such as CMakeLists.txt. We are focused on maintaining a build
configuration that allows us to test that the project works in a few
supported configurations inside Google. We are not currently interested
in supporting other requirements, such as different operating systems,
compilers, or build systems."
3) LZ4 compression showed less of a performance impact than Snappy
during RGW 4KB object GETs. Snappy showed no performance gains vs LZ4 in
any of the other tests, nor did it appear to show a meaningful
compression advantage.
Impact on existing clusters
===========================
Enabling/Disabling compression in RocksDB will require an OSD restart,
but otherwise does not require user action. SST files will gradually be
compressed over time as part of the compaction process. A manual
compaction can be issued to help accelerate this process. The same goes
if users would like to disable compression. New uncompressed SST files
will be written over time as part of the compaction process, and a
manual compaction can be issued to accelerate this process.
Conclusion
==========
In general, enabling RocksDB compression in bluestore appears to be a
dramatic win. I would like to make this our default behavior for Squid
going forward assuming no issues are uncovered during teuthology testing.
Nizamudeen A [Tue, 16 Jan 2024 05:21:56 +0000 (10:51 +0530)]
admin/doc-requirements: bump Sphinx to 5.0.2
```
Running Sphinx v4.5.0
Sphinx version error:
The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.
```
Casey Bodley [Mon, 8 Jan 2024 16:24:18 +0000 (08:24 -0800)]
make-dist: don't use --continue option for wget
the boost jfrog mirror is broken and returns an HTML error page instead
of the archive. the file size of this page is 11534 bytes
when download_from() retries the download from download.ceph.com, the -c
option tells it to resume the download of the existing file. the
resulting boost_1_82_0.tar.bz2 ends up with the correct total file size
of 121325129 bytes, but the first 11534 bytes still correspond to the
HTML from jfrog. that causes the sha256sum mismatch
remove the -c option so that wget fetches the archive in its entirety
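
The failure mode described above can be illustrated with a small Python model. This is only a sketch: `resume_download()` is a hypothetical stand-in for what `wget -c` does when the local file is shorter than the remote one (keep the local bytes, fetch the rest), and the archive contents and size here are made up. Only the 11534-byte error page size comes from the commit message.

```python
import hashlib

def resume_download(existing: bytes, remote: bytes) -> bytes:
    """Hypothetical model of a resumed download: the bytes already on
    disk are kept and only the remainder of the remote file is fetched."""
    return existing + remote[len(existing):]

def fresh_download(remote: bytes) -> bytes:
    """Model of a download without resume: the whole file is fetched."""
    return remote

# Stand-in data: an 11534-byte HTML error page left behind by the first
# mirror, and a made-up "archive" served by the second mirror.
html_error_page = b"<html>error</html>".ljust(11534, b" ")
archive = bytes(range(256)) * 100

resumed = resume_download(html_error_page, archive)
fresh = fresh_download(archive)

# The resumed file has the right size but the wrong leading bytes,
# so its checksum cannot match the checksum of the real archive.
same_size = len(resumed) == len(archive)
resumed_ok = hashlib.sha256(resumed).digest() == hashlib.sha256(archive).digest()
fresh_ok = hashlib.sha256(fresh).digest() == hashlib.sha256(archive).digest()
print(same_size, resumed_ok, fresh_ok)
```

In this model the size check passes for the resumed file while the checksum check fails, which is the symptom the commit message reports; fetching the file in its entirety avoids it.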
Zac Dover [Wed, 3 Jan 2024 08:41:51 +0000 (18:41 +1000)]
doc/radosgw: edit "Add/Remove a Key"
Edit the section "Add/Remove a Key" in doc/radosgw/admin.rst. Each
operation (e.g. "Adding an S3 key pair for a user", "Removing an S3 key
pair for a user") now has its own subsection. This increased granularity
should make it easier in the future to link to each of these specific
operations, if needed.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit f62e93cbe73cd8f624a6c99497051c1a0aaf3ab6)
Zac Dover [Sun, 31 Dec 2023 06:22:33 +0000 (16:22 +1000)]
doc/radosgw: edit "remove a subuser"
Edit the English language in the section "Remove a Subuser" in
doc/radosgw/admin.rst. This commit is made in response to Matt
Benjamin's request for improvement of this section
(https://github.com/ceph/ceph/pull/55028#discussion_r1438599833).
Zac Dover [Wed, 20 Dec 2023 05:00:38 +0000 (15:00 +1000)]
doc/radosgw: edit compression.rst
Improve the grammar and simplify the sentence structure of
doc/radosgw/compression.rst. This commit is made in anticipation of a
near-future commit that will list the compression algorithms available
to users of Ceph.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 84c5d2c828c2fbd70bdeadedd341ca42ddb1c20c)
Zac Dover [Tue, 19 Dec 2023 09:15:57 +0000 (19:15 +1000)]
doc/install: update "update submodules"
Remove misleading material that would give readers the wrong idea about
when stale submodules are present. This commit is made in response to
information given to me by Ilya Dryomov here: https://github.com/ceph/ceph/pull/54929#issuecomment-1859237986.
The client incorrectly decodes max_xattr_size (type: uint64_t) into
bal_rank_mask (type: string).
This situation arose for a couple of reasons:
* the kclient patchset handling `max_xattr_size` was merged early on
and another MDS side change that bumped the MDSMap encoding version
to 17 got merged in the midst (PR #43284). Details in comment:
Ilya Dryomov [Fri, 1 Dec 2023 17:29:12 +0000 (18:29 +0100)]
test/librbd: drop DiffIterateTest.DiffIterateRegression6926
This was added to test [1]. It's duplicated by several cases in
DiffIterateTest.DiffIterateDeterministicPP now. Specifically, the
issue could be reproduced by any of:
(8) beginning of time -> snap2
(9) snap1 -> snap2
(10) beginning of time -> snap1
Ilya Dryomov [Fri, 1 Dec 2023 17:54:19 +0000 (18:54 +0100)]
test/librbd: drop TestLibRBD.SnapDiff
This was added to integration test [1], separate from the fix which
went in only with unit test adjustments. It's duplicated by several
cases in DiffIterateTest.DiffIterateDeterministic now. Specifically,
the issue could be reproduced by any of:
(3) snap2 -> HEAD
(4) snap3 -> HEAD
(7) snap2 -> snap3
scribble()-based DiffIterate tests are too weak: at least two
regressions that should have been caught by DiffIterate.DiffIterate or
DiffIterate.DiffIterateStress were missed [1][2]. Aside from the
randomness which can be both a good and a bad thing, asserts there
ensure only that the returned diff covers all changes that were made.
If the returned diff is excessive or otherwise bogus, this isn't
detected [3].
Add a deterministic test to systematically cover the most common cases
that don't involve discards. A similar test for discards will be added
with the fix for [4].
Comment out debug log in vector_iterate_cb() like it's done in
iterate_cb().
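
The difference between a coverage-only assertion and a deterministic one can be shown with a toy Python model. This is not the librbd test code: the extent representation as `(offset, length)` tuples and the helper names `covers_all_changes()` and `matches_exactly()` are assumptions made for illustration.

```python
def to_offsets(extents):
    """Expand (offset, length) extents into a set of byte offsets."""
    covered = set()
    for off, length in extents:
        covered.update(range(off, off + length))
    return covered

def covers_all_changes(reported, written):
    """Coverage-only check: every written byte appears in the diff."""
    return to_offsets(written) <= to_offsets(reported)

def matches_exactly(reported, expected):
    """Deterministic check: the diff is exactly the expected extents."""
    return sorted(reported) == sorted(expected)

# Two small writes, and a diff that over-reports one large extent.
written = [(0, 512), (4096, 512)]
excessive_diff = [(0, 8192)]

coverage_passes = covers_all_changes(excessive_diff, written)
exact_passes = matches_exactly(excessive_diff, written)
print(coverage_passes, exact_passes)
```

The over-reporting diff satisfies the coverage check but fails the exact comparison, which is why a test with known writes and known expected extents can catch regressions that a coverage-only assertion lets through.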
Ilya Dryomov [Mon, 27 Nov 2023 10:59:26 +0000 (11:59 +0100)]
librbd: fix read_whole_object handling in ObjectListSnapsRequest
Originally, in commit 2be4840afd4f ("librados/snap_set_diff: don't
assert on empty snapset"), exists was set to true. This didn't make
ObjectListSnapsRequest, causing the following deep-copy tests to fail
when run against calc_snap_set_diff() rigged to return "whole object"
as described in [1]:
This is a regression introduced in commit cc87a8bd697e ("librbd:
deep-copy object utilizes image-extent IO methods") by way of commit 11923e234efc ("librbd: generic object list snapshot request").
Ilya Dryomov [Mon, 27 Nov 2023 09:11:52 +0000 (10:11 +0100)]
librbd: fix LIST_SNAPS_FLAG_WHOLE_OBJECT behavior
Bundling read_whole_object and LIST_SNAPS_FLAG_WHOLE_OBJECT cases
together is wrong:
- In read_whole_object case, calc_snap_set_diff() sets just
read_whole_object. Everything else is zeroed out and may require
resetting to fit with the rest of ObjectListSnapsRequest logic.
- In LIST_SNAPS_FLAG_WHOLE_OBJECT case, only the diff should be
expanded. Everything else is set by calc_snap_set_diff() and should
be used as is. This goes for end_size in particular -- if it's reset
to object size, bogus zero extents may be returned as the object
would appear to have grown.
This is a regression introduced in commit 4429ed4f3f4c ("librbd: switch
diff iterate API to use new snaps list dispatch methods") by way of
commit 66dd53d9c4d9 ("librbd: optionally return full object extent for
any snapshot deltas").
Ilya Dryomov [Sun, 19 Nov 2023 21:44:28 +0000 (22:44 +0100)]
test/librbd: make ListSnapsWholeObject actually test stuff
Despite being added in commit 66dd53d9c4d9 ("librbd: optionally return
full object extent for any snapshot deltas") ostensibly to test the new
LIST_SNAPS_FLAG_WHOLE_OBJECT code, it surely doesn't do that because
the flag isn't even passed to MockObjectListSnapsRequest::create().
I can only guess, but it looks like snap ID 3 was intended to be
a starting point. Otherwise, with 0 and CEPH_NOSNAP passed as snap
IDs, the overlap that is set up for the clone wouldn't affect the
computation in any way.
Use snap ID 3 as a starting point and run both with and without
LIST_SNAPS_FLAG_WHOLE_OBJECT on the same snapset to pinpoint the
difference.
Ilya Dryomov [Sat, 11 Nov 2023 13:15:49 +0000 (14:15 +0100)]
librados/snap_set_diff: set end_size only if end object exists
Since commit 73f50a13109f ("rbd-mirror: use generalized deep copy for
image sync"), the only user of calc_snap_set_diff() immediately unsets
end_size otherwise.
calc_snap_set_diff() semantics are clearer if end_size is set together
with end_exists and clone_end_snap_id.
Ilya Dryomov [Sat, 9 Dec 2023 20:00:42 +0000 (21:00 +0100)]
test/librbd: avoid config-related crashes in DiscardWithPruneWriteOverlap
For reasons that I think no longer apply today, set_val() and
set_val_or_die() refuse to set "type: str" config options that aren't
marked as "can be changed at runtime" -- set_val() returns an error and
set_val_or_die() terminates the process. What is and isn't marked as
"can be changed at runtime" seems to be pretty much random both within
and outside of RBD, so let's just refactor how config is set here.
While at it, I realized that reproducer config is underspecified:
- for rbd_cache_policy and rbd_cache_writethrough_until_flush settings
to matter, rbd_cache must be set to true and rbd_cache_max_dirty must
be set to a positive number
- order should be set explicitly, because rbd_default_order can be as
low as 12 (for 4096-byte objects), interfering with the logic of the
test
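
As a rough illustration of why the order matters, the object size is 2 raised to the order, so the same I/O can land in one object or span several depending on the setting. The sketch below is an assumption-laden example: `objects_touched()` is a hypothetical helper, the 8192-byte write is made up and is not the I/O pattern of the test, and order 22 is used only as a larger value for comparison.

```python
def object_size(order: int) -> int:
    """Object size in bytes for a given order (size = 2**order)."""
    return 1 << order

def objects_touched(offset: int, length: int, order: int) -> int:
    """Number of objects a byte range [offset, offset + length) spans."""
    first = offset >> order
    last = (offset + length - 1) >> order
    return last - first + 1

# Hypothetical 8192-byte write at offset 0 under two different orders.
small_order_size = object_size(12)
small_order_objects = objects_touched(0, 8192, 12)
large_order_objects = objects_touched(0, 8192, 22)
print(small_order_size, small_order_objects, large_order_objects)
```

With order 12 the hypothetical write is split across two 4096-byte objects, while with order 22 it stays within a single object, so a test whose logic depends on object boundaries needs the order pinned explicitly.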