Since then, we've received feedback from users testing compression in the
field with extremely positive results. Clyso has also worked with a customer
running a large RGW deployment that has seen similarly positive results.
Advantages of using compression
===============================
1) Significantly lower write amplification and space amplification.
In the article above, we saw a 4X reduction in space usage in RocksDB when
writing very small (4KB) objects to RGW. On a real production cluster with
1.3 billion objects, Clyso observed a space usage reduction closer to 2.2X
which was still a substantial improvement. This win is important in
multiple cluster configurations:
1A) Pure HDD
Pure HDD clusters are often seek-limited under load. This directly impacts
how quickly RocksDB can write data out, which can increase compaction times.
1B) Hybrid Clusters (HDD Block + Flash DB/WAL)
In this configuration, spillover to the HDD can become a concern when
there isn't enough space on the flash devices to hold all of the RocksDB
SST files for the associated OSDs. Compression has a dramatic effect on
being able to keep all SST files in flash and avoid spillover.
1C) Pure flash-based clusters
A primary concern for pure flash-based clusters is write-amplification
and eventual wear-out of the flash under write-intensive scenarios.
RocksDB compression not only reduces space-amplification but also
write-amplification. That means lower wear on the flash cells and
longer flash life.
2) Reduced Compaction Times
The customer cluster that Clyso worked with utilized an HDD-only
configuration. Prior to enabling RocksDB compression, this cluster could
take up to several days to complete a manual compaction of a given OSD
during live operation. Enabling LZ4 compression in RocksDB reduced manual
compaction times to roughly 25-30 minutes, with ~2 hours being the longest
manual compaction time observed.
Potential Disadvantages of RocksDB compression
===============================================
1) Increased CPU usage
While there is CPU overhead associated with utilizing compression, the
effect appeared to be negligible, even on an NVMe-backed cluster. Despite
restricting NVMe OSDs to 2 cores so that they were extremely CPU-bound
during PUT operations, enabling compression had no notable effect on PUT
performance.
2) Lower GET throughput on NVMe
We noticed a very slight performance hit during GET operations on
NVMe-backed clusters, though the effect was primarily observed when using
Snappy compression rather than LZ4. LZ4 GET performance was very close to
that of uncompressed RocksDB.
3) Other performance impact
Other potential concerns include lower performance during iteration or
other operations, though I expect this to be unlikely.
RocksDB typically performs best when it can read data from SST files in
large chunks and then work from the block cache. Large readahead values
tend to be a win, either to read data into the block cache or so that
data can be read quickly from the kernel page cache. As far as I can
tell, compression is not having a negative impact here and in fact may be
helping in cases where the disk is already quite busy. In general, we
are already completely dependent on our own in-memory caches for things like
bluestore onodes to achieve high performance on NVMe-backed OSDs.
More importantly, the goal on 16.2.13+ should be to reduce the overhead
of iterating over tombstones, and our primary method to do this right
now is to issue compactions on iteration when too many tombstones are
encountered. Reducing the impact of compaction directly benefits this
goal.
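
To make the tombstone point concrete, here is a minimal standalone RocksDB
sketch (plain librocksdb rather than BlueStore's embedded instance; the
database path, key names, and counts are arbitrary). It writes and deletes a
batch of keys, iterates across the resulting tombstones, and then issues the
kind of full-range compaction that makes iteration cheap again:

    // Standalone RocksDB sketch, not Ceph code: shows why iterating over a
    // key range full of tombstones is slow until compaction removes them.
    #include <iostream>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.compression = rocksdb::kLZ4Compression;  // the proposed default

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/tombstone-demo", &db);
      if (!s.ok()) {
        std::cerr << "open failed: " << s.ToString() << std::endl;
        return 1;
      }

      // Write and then delete a large batch of keys, leaving tombstones behind.
      for (int i = 0; i < 100000; ++i) {
        db->Put(rocksdb::WriteOptions(), "key" + std::to_string(i), "value");
      }
      for (int i = 0; i < 100000; ++i) {
        db->Delete(rocksdb::WriteOptions(), "key" + std::to_string(i));
      }

      // Until compaction drops the tombstones, this iterator has to skip past
      // every deleted key before it finds live data or hits the end.
      rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
      it->Seek("key");
      std::cout << "iterator valid: " << it->Valid() << std::endl;
      delete it;

      // A full-range compaction rewrites the SST files (compressed, if enabled)
      // and drops the tombstones, so subsequent iteration is cheap again.
      db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);

      delete db;
      return 0;
    }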
Why LZ4 Compression?
====================
Snappy and LZ4 compression are both potential default options. Ceph
previously had a bug related to LZ4 compression that could corrupt data,
so on the surface it might be tempting to default to Snappy compression.
However, there are several reasons why I believe we should use LZ4
compression by default.
1) The LZ4 bug is fixed, and there have been no reports of issues since
the fix was put in place.
2) The Google developers have made changes to Snappy's build system that
impact Ceph. Many distributions are working around these changes, but
the Google developers have explicitly stated that they plan to support
only Google-specific use cases:
"We are unlikely to accept contributions to the build configuration
files, such as CMakeLists.txt. We are focused on maintaining a build
configuration that allows us to test that the project works in a few
supported configurations inside Google. We are not currently interested
in supporting other requirements, such as different operating systems,
compilers, or build systems."
3) LZ4 compression showed less of a performance impact than Snappy during
RGW 4KB object GETs. Snappy showed no performance gains over LZ4 in any
of the other tests, nor did it appear to offer a meaningful compression
advantage.
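
For reference, switching between the two candidates is a one-line change at
the RocksDB level, and a build's available codecs can be checked up front. A
small standalone sketch, assuming the linked RocksDB provides
rocksdb::GetSupportedCompressions() from rocksdb/convenience.h:

    // Standalone sketch: report whether the linked RocksDB build was compiled
    // with LZ4 and Snappy support, then select LZ4 when it is available.
    #include <algorithm>
    #include <iostream>
    #include <vector>

    #include "rocksdb/convenience.h"
    #include "rocksdb/options.h"

    static bool supported(rocksdb::CompressionType t) {
      const std::vector<rocksdb::CompressionType> all =
          rocksdb::GetSupportedCompressions();
      return std::find(all.begin(), all.end(), t) != all.end();
    }

    int main() {
      const bool lz4 = supported(rocksdb::kLZ4Compression);
      const bool snappy = supported(rocksdb::kSnappyCompression);
      std::cout << "LZ4: " << lz4 << " Snappy: " << snappy << std::endl;

      rocksdb::ColumnFamilyOptions cf_opts;
      // Switching between the two candidates is a one-enum change.
      cf_opts.compression =
          lz4 ? rocksdb::kLZ4Compression : rocksdb::kNoCompression;
      return 0;
    }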
Impact on existing clusters
===========================
Enabling or disabling compression in RocksDB will require an OSD restart,
but otherwise does not require user action. SST files will gradually be
compressed over time as part of the compaction process, and a manual
compaction can be issued to accelerate this. The same applies if users
would like to disable compression: new uncompressed SST files will be
written over time as part of the compaction process, and a manual
compaction can be issued to accelerate that transition.
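
At the RocksDB level this works because the compression type only affects SST
files as they are (re)written, which is what the manual compaction forces. A
standalone sketch of the transition (plain RocksDB rather than a BlueStore
OSD; in Ceph itself the corresponding knobs are the OSD's
bluestore_rocksdb_options string and `ceph tell osd.<id> compact`):

    // Standalone RocksDB sketch, not a BlueStore OSD: existing SST files keep
    // their old encoding until compaction rewrites them with the new setting.
    #include <iostream>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      const std::string path = "/tmp/compression-transition-demo";

      // First open: no compression. Write some data and flush it to SST files.
      {
        rocksdb::Options opts;
        opts.create_if_missing = true;
        opts.compression = rocksdb::kNoCompression;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(opts, path, &db);
        if (!s.ok()) {
          std::cerr << s.ToString() << std::endl;
          return 1;
        }
        for (int i = 0; i < 10000; ++i) {
          db->Put(rocksdb::WriteOptions(), "key" + std::to_string(i),
                  std::string(4096, 'x'));
        }
        db->Flush(rocksdb::FlushOptions());  // persist uncompressed SST files
        delete db;
      }

      // Second open: compression enabled (a restart, in OSD terms). The old
      // SST files are still readable and still uncompressed; only newly
      // compacted SSTs use LZ4. A full-range manual compaction rewrites
      // everything with the new setting right away.
      {
        rocksdb::Options opts;
        opts.compression = rocksdb::kLZ4Compression;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(opts, path, &db);
        if (!s.ok()) {
          std::cerr << s.ToString() << std::endl;
          return 1;
        }
        db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
        delete db;
      }
      return 0;
    }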
Conclusion
==========
In general, enabling RocksDB compression in bluestore appears to be a
dramatic win. I would like to make this our default behavior for Squid
going forward, assuming no issues are uncovered during teuthology testing.
Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
NitzanMordhai [Sun, 25 Jun 2023 08:41:55 +0000 (08:41 +0000)]
ceph-dencoder: Add missing rgw types to ceph-dencoder for accurate encode-decode comparison
Currently, ceph-dencoder lacks certain rgw types, preventing us from accurately checking the ceph corpus for encode-decode mismatches.
This pull request aims to address this issue by adding the missing types to ceph-dencoder.
To incorporate these types into ceph-dencoder, we need to introduce the necessary `dump` and `generate_test_instances`
functions that were missing in some types. These functions are essential for proper encoding and decoding of the added types.
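
For reference, these hooks follow the usual Ceph pattern. A minimal sketch for
a hypothetical type (the struct and its fields are invented for illustration
and are not one of the rgw types added by this PR):

    // Hypothetical type showing the dump()/generate_test_instances() pattern
    // that ceph-dencoder relies on; not one of the actual rgw types.
    #include <cstdint>
    #include <list>
    #include <string>

    #include "common/Formatter.h"
    #include "include/encoding.h"

    struct example_rgw_info {
      std::string bucket_name;
      uint64_t num_objects = 0;

      void encode(ceph::buffer::list& bl) const {
        using ceph::encode;
        ENCODE_START(1, 1, bl);
        encode(bucket_name, bl);
        encode(num_objects, bl);
        ENCODE_FINISH(bl);
      }
      void decode(ceph::buffer::list::const_iterator& p) {
        using ceph::decode;
        DECODE_START(1, p);
        decode(bucket_name, p);
        decode(num_objects, p);
        DECODE_FINISH(p);
      }
      // Lets ceph-dencoder print a decoded instance in a readable form.
      void dump(ceph::Formatter* f) const {
        f->dump_string("bucket_name", bucket_name);
        f->dump_unsigned("num_objects", num_objects);
      }
      // Lets ceph-dencoder build instances for encode/decode round-trips.
      static void generate_test_instances(std::list<example_rgw_info*>& o) {
        o.push_back(new example_rgw_info);
        o.push_back(new example_rgw_info);
        o.back()->bucket_name = "test-bucket";
        o.back()->num_objects = 42;
      }
    };
    WRITE_CLASS_ENCODER(example_rgw_info)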
Including these types enables a more comprehensive analysis of encode-decode consistency, improving confidence in the robustness and correctness of the ceph corpus and in the overall reliability of ceph-dencoder.
Rishabh Dave [Wed, 6 Sep 2023 22:16:39 +0000 (03:46 +0530)]
mon/FSCommands: fix variable names and function names in calls
PR #51942 was merged today and PR #52409 was merged last week. The latter
PR changed "fs" to "fsp", made mds_map and fscid private, and introduced
get_mds_map() and get_fscid() methods instead.
The combination of the two PRs introduced compilation errors and made
it impossible to build the binaries on the latest main branch. This
commit attempts to fix the issue.
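
Roughly, the change at the call sites looks like the following simplified,
hypothetical sketch (stand-in types, not the real mon/MDS classes):

    // Simplified, hypothetical illustration of the accessor change; the real
    // code lives in mon/FSCommands.cc and the MDS map headers.
    #include <cstdint>
    #include <memory>

    using fs_cluster_id_t = int64_t;  // stand-in for the real typedef
    struct MDSMap { /* ... */ };

    class Filesystem {
    public:
      const MDSMap& get_mds_map() const { return mds_map; }
      fs_cluster_id_t get_fscid() const { return fscid; }
    private:
      // Previously public and accessed directly as fs->mds_map / fs->fscid.
      MDSMap mds_map;
      fs_cluster_id_t fscid = 0;
    };

    void handle_command(const std::shared_ptr<Filesystem>& fsp) {
      // Old call sites used fs->mds_map and fs->fscid; after the rename and
      // encapsulation they become:
      const MDSMap& mdsmap = fsp->get_mds_map();
      const fs_cluster_id_t id = fsp->get_fscid();
      (void)mdsmap;
      (void)id;
    }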
Fixes: https://tracker.ceph.com/issues/62729 Signed-off-by: Rishabh Dave <ridave@redhat.com>
Rishabh Dave [Sun, 30 Jul 2023 17:27:37 +0000 (22:57 +0530)]
qa/cephfs: CephFSTestCase.create_client() must return keyring
Replace the call to run_ceph_cmd() with a call to get_ceph_cmd_stdout() in
method qa.tasks.cephfs.cephfs_test_case.CephFSTestCase.create_client().
run_ceph_cmd() does not return the keyring, which is wrong.
get_ceph_cmd_stdout() returns the stdout of the "ceph auth add" command,
which is the keyring that is expected to be returned by
CephFSTestCase.create_client().
Fixes: https://tracker.ceph.com/issues/62246 Signed-off-by: Rishabh Dave <ridave@redhat.com>
The Windows build script uses static linking by default, the
reason being that some tests were failing to build otherwise,
mostly due to unspecified dependencies.
Now that the issue has been addressed, we can enable dynamic linking
by default.
It is worth mentioning that the Ceph MSI build script already uses
dynamic linking.
While at it, we'll drop some duplicate defaults from
"win32_deps_build.sh". For better clarity, we'll avoid exporting
some "win32_build.sh" variables, instead passing them explicitly
to "win32_deps_build.sh".
doc/man: remove docs about support for unix domain sockets
doc/man: support for unix domain sockets is not implemented, hence we
removed the documentation about it.
(Note: the changes in this commit were the work of Rok Jaklič in
https://github.com/ceph/ceph/pull/48537. This pull request has been
raised because that pull request was for some mysterious reason causing
merge conflicts that were never resolved.)
Co-authored-by: Rok Jaklič <rjaklic@gmail.com> Signed-off-by: Zac Dover <zac.dover@proton.me>
Or Ozeri [Thu, 25 Nov 2021 19:03:02 +0000 (21:03 +0200)]
librbd: remove remap_to_* and image crypto layer
This commit removes the crypto image dispatch layer.
Instead, data offset calculation is taken from ImageCtx->encryption_format.
This change makes the remap_to_* API unnecessary, so it is removed.
Rishabh Dave [Tue, 22 Aug 2023 09:53:05 +0000 (15:23 +0530)]
qa/cephfs: fix ior project build failure
1. Re-install mpich packages.
2. Use a more recent version of the ior project.
3. Log contents of binary directory for easy debugging.
4. Set "-x" for bash session so commands are also logged.
Fixes: https://tracker.ceph.com/issues/61399 Signed-off-by: Rishabh Dave <ridave@redhat.com>
Nizamudeen A [Wed, 30 Aug 2023 05:20:30 +0000 (10:50 +0530)]
mgr/dashboard: remove green tick on old password field
a green tick is shown on the field where we enter the old password in the
login password change form. It starts showing the green tick as soon as we
start typing in it. Removing that because it misleads the user.
Fixes: https://tracker.ceph.com/issues/62644 Signed-off-by: Nizamudeen A <nia@redhat.com>
Ali Masarwa [Mon, 4 Sep 2023 12:25:44 +0000 (15:25 +0300)]
Merge pull request #52964 from AliMasarweh/wip-alimasa-persistant-q-enhance
RGW: added a per-topic configuration to control notification persistency
Add a per-topic persistency configuration that overrides the global settings
qa: add "failover / failback loop" test for rbd-mirror
For snapshot-based mirroring, check that demote (or other) mirror
snapshots don't pile up. Nothing in particular to assert on for
journal-based mirroring, but the test is still useful.
Ilya Dryomov [Sat, 26 Aug 2023 11:04:52 +0000 (13:04 +0200)]
librbd: make CreatePrimaryRequest remove any unlinked mirror snapshots
After commit ac552c9b4d65 ("librbd: localize snap_remove op for mirror
snapshots"), rbd-mirror daemon no longer removes mirror snapshots when
it's done syncing them -- instead it only unlinks from them. However,
CreatePrimaryRequest state machine was not adjusted to compensate and
hence two cases were missed:
- primary demotion snapshot (rbd-mirror daemon unlinks from primary
demotion snapshots just like it does from regular primary snapshots);
this comes up when an image is demoted but then promoted on the same
cluster
- non-primary demotion snapshot (unlike regular non-primary snapshots,
non-primary demotion snapshots store peer uuids and rbd-mirror daemon
does unlinking just like in the case of primary snapshots); this
comes up when an image is demoted and promoted on the other cluster
Related is the case of orphan snapshots. Since they are dummy to begin
with, CreatePrimaryRequest would now clean up the orphan snapshot after
the creation of the force promote snapshot.
Fixes: https://tracker.ceph.com/issues/61707 Co-authored-by: Christopher Hoffman <choffman@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Tue, 22 Aug 2023 15:27:50 +0000 (17:27 +0200)]
librbd: don't attempt to remove image state on orphan snapshots
Despite being mirror snapshots, orphan snapshots don't have image
state: see CreateNonPrimaryRequest::write_image_state() for a similar
is_orphan() check. Attempting to remove image state generates bogus
"failed to read image state object" and "failed to remove image state"
errors.
Adam Kupczyk [Fri, 11 Aug 2023 14:24:57 +0000 (14:24 +0000)]
tools/variable_load: Add generator of variable workload
The tool is dedicated to creating highly variable workloads.
The intended uses are:
1) scraper testing - a tool that sniffs OSD ops
2) CoDel testing - a bandwidth/latency optimization algorithm in BlueStore (BS)
Initial commit.
rgw: make error message more friendly on rgw-restore-bucket-index
When the referenced bucket cannot be found, remind the user that they
may need to set the realm, zonegroup, and/or zone. This improvement was
suggested by Madhavi Kasturi.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
J. Eric Ivancich [Wed, 12 Jul 2023 17:54:07 +0000 (13:54 -0400)]
rgw: allow multisite specification used w/ rgw-restore-bucket-index script
When the metadata for a bucket is requested, only the default
realm/zonegroup/zone is currently supported. This adds three new
command-line options to rgw-restore-bucket-index.
The multisite specification will then be used in invocations of
`radosgw-admin`, such as to query the zone, get metadata, and invoke
the "object reindex" subcommand.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
J. Eric Ivancich [Wed, 14 Jun 2023 19:53:19 +0000 (15:53 -0400)]
rgw: enhances rgw-restore-bucket-index script
This enhances the script to both process versioned buckets correctly
and to handle object names that begin with underscore.
If the bucket is versioned it submits each version chronologically
(based on mtime) to be reindexed in order to "replay" the modification
of objects. However, mtime is not a perfect indicator, so the script
additionally looks at the OLH object to determine the most recent version
and makes sure that it is replayed last. The order of previous
versions is likely correct, but not guaranteed to be so.
Additional logic is added to handle objects with names that begin with
an underscore ('_'), since that is used as a delimiter and needs to be
escaped, and rados object locators are also used.
A man page for the script is added.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
John Mulligan [Thu, 17 Aug 2023 19:11:57 +0000 (15:11 -0400)]
cephadm: black format net_utils.py
This contains one small manual change reversing the black formatting.
According to the black docs [1], flake8's E203 is not PEP 8 compliant, but
since flake8 is currently mandatory and black formatting is not, we'll
leave it in a form that makes flake8 happy and attempt to resolve the
disagreement at some point in the future.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Pair-programmed-with: Adam King <adking@redhat.com> Co-authored-by: Adam King <adking@redhat.com> Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Wed, 16 Aug 2023 20:44:55 +0000 (16:44 -0400)]
cephadm: black format file_utils.py
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Pair-programmed-with: Adam King <adking@redhat.com> Co-authored-by: Adam King <adking@redhat.com>