Ilya Dryomov [Thu, 12 Dec 2024 20:32:39 +0000 (21:32 +0100)]
librbd/migration/HttpClient: socket isn't shut down on some state transitions
If shut_down() gets delayed until a) the state transition from
STATE_RESET_CONNECTING completes and the reconnect is unsuccessful or
b) the state transition from STATE_RESET_DISCONNECTING completes (i.e.
next_state is STATE_UNINITIALIZED or STATE_RESET_CONNECTING), the
socket needs to be shut down before m_on_shutdown is invoked. The line
of thought here is the same as for the corresponding state transitions
that don't involve STATE_SHUTTING_DOWN.
Ilya Dryomov [Wed, 11 Dec 2024 15:25:13 +0000 (16:25 +0100)]
librbd/migration/HttpClient: avoid hitting an assert in advance_state()
If the shutdown gets delayed until the state transition from
STATE_RESET_CONNECTING completes and the reconnect is successful
(i.e. next_state is STATE_READY), we eventually hit "unexpected
state transition" assert in advance_state(). The reason is that
advance_state() would update m_state and call disconnect() under
STATE_READY instead of STATE_SHUTTING_DOWN. After the disconnect
maybe_finalize_shutdown() would enter advance_state() again with
STATE_SHUTDOWN as next_state, but the transition to that from
STATE_READY is invalid.
Plug this by not transitioning to next_state if current_state is
STATE_SHUTTING_DOWN.
Ilya Dryomov [Mon, 9 Dec 2024 10:19:57 +0000 (11:19 +0100)]
librbd/migration/HttpClient: ignore stream_truncated when shutting down SSL
Propagate ec to handle_disconnect() and use it to suppress
stream_truncated errors. Here is a quote from Beast documentation [1]:
// Gracefully shutdown the SSL/TLS connection
error_code ec;
stream.shutdown(ec);
// Non-compliant servers don't participate in the SSL/TLS shutdown process and
// close the underlying transport layer. This causes the shutdown operation to
// complete with a `stream_truncated` error. One might decide not to log such
// errors as there are many non-compliant servers in the wild.
if(ec != net::ssl::error::stream_truncated)
log(ec);
... and a commit that made ignoring stream_truncated safe [2]:
// ssl::error::stream_truncated, also known as an SSL "short read",
// indicates the peer closed the connection without performing the
// required closing handshake
// [...]
// When a short read would cut off the end of an HTTP message,
// Beast returns the error beast::http::error::partial_message.
// Therefore, if we see a short read here, it has occurred
// after the message has been completed, so it is safe to ignore it.
Ilya Dryomov [Sat, 7 Dec 2024 12:52:41 +0000 (13:52 +0100)]
librbd/migration/HttpClient: drop SslHttpSession::m_ssl_enabled
The remaining callers of disconnect() call it only when m_ssl_enabled
is set to true (i.e. after the handshake is completed):
- shut_down(), in STATE_READY
- maybe_finalize_reset(), very shortly after transitioning out of
STATE_READY as part of performing a reset
- advance_state(), on a transition to STATE_READY that is intercepted
by a previously delayed shut down
m_ssl_enabled isn't used outside of disconnect() and on top of that
is never cleared.
Ilya Dryomov [Sat, 7 Dec 2024 11:22:52 +0000 (12:22 +0100)]
librbd/migration/HttpClient: don't call disconnect() in handle_handshake()
With m_ssl_enabled set to false, disconnect() is a no-op. Since
m_ssl_enabled is flipped to true only when the handshake succeeds,
calling disconnect() on "failed to complete handshake" error is bogus
(as would be attempting to shut down SSL there).
Ilya Dryomov [Fri, 6 Dec 2024 15:51:51 +0000 (16:51 +0100)]
librbd/migration/HttpClient: avoid reusing ssl_stream after shut down
ssl_stream objects can't be reused after shut down: despite
a successful reconnect and handshake, any attempt to read or write
fails with "end of stream" (beast.http:1) or "protocol is shutdown"
(asio.ssl:337690831) error respectively. This doesn't appear to be
documented, but Beast and ASIO authors both mention that the stream
must be destroyed and recreated [1][2].
This was missed because the only integration test with a big enough
image used http instead of https.
Ilya Dryomov [Fri, 6 Dec 2024 13:42:55 +0000 (14:42 +0100)]
librbd/migration/HttpClient: don't shut down socket in resolve_host()
resolve_host() is called from init() and issue() when transitioning out
of STATE_UNINITIALIZED and from advance_state() right after the call to
shutdown_socket(). In all three cases the socket should get closed, so
drop the redundant call and place asserts in connect() implementations
instead.
Zac Dover [Fri, 13 Dec 2024 06:12:49 +0000 (16:12 +1000)]
doc/cephfs: edit 3rd 3rd of mount-using-kernel-driver
Edit the third third of doc/cephfs/mount-using-kernel-driver.rst in
preparation for correcting mount commands that may not work in Reef as
described in this documentation.
This commit edits only English-language strings in
doc/cephfs/mount-using-kernel-driver.rst. No technical content (that is,
no commands and no settings) have been altered in this commit.
Technical alterations to this file will be made only after the English
is unambiguous.
This PR follows the following two PRs:
https://github.com/ceph/ceph/pull/61048 - 1st 3rd
https://github.com/ceph/ceph/pull/61049 - 2nd 3rd
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com> Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 9c7580a2935511d009c9e66885e76635aa504ee8)
Zac Dover [Wed, 4 Dec 2024 20:43:12 +0000 (21:43 +0100)]
doc/dev: instruct devs to backport
Add a note to doc/dec/development-workflow.rst that instructs developers
to do their own backports. This change was requested by Laura Flores on
04 Dec 2024.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com> Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 5d584b4badb606d372c266424f59076408f62f40)
Zac Dover [Wed, 11 Dec 2024 14:15:14 +0000 (15:15 +0100)]
doc/cephfs: edit first 3rd of mount-using-kernel-driver
Edit the first third of doc/cephfs/mount-using-kernel-driver.rst in
preparation for correcting mount commands that may not work in Reef as
described in this documentation.
Dan Mick [Thu, 21 Nov 2024 02:18:59 +0000 (18:18 -0800)]
container/Containerfile: purge .repo files with secrets before commit
ceph.repo had creds in it for download.ceph.com/prerelease.
Remove the .repo files we construct, since they're not necessary
once the container is built (no one should be dnf'ing anything
in the container).
Dan Mick [Fri, 1 Nov 2024 02:55:36 +0000 (19:55 -0700)]
container/make-manifest-list.py
- don't print command failure in worker; let the caller print them
if desired (allow silent failure)
- allow for empty tags list
- look for CEPH_SHA1. GIT_COMMIT was the sha1 of the ceph-container.git
commit
- change default paths to prerelease
- add --dry-run to avoid final push
- rename 'HOST' to 'CONTAINER_HOST'
- Use ARCH_SPECIFIC_HOST instead of CONTAINER_HOST (which is used by podman)
Zac Dover [Wed, 4 Dec 2024 02:13:05 +0000 (03:13 +0100)]
doc/rados: fix sentences in health-checks (3 of x)
Make sentences agree at the head of each section in
doc/rados/operations/health-checks.rst. The sentences were sometimes in
the imperative mood and sometimes in the declarative mood.
This commit edits the second third of
doc/rados/operations/health-checks.rst.
Note to (I hope soon) future Zac: There are a a couple of places near
the end of this file where the sentences are ungrammatical. Update these
in a separate PR (in isolation, so that the grammar and technical
accuracy of these sentences can be the primary focus of the reviewers).
Zac Dover [Tue, 3 Dec 2024 11:02:43 +0000 (12:02 +0100)]
doc/rados: fix sentences in health-checks (2 of x)
Make sentences agree at the head of each section in
doc/rados/operations/health-checks.rst. The sentences were sometimes in
the imperative mood and sometimes in the declarative mood.
This commit edits the second third of
doc/rados/operations/health-checks.rst.
Zac Dover [Tue, 3 Dec 2024 08:28:09 +0000 (09:28 +0100)]
doc/rados: make sentences agree in health-checks.rst
Make sentences agree at the head of each section in
doc/rados/operations/health-checks.rst. The sentences were sometimes in
the imperative mood and sometimes in the declarative mood.
This commit edits the first third of
doc/rados/operations/health-checks.rst.
Zac Dover [Sat, 30 Nov 2024 16:50:53 +0000 (17:50 +0100)]
doc/glossary.rst: add "Dashboard Plugin"
Add an entry below the (Mimic-era and therefore outdated but
nonetheless historically important) Dashboard Plugin key word in the
glosssary, which before now had never been added to the glossary.
Zac Dover [Fri, 29 Nov 2024 03:12:02 +0000 (13:12 +1000)]
doc/radosgw: update rgw_dns_name doc
Update doc/radosgw/s3/commons.rst with the changes made by Jiffin Tony
Thottan in https://github.com/ceph/ceph/pull/54524 and the suggestions
made in that same PR by Anthony D'Atri.
Explain how to set rgw_dns_name to a domain name in order to configure
access to virtual hosted buckets.
sajibreadd [Mon, 27 May 2024 07:30:06 +0000 (13:30 +0600)]
Warning added for slow operations and stalled read in BlueStore. User can control how much time the warning should persist after last occurence and maximum number of operations as a threshold will be considered for the warning.
common/options: Change HDD OSD shard configuration defaults for mClock
Based on tests performed at scale on a HDD based cluster, it was found
that scheduling with mClock was not optimal with multiple OSD shards. For
e.g., in the scaled cluster with multiple OSD node failures, the client
throughput was found to be inconsistent across test runs coupled with
multiple reported slow requests.
However, the same test with a single OSD shard and with multiple worker
threads yielded significantly better results in terms of consistency of
client and recovery throughput across multiple test runs.
For more details see https://tracker.ceph.com/issues/66289.
Therefore, as an interim measure until the issue with multiple OSD shards
(or multiple mClock queues per OSD) is investigated and fixed, the
following change to the default HDD OSD shard configuration is made:
The other changes in this commit include:
- Doc change to the OSD and mClock config reference describing
this change.
- OSD troubleshooting entry on the procedure to change the shard
configuration for clusters affected by this issue running on older
releases.
- Add release note for this change.
Conflicts:
doc/rados/troubleshooting/troubleshooting-osd.rst
- Included the troubleshooting entry before the "Flapping OSDs" section.
PendingReleaseNotes
- Moved the release note under 18.2.4 section and removed unrelated entries
Zac Dover [Sat, 23 Nov 2024 12:32:13 +0000 (22:32 +1000)]
doc/cephadm: Clarify "Deploying a new Cluster"
Change the title of the section "Deploying a new Ceph cluster" to "Using
cephadm to Deploy a New Ceph Cluster". This is part of the initiative to
separate package-related documentation from container-based
documenation.
Zac Dover [Mon, 11 Nov 2024 23:31:28 +0000 (09:31 +1000)]
doc/rados: correct "full ratio" note
Correct a note that directed users not to add an OSD after the cluster
has reached its "full ratio". The note now says "Do not let your cluster
reach its full ratio before adding an OSD."
Hat tip: Oskar Berggren
Fixes: https://tracker.ceph.com/issues/68900 Co-authored-by: Oskar Berggren <oskar.berggren@gmail.com> Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit f1a2637c79a15c26a769661dd72ca68d766b2f0d)