Ilya Dryomov [Mon, 22 Mar 2021 18:16:32 +0000 (19:16 +0100)]
auth/cephx: rotate auth tickets less often
If unauthorized global_id (re)use is disallowed, a client that has
been disconnected from the network long enough for keys to rotate
and its auth ticket to expire (i.e. become invalid/unverifiable)
would not be able to reconnect.
The default TTL is 12 hours, resulting in a 12-24 hour reconnect
window (the previous key is kept around, so the actual window can be
up to double the TTL). The setting has stayed the same since 2009,
but it also hasn't been enforced. Bump it to get a 72 hour reconnect
window to cover for something breaking on Friday and not getting fixed
until Monday.
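The window arithmetic above can be sketched as follows. This is a minimal illustration, not the monitor's code: the helper name is hypothetical, and the 72 hour TTL is an assumption chosen so that the shortest window matches the 72 hours mentioned above.

```python
from datetime import timedelta
from typing import Tuple


def reconnect_window(ttl: timedelta) -> Tuple[timedelta, timedelta]:
    """Return the (shortest, longest) reconnect window for a ticket TTL."""
    # A ticket stays verifiable for at least one TTL. Because the previous
    # key is kept around, it can stay verifiable for up to two TTLs.
    return ttl, 2 * ttl


# Old default from the commit message: 12 hours.
old_low, old_high = reconnect_window(timedelta(hours=12))
# Assumed new TTL of 72 hours, so the shortest window is 72 hours.
new_low, new_high = reconnect_window(timedelta(hours=72))
```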
Ilya Dryomov [Thu, 25 Mar 2021 19:59:13 +0000 (20:59 +0100)]
mon: fail fast when unauthorized global_id (re)use is disallowed
When unauthorized global_id (re)use is disallowed, we don't want to
let unpatched clients in because they wouldn't be able to reestablish
their monitor session later, resulting in subtle hangs and disrupted
user workloads.
Denying the initial connect for all legacy (CephXAuthenticate < v3)
clients is not feasible because a large subset of them never stopped
presenting their ticket on reconnects and are therefore compatible with
enforcing mode: most notably all kernel clients but also pre-luminous
userspace clients. They don't need to be patched and excluding them
would significantly hamper the adoption of enforcing mode.
Instead, force clients that we are not sure about to reconnect shortly
after they go through authentication and obtain global_id. This is
done in Monitor::dispatch_op() to capture both msgr1 and msgr2, most
likely instead of dispatching mon_subscribe.
We need to let mon_getmap through for "ceph ping" and "ceph tell" to
work. This does mean that we share the monmap, which lets the client
return from MonClient::authenticate() considering authentication to be
finished and causing the potential reconnect error to not propagate to
the user -- the client would hang waiting for remaining cluster maps.
For msgr1, this is unavoidable because the monmap is sent immediately
after the final MAuthReply. But for msgr2 this is rare: most of the
time we get to their mon_subscribe and cut the connection before they
process the monmap!
Regardless, the user doesn't get a chance to start a workload since
there is no proper higher-level session at that point.
To help with identifying clients that need patching, add global_id and
global_id_status to "sessions" output.
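The gating described above could look roughly like this. It is a simplified model, not the code in Monitor::dispatch_op(): the Session class, the "unsure" flag and the returned strings are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    global_id: int
    # Hypothetical flag: True while the monitor is not sure whether this
    # client presents its ticket on reconnects.
    unsure: bool


def dispatch_decision(session: Session, msg_type: str) -> str:
    """Decide what to do with a message on an authenticated session."""
    if not session.unsure:
        return "dispatch"
    if msg_type == "mon_getmap":
        # Must be let through for "ceph ping" and "ceph tell" to work.
        return "dispatch"
    # Anything else (most likely mon_subscribe) is not dispatched; the
    # connection is cut so the client has to reconnect right away.
    return "force_reconnect"
```

A client that reestablishes its session after the forced reconnect is compatible with enforcing mode; one that cannot fails before any workload is started.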
Ilya Dryomov [Sat, 13 Mar 2021 13:53:52 +0000 (14:53 +0100)]
auth/cephx: option to disallow unauthorized global_id (re)use
global_id is a cluster-wide unique id that must remain stable for the
lifetime of the client instance. The cephx protocol has a facility to
allow clients to preserve their global_id across reconnects:
(1) the client should provide its global_id in the initial handshake
message/frame and later include its auth ticket proving previous
possession of that global_id in CEPHX_GET_AUTH_SESSION_KEY request
(2) the monitor should verify that the included auth ticket is valid
and has the same global_id and, if so, allow the reclaim
(3) if the reclaim is allowed, the new auth ticket should be
encrypted with the session key of the included auth ticket to
ensure authenticity of the client performing reclaim. (The
included auth ticket could have been snooped when the monitor
originally shared it with the client or any time the client
provided it back to the monitor as part of requesting service
tickets, but only the genuine client would have its session key
and be able to decrypt.)
Unfortunately, all (1), (2) and (3) have been broken for a while:
- (1) was broken in 2016 by commit a2eb6ae3fb57 ("mon/monclient:
hunt for multiple monitor in parallel") and is addressed in patch
"mon/MonClient: preserve auth state on reconnects"
- it turns out that (2) has never been enforced. When cephx was
being designed and implemented in 2009, two changes to the protocol
raced with each other pulling it in different directions: commits
0669ca21f4f7 ("auth: reuse global_id when requesting tickets")
and fec31964a12b ("auth: when renewing session, encrypt ticket")
added the reclaim mechanism based strictly on auth tickets, while
commit 5eeb711b6b2b ("auth: change server side negotiation a bit")
allowed the client to provide global_id in the initial handshake.
These changes didn't get reconciled and as a result a malicious
client can assign itself any global_id of its choosing by simply
passing something other than 0 in MAuth message or AUTH_REQUEST
frame and not even bother supplying any ticket. This includes
getting a global_id that is being used by another client.
- (3) was broken in 2019 with addition of support for msgr2, where
the new auth ticket ends up being shared unencrypted. However the
root cause is deeper and a malicious client can coerce msgr1 into
the same. This also goes back to 2009 and is addressed in patch
"auth/cephx: ignore CEPH_ENTITY_TYPE_AUTH in requested keys".
Because (2) has never been enforced, no one noticed when (1) got
broken and we began to rely on this flaw for normal operation in
the face of reconnects due to network hiccups or otherwise. As of
today, only pre-luminous userspace clients and kernel clients are
not exercising it on a daily basis.
Bump CephXAuthenticate version and use a dummy v3 to distinguish
between legacy clients that don't (may not) include their auth ticket
and new clients. For new clients, unconditionally disallow claiming
global_id without a corresponding auth ticket. For legacy clients,
introduce a choice between permissive (current behavior, default for
the foreseeable future) and enforcing mode.
If the reclaim is disallowed, return EACCES. While MonClient does
have some provision for global_id changes and we could conceivably
implement enforcement by handing out a fresh global_id instead of
the provided one, those code paths have never been tested and there
are too many ways a sudden global_id change could go wrong.
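The resulting policy can be summarized with a small decision function. This is a sketch under assumptions: the function and parameter names are hypothetical, ticket verification is reduced to a boolean, and the real check lives in the monitor's cephx handler.

```python
import errno
from typing import Optional


def check_reclaim(claimed_global_id: int,
                  ticket_global_id: Optional[int],
                  ticket_valid: bool,
                  authenticate_version: int,
                  enforcing: bool) -> int:
    """Return 0 if the global_id claim is allowed, -EACCES otherwise."""
    if claimed_global_id == 0:
        # Nothing is being reclaimed; the monitor assigns a fresh id.
        return 0
    if ticket_valid and ticket_global_id == claimed_global_id:
        # Valid ticket proving previous possession of this global_id.
        return 0
    if authenticate_version >= 3:
        # New clients: always disallowed without a corresponding ticket.
        return -errno.EACCES
    # Legacy clients: allowed in permissive mode, denied in enforcing mode.
    return -errno.EACCES if enforcing else 0
```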
Ilya Dryomov [Tue, 9 Mar 2021 15:33:55 +0000 (16:33 +0100)]
auth/AuthServiceHandler: keep track of global_id and whether it is new
AuthServiceHandler already has global_id field, but it is unused.
Revive it and let the handler know whether global_id is newly assigned
by the monitor or provided by the client.
Lift the setting of entity_name into AuthServiceHandler.
Conflicts:
src/mon/MonClient.cc [ commit 1e9b18008c5e ("mon: set
MonClient::_add_conn return type to void") not in nautilus ]
src/mon/MonClient.h [ ditto ]
Destroying AuthClientHandler and not resetting global_id is another
way to get MonClient to send CEPHX_GET_AUTH_SESSION_KEY requests with
CephXAuthenticate::old_ticket not populated. This is particularly
pertinent to get_monmap_and_config() which shuts down the bootstrap
MonClient between retry attempts.
Ilya Dryomov [Mon, 8 Mar 2021 14:37:02 +0000 (15:37 +0100)]
mon/MonClient: preserve auth state on reconnects
Commit a2eb6ae3fb57 ("mon/monclient: hunt for multiple monitor in
parallel") introduced a regression where auth state (global_id and
AuthClientHandler) was no longer preserved on reconnects. The ensuing
breakage was quickly noticed and prompted a follow-on fix 8bb6193c8f53
("mon/MonClient: persist global_id across re-connecting").
However, as evident from the subject, the follow-on fix only took
care of the global_id part. AuthClientHandler is still destroyed
and all cephx tickets are discarded. A new from-scratch instance
is created for each MonConnection and CEPHX_GET_AUTH_SESSION_KEY
requests end up with CephXAuthenticate::old_ticket not populated.
The bug is in MonClient, so both msgr1 and msgr2 are affected.
This should have resulted in a similar sort of breakage but didn't
because of a much larger bug. The monitor should have denied the
attempt to reclaim global_id with no valid ticket proving previous
possession of that global_id presented. Alas, it appears that this
aspect of the cephx protocol has never been enforced. This is dealt
with in the next patch.
To fix the issue at hand, clone AuthClientHandler into each
MonConnection so that each respective CEPHX_GET_AUTH_SESSION_KEY
request gets a copy of the current auth ticket.
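The idea of the fix can be modeled as follows. This is an illustrative sketch only: the Python classes merely borrow the names of their C++ counterparts, and start_hunting() and session_key_request() are hypothetical helpers.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuthClientHandler:
    global_id: int = 0
    ticket: Optional[bytes] = None


class MonConnection:
    def __init__(self, addr: str, auth: AuthClientHandler):
        self.addr = addr
        # Clone instead of creating a from-scratch handler, so the
        # connection starts out with the current auth ticket.
        self.auth = copy.deepcopy(auth)

    def session_key_request(self) -> dict:
        # Stand-in for a CEPHX_GET_AUTH_SESSION_KEY request.
        return {"global_id": self.auth.global_id,
                "old_ticket": self.auth.ticket}


def start_hunting(auth: AuthClientHandler,
                  addrs: List[str]) -> List[MonConnection]:
    """Open one connection per monitor, each with its own handler copy."""
    return [MonConnection(addr, auth) for addr in addrs]
```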
Ilya Dryomov [Sat, 6 Mar 2021 10:15:40 +0000 (11:15 +0100)]
mon/MonClient: claim active_con's auth explicitly
Eliminate confusion by moving auth from active_con into MonClient
instead of swapping them.
The existing MonClient::auth can be destroyed right away -- I don't
see why active_con would need it or a reason to delay its destruction
(which is what stashing in active_con effectively does).
Kefu Chai [Sat, 6 Mar 2021 16:32:42 +0000 (00:32 +0800)]
.github: correct the regex in milestone workflow
Also use the pull_request_target event so the action is run in the
context of the base of the pull request. This helps us overcome
the "Resource not accessible by integration" issue, which occurs when
the action is run in the context of the pull request.
Igor Fedotov [Fri, 26 Feb 2021 14:16:11 +0000 (17:16 +0300)]
os/bluestore: go beyond pinned onodes while trimming the cache.
One might observe a lack of cache trimming when there is a bunch of
pinned entries at the top of the Onode cache LRU list. If these entries
stay pinned for a long time, the cache might start using too much
memory, causing the OSD to exceed its osd-memory-target limit. The
pinned state tends to occur for osdmap onodes.
The proposed patch preserves the last trim position in the LRU list (if
it pointed to a pinned entry) and resumes trimming from that position
if it hasn't been invalidated. The LRU nature of the list makes this
safe, since no new entries appear above the previously present entry
while it is not touched.
Fixes: https://tracker.ceph.com/issues/48729
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
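A simplified model of trimming past pinned entries is shown below. It is not the BlueStore implementation: it steps over pinned entries on every pass and omits the saved trim position described above, and the cache layout (key mapped to a pinned flag) is an assumption made for the sketch.

```python
from collections import OrderedDict
from typing import List


def trim(cache: "OrderedDict[str, bool]", target_size: int) -> List[str]:
    """Evict unpinned entries until the cache holds target_size entries.

    cache maps key -> pinned flag and is ordered from least to most
    recently used, so iteration starts at the trim end of the list.
    """
    evicted = []
    for key in list(cache.keys()):
        if len(cache) <= target_size:
            break
        if cache[key]:
            # Pinned: step over it instead of stopping the trim here.
            continue
        del cache[key]
        evicted.append(key)
    return evicted


cache = OrderedDict([("osdmap", True), ("a", False),
                     ("b", False), ("c", False)])
evicted = trim(cache, 2)
```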
Adam Kupczyk [Sat, 30 Jan 2021 11:57:05 +0000 (12:57 +0100)]
os/bluestore: Add option to check BlueFS reads
Add option "bluefs_check_for_zeros" to check whether a read returned
any zero-filled pages. If so, reread the data. BlueStore is known to
occasionally get such pages.
See "bluestore_retry_disk_reads".
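The check-and-reread idea can be sketched like this. The page size, the single retry and the helper names are assumptions for illustration; the actual BlueFS code may differ in all of them.

```python
from typing import Callable

PAGE_SIZE = 4096  # assumed page size for this sketch


def has_zero_page(buf: bytes, page_size: int = PAGE_SIZE) -> bool:
    """Return True if any whole page in buf consists only of zero bytes."""
    zero = bytes(page_size)
    last_start = len(buf) - page_size
    return any(buf[off:off + page_size] == zero
               for off in range(0, last_start + 1, page_size))


def checked_read(read_fn: Callable[[int, int], bytes],
                 offset: int, length: int, retries: int = 1) -> bytes:
    """Read data and reread it if a zero-filled page shows up."""
    data = read_fn(offset, length)
    for _ in range(retries):
        if not has_zero_page(data):
            break
        data = read_fn(offset, length)
    return data
```

A legitimately zero-filled page is indistinguishable from a bad read here, so the reread only costs an extra read in that case.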
- docstring added to describe the link to mgr/prometheus conflicted with the
const fmt definition for the message. resolved by adding doc under the const
definition.
Paul Cuzner [Thu, 8 Oct 2020 03:30:56 +0000 (16:30 +1300)]
mgr/prometheus: Add healthcheck metric for SLOW_OPS
SLOW_OPS is triggered by op tracker, and generates a health
alert but healthchecks do not create metrics for prometheus to
use as alert triggers. This change adds SLOW_OPS metric, and
provides a simple means to extend to other relevant health
checks in the future.
If extracting the value from the health check message fails,
we log an error and remove the metric from the metric set. In
addition the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.
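The extract-or-drop behaviour can be illustrated as below. This is a sketch, not the mgr/prometheus code: the function name, the metrics dictionary and the assumed message shape ("<N> slow ops, ...") are illustrative assumptions.

```python
import logging
import re

log = logging.getLogger("slow_ops_sketch")


def update_slow_ops(metrics: dict, message: str) -> None:
    """Set the SLOW_OPS metric from a health check message, or drop it."""
    # Assumed message shape: "<N> slow ops, ..."
    match = re.match(r"\s*(\d+)\s+slow ops", message)
    if match:
        metrics["SLOW_OPS"] = int(match.group(1))
    else:
        log.error("failed to extract slow ops value from %r", message)
        # Remove the metric rather than keep reporting a stale value.
        metrics.pop("SLOW_OPS", None)
```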
Nathan Cutler [Thu, 25 Feb 2021 20:50:20 +0000 (21:50 +0100)]
common/mempool: include standard thread library
Attempt to address FTBFS:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/test_mempool.cc:399:11: error: request for member 'clear' in 'workers', which is of non-class type 'int'
399 | workers.clear();
| ^~~~~
Igor Fedotov [Fri, 5 Feb 2021 11:03:48 +0000 (14:03 +0300)]
os/bluestore: fix huge(>4GB) writes from RocksDB to BlueFS.
Fixes: https://tracker.ceph.com/issues/49168
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit 5f94883ec8d64c02b2bb499caad8eaf91dd715f7)
Conflicts:
(lack of bufferlist refactor from https://github.com/ceph/ceph/pull/36754)
(lack of single allocator support from https://github.com/ceph/ceph/pull/30838)
src/os/bluestore/BlueFS.h
src/test/objectstore/test_bluefs.cc
Jianpeng Ma [Mon, 10 Aug 2020 07:56:13 +0000 (15:56 +0800)]
os/bluestore/BlueRocksEnv: Avoid flushing too much data at once.
The _flush function already checks the length: if the length of dirty
data is less than bluefs_min_flush_size, the flush is skipped.
In practice, however, rocksdb can call Append() many times and then
call Flush(). This makes the flushed data much larger than
bluefs_min_flush_size.
In my test, this patch reduces the 99.99th percentile latency (from
145.753ms to 20.474ms) for 4k randwrite with bluefs_buffered_io=true.
Because Bluefs::flush acquires a lock, we add a new API, try_flush,
to avoid lock contention.
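The effect of flushing from the append path can be sketched with a toy writer. This is an assumption-laden model: the class and its sink callback are hypothetical, and the locking that motivates try_flush in the real code is left out entirely.

```python
from typing import Callable


class BufferedWriter:
    """Toy writer that flushes once enough dirty data has accumulated."""

    def __init__(self, min_flush_size: int, sink: Callable[[bytes], None]):
        self.min_flush_size = min_flush_size
        self.sink = sink
        self.buf = bytearray()

    def append(self, data: bytes) -> None:
        self.buf += data
        # Flush as soon as the threshold is crossed, so many appends do
        # not pile up into one huge flush later.
        self.try_flush()

    def try_flush(self) -> None:
        if len(self.buf) >= self.min_flush_size:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.sink(bytes(self.buf))
            self.buf.clear()
```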
Kotresh HR [Fri, 19 Feb 2021 11:27:23 +0000 (16:57 +0530)]
mgr/volumes: Bump up AuthMetadataManager's version
With ceph_volume_client and mgr-volumes co-existing
for some time, the versions of both need to be the same.
ceph_volume_client version <= 5 can't decode the
'subvolumes' key in the auth-metadata file. Hence, to
handle the version incompatibility, the version of
ceph_volume_client is bumped up to 6 and the same
needs to be done in mgr-volume's AuthMetadataManager.
Kotresh HR [Fri, 19 Feb 2021 11:12:33 +0000 (16:42 +0530)]
pybind/ceph_volume_client: Bump up the version and compat_version to 6
With the 'volumes' key updated to 'subvolumes', ceph_volume_client
version <= 5 can't decode the auth-metadata file. Hence bump the
ceph_volume_client version and compat_version to 6.
Kotresh HR [Mon, 15 Feb 2021 16:26:51 +0000 (21:56 +0530)]
pybind/ceph_volume_client: Update the 'volumes' key to 'subvolumes' in auth metadata file
Older auth metadata files, from before the nautilus release, store
the authorized subvolumes under the 'volumes' key. As the
notion of 'subvolumes' was brought in by mgr/volumes, it makes
sense to use the 'subvolumes' key. This patch transparently
updates the 'volumes' key to 'subvolumes', and newer auth metadata
files store them under the 'subvolumes' key.
Also fail the deauthorize if the auth-id doesn't exist.
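The transparent key update amounts to something like the following. It is a sketch on a plain dictionary: the function name is hypothetical and the real auth metadata file carries more fields than shown here.

```python
def upgrade_auth_meta(auth_meta: dict) -> dict:
    """Rename the legacy 'volumes' key to 'subvolumes' in place."""
    if "volumes" in auth_meta and "subvolumes" not in auth_meta:
        auth_meta["subvolumes"] = auth_meta.pop("volumes")
    return auth_meta


# Hypothetical, heavily simplified metadata from an older release.
old_meta = {"volumes": {"grp/vol1": {"access_level": "rw"}}}
new_meta = upgrade_auth_meta(old_meta)
```

Files that already use the 'subvolumes' key pass through unchanged, so running the update more than once is harmless.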
Matthew Vernon [Thu, 4 Feb 2021 11:41:14 +0000 (11:41 +0000)]
rgw/radosgw-admin clarify error when email address already in use
The error message if you try and create an S3 user with an email
address that is already associated with another S3 account is very
confusing; this patch makes it much clearer.
To reproduce:
radosgw-admin user create --uid=foo --display-name="Foo test" --email=bar@domain.invalid
radosgw-admin user create --uid=test --display-name="AN test" --email=bar@domain.invalid
could not create user: unable to parse parameters, user id mismatch, operation id: foo does not match: test
With this patch:
radosgw-admin user create --uid=test --display-name="AN test" --email=bar@domain.invalid
could not create user: unable to create user test because user id foo already exists with email bar@domain.invalid
Fixes: https://tracker.ceph.com/issues/49137
Fixes: https://tracker.ceph.com/issues/19411
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 05318d6f71e45a42a46518a0ef17047dfab83990)
It appears that commit 6eb8f30a238 broke the test utility and
its failure was masked by the test case that expected a failure
due to a timeout force-killing the app.
Fixes: https://tracker.ceph.com/issues/49117
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 8643b046fb4d5b05b4c75b83f16cd8ccc6a8b0a0)
Conflicts:
qa/workunits/rbd/rbd_mirror_helpers.sh
- no show_diff function in nautilus
qa/tasks/ceph_manager: use s/ByteIO/StringIO in stdout for ceph-objectstore-tool
On master, we have moved to using run_ceph_objectstore_tool, which uses
StringIO for stdout and stderr. To make the changes compatible with
nautilus, replace the use of ByteIO with StringIO.