Jan Fajerski [Wed, 18 Nov 2020 08:37:48 +0000 (09:37 +0100)]
ceph-volume inventory: make libstoragemgmt data retrieval optional
Default to not retrieving libstoragemgmt data, since it seems this can
cause serious issues on older hardware. The safest approach is to retrieve
lsm data only when the user opts in.
Fixes: https://tracker.ceph.com/issues/48270
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
(cherry picked from commit b29a54d21e314db7a9d681cf5cc089dcfcbf6dc0)
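A minimal sketch of the opt-in pattern described above, assuming an argparse-style CLI; the `--with-lsm` flag name and the `collect_lsm_data()` helper are illustrative stand-ins, not necessarily the exact ceph-volume interface:

```python
import argparse

def collect_lsm_data(device):
    """Hypothetical helper standing in for a libstoragemgmt query."""
    return {"device": device, "lsm": {}}

def inventory(argv=None):
    parser = argparse.ArgumentParser(prog="inventory-sketch")
    parser.add_argument("device", nargs="?", default=None)
    # Defaults to False: lsm data is gathered only when the user opts in,
    # since probing it can cause serious issues on older hardware.
    parser.add_argument("--with-lsm", action="store_true",
                        help="also gather libstoragemgmt data (opt-in)")
    args = parser.parse_args(argv)

    report = {"device": args.device}
    if args.with_lsm:
        report.update(collect_lsm_data(args.device))
    return report

if __name__ == "__main__":
    print(inventory())
```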
Casey Bodley [Mon, 23 Nov 2020 23:06:26 +0000 (18:06 -0500)]
rgw: temporarily disable calls to defer_gc() in RGWGetObj
cls_rgw_gc_queue_update_entry() is known to cause data loss when called
on objects that have not actually been scheduled for garbage collection.
RGWGetObj is the only caller, and uses defer_gc() when reads are taking
a long time relative to rgw_gc_obj_min_wait. If an object has since been
deleted and submitted for garbage collection, this allows RGWGetObj to
defer that gc until the entire read completes.
By disabling these calls to defer_gc(), very long reads (longer than 1hr
with the default configuration) may fail if the object gets deleted, and a
retry will result in a 404 Not Found error, as expected.
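A rough sketch of the timing decision described above, with illustrative names (`read_chunks`, `RGW_GC_OBJ_MIN_WAIT`); the deferral call is left commented out to mirror the temporary disablement:

```python
import time

RGW_GC_OBJ_MIN_WAIT = 3600.0  # stand-in for the rgw_gc_obj_min_wait setting, in seconds

def get_obj(read_chunks, now=time.monotonic):
    """Sketch of a long object read that used to defer GC part-way through."""
    start = now()
    data = []
    for chunk in read_chunks:
        data.append(chunk)
        if now() - start > RGW_GC_OBJ_MIN_WAIT / 2:
            # Previously: defer_gc() was called here so that an object deleted
            # mid-read would not be garbage-collected before the read finished.
            # Disabled for now because the underlying cls call can enqueue
            # objects that were never scheduled for GC.
            pass
    return b"".join(data)
```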
J. Eric Ivancich [Sat, 21 Nov 2020 16:10:35 +0000 (11:10 -0500)]
rgw: during GC defer, prevent new GC enqueue
With the new queue-based GC code, when a GC defer operation is
performed, it adds an "urgent" record to prevent GC from removing
objects that are still being read. It does not check whether the
objects are on the GC queue or not and that's OK for the urgent
record.
The code *also* adds a new GC entry to the queue to cause GC to occur
at a later time. However, this would be incorrect if there was no GC entry
to begin with. In such a case it would cause GC to delete tail objects
even though no user-initiated remove has happened. In other words, a
READ could cause a DELETE of tail objects and therefore data loss.
This fix prevents such a new GC entry from being enqueued, thus
preventing the data loss in this rare case. There is a new risk that
tail object orphans will be created, but as an immediate fix to prevent
data loss this is appropriate, and it is a rare event. A follow-on PR
that will handle these cases is likely.
This PR adds a level 0 log entry as a way to confirm whether this
case is being triggered in real-world deployments. In time, this log
entry should be deleted.
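A simplified sketch of the guard, assuming a queue-like GC interface with `add_urgent_record()`, `contains()` and `enqueue()` methods; the names are illustrative, not the cls_rgw API:

```python
import logging

log = logging.getLogger(__name__)

def defer_gc_entry(gc_queue, obj, deferral_time):
    """Sketch: defer GC for an object that is still being read.

    The urgent record is always written so GC will not remove the tail
    objects mid-read. A new GC entry is re-enqueued only if the object
    already had one; otherwise a plain READ could effectively schedule
    a DELETE of tail objects that were never submitted for GC.
    """
    gc_queue.add_urgent_record(obj, deferral_time)

    if gc_queue.contains(obj):
        # already scheduled for GC; push it back so it is collected later,
        # after the in-flight read has finished
        gc_queue.enqueue(obj, deferral_time)
    else:
        # level-0 style log entry to confirm this path is hit in the wild
        log.warning("GC defer requested for %r, which has no GC entry; "
                    "not enqueueing a new one", obj)
```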
Ilya Dryomov [Fri, 16 Oct 2020 10:57:50 +0000 (12:57 +0200)]
mon/MonClient: bring back CEPHX_V2 authorizer challenges
Commit c58c5754dfd2 ("msg/async/ProtocolV1: use AuthServer and
AuthClient") introduced a backwards compatibility issue into msgr1.
To fix it, commit 321548010578 ("mon/MonClient: skip CEPHX_V2
challenge if client doesn't support it") set out to skip authorizer
challenges for peers that don't support CEPHX_V2. However, it
made it so that authorizer challenges are skipped for all peers in
both msgr1 and msgr2 cases, effectively disabling the protection
against replay attacks that was put in place in commit f80b848d3f83
("auth/cephx: add authorizer challenge", CVE-2018-1128).
This is because con->get_features() always returns 0 at that
point. In the msgr1 case, the peer shares its features along with the
authorizer, but while they are available in connect_msg.features they
aren't assigned to con until ProtocolV1::open(). In the msgr2 case, the
peer doesn't share its features until much later (in the CLIENT_IDENT
frame, i.e. after the authentication phase). The result is that the
!CEPHX_V2 branch is taken in all cases and replay attack protection
is lost.
Only clusters with cephx_service_require_version set to 2 on the
service daemons would not be silently downgraded. But, since the
default is 1 and there are no reports of looping on BADAUTHORIZER
faults, I'm pretty sure that no one has ever done that. Note that
cephx_require_version set to 2 would have no effect even though it
is supposed to be stronger than cephx_service_require_version
because MonClient::handle_auth_request() didn't check it.
To fix:
- for msgr1, check connect_msg.features (as was done before commit c58c5754dfd2) and challenge if CEPHX_V2 is supported. Together
with two preceding patches that resurrect proper cephx_* option
handling in msgr1, this covers both "I want old clients to work"
and "I wish to require better authentication" use cases.
- for msgr2, don't check anything and always challenge. CEPHX_V2
predates msgr2, so anyone speaking msgr2 must support it.
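A condensed sketch of the challenge decision in the two bullets above; the feature bit and helper name are invented for illustration:

```python
CEPHX_V2 = 1 << 0  # illustrative feature bit, not the real CEPH_FEATURE value

def should_challenge(msgr_version, peer_features):
    """Decide whether to send a cephx authorizer challenge.

    msgr1: the peer's features arrive with the authorizer
           (connect_msg.features), so challenge only if CEPHX_V2 is there.
    msgr2: features are unknown until CLIENT_IDENT, but CEPHX_V2 predates
           msgr2, so always challenge.
    """
    if msgr_version == 2:
        return True
    return bool(peer_features & CEPHX_V2)

assert should_challenge(2, peer_features=0) is True
assert should_challenge(1, peer_features=CEPHX_V2) is True
assert should_challenge(1, peer_features=0) is False
```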
This was added in commit 9bcbc2a3621f ("mon,msg: implement
cephx_*_require_version options") and inadvertently dropped in
commit e6f043f7d2dc ("msgr/async: huge refactoring of protocol V1").
As a result, service daemons don't enforce cephx_require_version
and cephx_cluster_require_version options and connections without
CEPH_FEATURE_CEPHX_V2 are allowed through.
(cephx_service_require_version enforcement was brought back a
year later in commit 321548010578 ("mon/MonClient: skip CEPHX_V2
challenge if client doesn't support it"), although the peer gets
TAG_BADAUTHORIZER instead of TAG_FEATURES.)
Resurrect the original behaviour: all cephx_*require_version
options are enforced and the peer gets TAG_FEATURES, signifying
that it is missing a required feature.
Ilya Dryomov [Fri, 16 Oct 2020 09:33:32 +0000 (11:33 +0200)]
msg/async/ProtocolV1: resurrect "include MGR as service when applying cephx settings"
This was added in commit 0ec7d6bbc4af ("msg/async,simple: include MGR
as service when applying cephx settings") and inadvertently dropped in
commit e6f043f7d2dc ("msgr/async: huge refactoring of protocol V1").
As a result, mgr daemons are miscategorized as clients when enforcing
cephx_*require_signatures options.
The ceph-volume lvm batch --auto introduced by [1] breaks backward
compatibility when using only non-rotational devices (SSD and/or NVMe).
Those devices are reassigned as bluestore db or filestore journal
devices while we want them as data devices.
Kefu Chai [Sun, 8 Mar 2020 06:00:53 +0000 (14:00 +0800)]
qa/tasks/ceph_manager: use StringIO for capturing COT output
there are a couple of factors we should consider when choosing between
BytesIO and StringIO:
- if the producer is producing binary
- if we are expecting binary
- if the layers in between them are doing the decoding/encoding
automatically.
in our case, the producer is either the ChannelFile instances returned
by paramiko.SSHClient or subprocess.CompletedProcess instances returned
by subprocess.run(). the former are file-like objects opened in "r" mode,
but their contents are decoded with utf-8 when reading if
ChannelFile.FLAG_BINARY is not specified. that's why we always try to
add this flag in orchestra/run.py when collecting the stdout and stderr
from paramiko.SSHClient after executing a command.
back in python2, this works just fine, as we didn't differentiate bytes
from str back then.
but in python3, we have to make a decision. in the case of
ceph-objectstore-tool (COT for short), it does not produce binary and
we don't check its output with binary, so, if neither Remote.run() nor
LocalRemote.run() decodes/encodes for us, it's fine.
so it boils down to `copy_to_log()`:
i think we should respect the consumer's expectation, and only decode
the output if a StringIO is passed in as stdout or stderr.
as we always log the output with logging, we could either set
`ChannelFile.FLAG_BINARY` or not, depending on the type of `capture`.
if it's not set, paramiko will return str (bytes) on python2 and str on
python3. if it is set, paramiko will return str (bytes) on python2 and
bytes on python3.
if there is non-ASCII in the output, logging will fail with a
`UnicodeDecodeError` exception, and paramiko throws the same exception
when trying to decode for us if `ChannelFile.FLAG_BINARY` is not
specified.
so to ensure that we always have logging messages no matter if the
producer follows the rule of "use StringIO if you only emit text" or
not, we have to use `ChannelFile.FLAG_BINARY`, and force paramiko
to send us the bytes. but we still have the luxury of using StringIO
and doing the decode when the caller asks for str explicitly. that'd save
the pain of using `str.decode()` or `six.ensure_str()` everywhere
even when we can be sure that the program does not write binary.
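A sketch of that approach, assuming a `copy_to_log()`-style helper: always receive bytes from the producer (as if FLAG_BINARY were set), log every line with replacement for non-UTF-8, and decode into the caller's buffer only when a StringIO was passed in:

```python
import logging
from io import BytesIO, StringIO

log = logging.getLogger(__name__)

def copy_to_log(raw_stream, capture):
    """Copy a byte stream to the log and into the caller's capture buffer.

    raw_stream yields bytes (the producer is forced into binary mode), so
    logging never trips over UnicodeDecodeError; we only decode when the
    caller explicitly asked for text by passing a StringIO.
    """
    for line in raw_stream:
        log.info(line.decode("utf-8", errors="replace").rstrip("\n"))
        if isinstance(capture, StringIO):
            capture.write(line.decode("utf-8"))
        else:
            capture.write(line)

# usage: a text consumer and a binary consumer
out_text, out_bytes = StringIO(), BytesIO()
copy_to_log(iter([b"hello\n", b"world\n"]), out_text)
copy_to_log(iter([b"\xde\xad\xbe\xef"]), out_bytes)
```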
mgr/cephadm: do not configure Dashboard Ganesha settings
The Dashboard can get cluster information from the Orchestrator.
For settings that were set by previous revisions, the Dashboard will
check them and ask the user to remove them.
mgr/dashboard: support Orchestrator and user-defined Ganesha clusters
This change makes the Dashboard support two types of Ganesha clusters:
- Orchestrator clusters (Since Octopus)
- Deployed by the Orchestrator.
- The Dashboard gets the pool/namespace that stores Ganesha
configuration objects from the Orchestrator.
- The Dashboard gets the daemons in a cluster from the Orchestrator.
- User-defined clusters (Since Nautilus)
- Clusters defined by using the `ceph dashboard
set-ganesha-clusters-rados-pool-namespace` command are treated as
user-defined clusters.
- Each daemon has its own RADOS configuration objects. The
Dashboard uses these objects to deduce daemons.
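A schematic sketch of the two lookup paths, with made-up helper names standing in for the Orchestrator and Dashboard settings calls:

```python
def ganesha_clusters(orchestrator, user_defined_pool_ns):
    """Sketch: resolve Ganesha clusters from either source.

    orchestrator: object with illustrative methods for cluster ids,
        pool/namespace and daemons, or None when no Orchestrator is available.
    user_defined_pool_ns: dict of cluster name -> pool/namespace string,
        as stored via `ceph dashboard set-ganesha-clusters-rados-pool-namespace`.
    """
    clusters = {}
    if orchestrator is not None and orchestrator.available():
        for name in orchestrator.ganesha_cluster_ids():
            clusters[name] = {
                "type": "orchestrator",
                "pool_namespace": orchestrator.ganesha_pool_namespace(name),
                "daemons": orchestrator.ganesha_daemons(name),
            }
    for name, pool_ns in user_defined_pool_ns.items():
        clusters.setdefault(name, {
            "type": "user-defined",
            "pool_namespace": pool_ns,
            "daemons": None,  # deduced from per-daemon RADOS config objects
        })
    return clusters
```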
Conflicts:
src/pybind/mgr/dashboard/openapi.yaml
- We don't have the openapi-check feature in Octopus. The file
is removed in the backport.
src/pybind/mgr/dashboard/services/ganesha.py
src/pybind/mgr/dashboard/tests/test_ganesha.py
- The conflicts are mainly caused by code re-formatting in master.
Nathan Cutler [Tue, 27 Oct 2020 20:40:38 +0000 (21:40 +0100)]
doc/PendingReleaseNotes: clean up for 15.2.6
This commit drops release notes that have already been published and
organizes the remaining release notes under a heading so it is clear
they are targeting the 15.2.6 release.
Conflicts:
src/test/mon/MonMap.cc
- do not attempt to introduce boost::intrusive_ptr into Nautilus
- monmap.build_initial takes bare cct in nautilus (master: cct.get())
Dan van der Ster [Tue, 13 Oct 2020 07:08:12 +0000 (09:08 +0200)]
mds: account for closing sessions in hit_session
While stopping an mds we can reply to a request while all client
sessions are closing. We shouldn't assert in this case.
Fixes: https://tracker.ceph.com/issues/47833
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 6823d8fb619c07b4e749ae564df565eadc59c187)
Jason Dillaman [Tue, 13 Oct 2020 01:34:25 +0000 (21:34 -0400)]
librbd: update AioCompletion return value before evaluating pending count
If the pending count is decremented before the return value is updated,
there is a possibility of two ASIO threads concurrently decrementing the
pending count down from 2 -> 1 -> 0. In the second thread (the one that
performs the final decrement from 1 -> 0), it can finalize the completion
before the first thread has had a chance to update the return value.
Fixes: https://tracker.ceph.com/issues/47847
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 94f3bce53c39017028ce44a80697f55af2a82e68)
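A small sketch of the ordering constraint: the return value must be recorded before the pending count is decremented, so the thread that performs the final decrement never finalizes a completion whose value is still unset. The names below are illustrative, not librbd's API:

```python
import threading

class Completion:
    """Sketch of an aio completion finalized by whichever thread hits zero."""

    def __init__(self, pending):
        self._lock = threading.Lock()
        self._pending = pending
        self.return_value = None
        self.finalized = threading.Event()

    def complete_request(self, rval):
        # Record the return value *before* touching the pending count, so the
        # finalizing thread is guaranteed to observe it.
        with self._lock:
            if self.return_value is None or rval < 0:
                self.return_value = rval
            self._pending -= 1
            last = (self._pending == 0)
        if last:
            self.finalized.set()  # safe: rval was already recorded above

c = Completion(pending=2)
threads = [threading.Thread(target=c.complete_request, args=(0,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
assert c.finalized.is_set() and c.return_value == 0
```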
Jason Dillaman [Fri, 16 Oct 2020 15:25:39 +0000 (11:25 -0400)]
journal: possible race condition between flush and append callback
When notifying the journal recorder of an overflow or if the object
close request has completed due to no more in-flight IO, it was
possible for a race between a flush request and the processing of
an append completion to attempt to kick off duplicate notifications.
Since the overflowed and closed callbacks are properly protected from
duplicates, use a counter instead of a boolean to track possible
in-flight handler callbacks.
Fixes: https://tracker.ceph.com/issues/47880
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 458ab997fe77ea78803a34c6c9715225aa3413ba)
KeyCount should return object count + common prefix count.
See the S3 example: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_Example_5
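As a hedged illustration of the arithmetic, using the dict shape common to S3 SDK responses (the RGW response itself is XML):

```python
def expected_key_count(response):
    """KeyCount should equal the number of returned objects plus the number
    of common prefixes, as in the linked ListObjectsV2 example."""
    return len(response.get("Contents", [])) + len(response.get("CommonPrefixes", []))

# delimiter listing: one object at the top level, two "directory" prefixes
resp = {
    "Contents": [{"Key": "sample.jpg"}],
    "CommonPrefixes": [{"Prefix": "photos/"}, {"Prefix": "videos/"}],
}
assert expected_key_count(resp) == 3  # objects + common prefixes, not just 1
```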
rgw/gc: fix incrementing of the perf counter 'gc_retire_object'
in the new gc queue code for omap offload, when gc objects are deleted
from the queue. This was missed initially.
rgw/gc: fix the condition in which the marker for a queue is
always reset to empty, which causes RGWGC::list to get stuck in
a loop that is ultimately broken out of only when the queue's truncated
flag is false.
1. Also check the entries size while evaluating whether the objects cache for
a gc object should be marked as 'transitioned' in the case of cls_rgw_gc_list.
When there are no entries, we get back a return value of 0, and the
object cache is not marked as 'transitioned'.
2. Also, for the last gc object, we need to check whether the queue is still
being processed and set the correct flag.
Missing the two conditions above causes RGWGC::list to loop continuously
over the same gc object.
Xiubo Li [Tue, 20 Oct 2020 05:26:33 +0000 (01:26 -0400)]
qa/cephfs: add session_timeout option support
When the MDS is revoking the Fwbl caps, the clients need to flush
the dirty data back to the OSDs, but the flush may leave the OSDs
overloaded and slow, so it may take more than 60 seconds to
finish. The MDS daemons will then report WRN messages.
For the teuthology test cases, let's just increase the timeout
value to make it work.
jinmyeonglee [Thu, 22 Oct 2020 08:25:51 +0000 (17:25 +0900)]
vstart.sh: fix fs set max_mds bug
Fix a bug where the name used when creating a volume and the name used when setting max_mds were different.
Fixes: https://tracker.ceph.com/issues/47946
Signed-off-by: Jinmyeong Lee <jinmyeong.lee@linecorp.com>
(cherry picked from commit 6a9445c2cbe6c0c7045bfaed007cc1920ad132ed)
qa/workunits/rbd: yet another attempt to improve rbd-nbd unmap
Previously it could still race: unmap_device returned success
because the device was not found in `rbd-nbd list-mapped` (the nbd
device was removed), but the test failed because the process was still
found in the ps table.
rgw: radosgw-admin should paginate internally when listing bucket
Currently `radosgw-admin bucket list ...`, when listing a bucket, asks
for "--max-entries" entries in a single internal request. To list a large
bucket entirely the user would have to set "--max-entries" to a large value
(e.g., 10000000). Internally this doesn't paginate, so it will try to
produce the entire list at once. This can consume a lot of memory, and
there are known cases where this induces an out-of-memory crash.
So now we'll set a maximum pagination size of 10,000. Even with
large values of "--max-entries" it will still be able to produce the
full listing without stressing memory, because it will ask for at most
10,000 entries at a time.
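A sketch of the internal pagination, assuming a listing callable that takes a `marker` and `max_entries` and reports truncation; the names are illustrative:

```python
MAX_CHUNK = 10000  # internal pagination cap, regardless of --max-entries

def list_bucket(list_chunk, max_entries):
    """Fetch at most MAX_CHUNK entries per call until either max_entries
    have been gathered or the listing is no longer truncated."""
    results, marker = [], None
    while len(results) < max_entries:
        want = min(MAX_CHUNK, max_entries - len(results))
        entries, marker, truncated = list_chunk(marker=marker, max_entries=want)
        results.extend(entries)
        if not truncated:
            break
    return results

# tiny fake backend with 25 objects, just to show the chunking behaviour
def fake_chunk(marker, max_entries, _objs=[f"obj-{i:03d}" for i in range(25)]):
    start = 0 if marker is None else _objs.index(marker) + 1
    batch = _objs[start:start + max_entries]
    new_marker = batch[-1] if batch else marker
    return batch, new_marker, start + len(batch) < len(_objs)

assert len(list_bucket(fake_chunk, max_entries=10_000_000)) == 25
```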
Or Friedmann [Thu, 23 Jul 2020 15:36:07 +0000 (18:36 +0300)]
rgw: fix expiration header returned even if there is only one tag on the object matching the rule
The expiration header is now returned even if there is only one tag on the object matching the rule.
Signed-off-by: Or Friedmann <ofriedma@redhat.com>
Reported-by: Avi Mor <avmor@redhat.com>
Fixes: https://tracker.ceph.com/issues/46614
(cherry picked from commit bf7c7e59f390afb53cb1e30a440ab26bb093c11c)
rgw: rgw-orphan-list should use "plain" formatted `rados ls` output
The previous version that used "json-pretty" output for `rados ls`
added complications due to json's escaping of special characters. So
this version returns to the "plain" output for `rados ls` but deals
with entries (oids) that might have namespaces and/or locators as
well.
rgw: allow rgw-orphan-list to note when rados objects are in namespace
Currently namespaces and locators are ignored when `rados ls` is run
by rgw-orphan-list to record RADOS's known objects.
However there have been cases where RADOS objects have a locator, and
when one is included in the listing, the script does not handle it
correctly. Now when objects have locators, we will prevent their
output from entering the .intermediate file.
Additionally we do not expect RGW data objects to be in RADOS
namespaces, so when a namespaced object is detected, we'll error out
with a message.
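A sketch of the filtering policy, operating on already-parsed (namespace, oid, locator) triples so that no assumption is made about the exact `rados ls` plain output format:

```python
import sys

def filter_rados_listing(entries, intermediate, err=sys.stderr):
    """Apply the rules above to parsed `rados ls` entries.

    - Entries with a locator are kept out of the .intermediate file.
    - An entry in a RADOS namespace is unexpected for RGW data: error out.
    Returns 0 on success, 1 on error (shell-style).
    """
    for namespace, oid, locator in entries:
        if namespace:
            err.write("ERROR: object %r is in namespace %r; RGW data objects "
                      "are not expected in RADOS namespaces\n" % (oid, namespace))
            return 1
        if locator:
            continue  # skip locator-bearing objects
        intermediate.write(oid + "\n")
    return 0
```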
Conflicts:
src/pybind/mgr/dashboard/frontend/src/app/ceph/shared/smart-list/smart-list.component.html
src/pybind/mgr/dashboard/frontend/src/app/ceph/shared/smart-list/smart-list.component.ts
- Use ngx-bootstrap tabset for tabs.