Sébastien Han [Mon, 28 Jun 2021 16:49:15 +0000 (18:49 +0200)]
src/ceph-crash.in: print stderr if available
This is not perfect, but we have seen cases where the CLI returns 0 even
on failure.
For instance:
```sh
[root@rook-ceph-crashcollector-compute-1-66bdfbd886-d2zcd /]# ceph -n client.crash crash post -i /var/lib/ceph/crash/2021-06-28T07\:47\:37.859766Z_6ffb119c-930e-4047-9cfa-a92af783cdd0/meta
malformed crash metadata: time data '2021-06-28T07:47:37.859766' does not match format '%Y-%m-%d %H:%M:%S.%f'
[root@rook-ceph-crashcollector-compute-1-66bdfbd886-d2zcd /]# echo $?
0
```
So until we find the root cause, let's mitigate and perhaps accommodate
future similar issues.
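A minimal sketch of the mitigation, not the actual ceph-crash code: capture the child's stderr and report it even when the exit status is 0. The helper name `run_and_report` and the stand-in command `fake_cli` are hypothetical and exist only for illustration.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (returncode, stderr_text).

    Anything the command wrote to stderr is printed, even when the
    exit status claims success.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    stderr = proc.stderr.strip()
    if stderr:
        print("WARNING: stderr from %r: %s" % (cmd[0], stderr))
    return proc.returncode, stderr


# Hypothetical stand-in for a CLI that reports an error but exits 0.
fake_cli = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('malformed crash metadata\\n')",
]

if __name__ == "__main__":
    run_and_report(fake_cli)
```

With this shape, a caller can still branch on the return code while the operator sees the error text in the log.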
Sébastien Han [Mon, 28 Jun 2021 16:36:51 +0000 (18:36 +0200)]
src/ceph-crash.in: fix --name usage
Previously, when --name was used, the code would still go through the
initial list. Now, if --name is passed, only this user will be used to
upload crashes.
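The intended behaviour can be sketched as follows. This is an illustration under assumptions, not the code from ceph-crash.in: the contents of `DEFAULT_AUTH_NAMES` and the function name `auth_names_to_try` are hypothetical.

```python
import argparse

# Hypothetical initial list of users to try when --name is not given.
DEFAULT_AUTH_NAMES = ["client.crash.host1", "client.crash", "client.admin"]


def auth_names_to_try(argv):
    """Return the list of users to try when uploading crashes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", default=None)
    args = parser.parse_args(argv)
    if args.name:
        # --name was passed: use only this user and skip the initial list.
        return [args.name]
    return list(DEFAULT_AUTH_NAMES)
```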
Ronen Friedman [Sun, 8 Aug 2021 14:14:55 +0000 (14:14 +0000)]
osd/scrub: only telling the scrubber of 'updates' events if these events are waited for
Only the Primary, and only when there is active scrubbing going on and the
scrubber state machine is in the WaitUpdates state, should be notified of 'updates'.
The extra messages we queued and processed earlier, apart from creating redundant
log lines and wasting CPU time, were increasing the chance of a race in the handling
of stale scrubs.
Zac Dover [Wed, 9 Mar 2022 15:26:43 +0000 (01:26 +1000)]
doc/start: remove osd stub from hardware recs
This PR removes a stub section called "OSDS (CEPH-OSD)"
from the hardware-recommendations.rst page. This section
had no content in it.
This PR is made in lieu of a backport. (I had thought
that this section was removed by another PR that was
a backport, but I was wrong. So this fixes it.)
Ilya Dryomov [Tue, 8 Mar 2022 12:56:15 +0000 (13:56 +0100)]
test/librbd/test_notify.py: effect post object map rebuild assert
Instead of just optionally skipping update_features test, commit 9c0b239d70cd ("qa/upgrade: conditionally disable update_features
tests") moved it after rebuild_object_map test. This isn't right
because update_features test invalidates the object map as a side
effect and rebuild_object_map test is what makes it valid again:
Zac Dover [Thu, 24 Feb 2022 07:22:42 +0000 (17:22 +1000)]
doc/start: include A. D'Atri's hardware-recs recs
This PR restores material about partition alignment
and material about separating OS and OSD data that
was removed in an earlier rewrite. The restoration
of this information was requested by Anthony D'Atri in
https://github.com/ceph/ceph/pull/45123/
This PR also includes several refinements to the language
that could not be made to this text until now, owing to my
(Zac's) ignorance and illiteracy.
I call upon Mark Nelson (and anyone else with sufficient
command of the current state of storage technology) to advise
me on whether the Ceph Foundation feels comfortable in the year
2022 referring to QLC as an emerging technology.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) more notes and revisions
Kefu Chai [Sat, 5 Mar 2022 04:49:57 +0000 (12:49 +0800)]
cmake: pass RTE_DEVEL_BUILD=n when building dpdk
Ceph is still using the Makefile-based build system for building
DPDK, and DPDK enables -Werror if RTE_DEVEL_BUILD is 'y', which is
the default when DPDK is built from a git repo.
Newer GCC is pickier than older versions. To prevent
a possible FTBFS when we switch to a newer GCC for building old
branches whose dpdk submodule might not include the changes addressing
those warnings, let's just disable this option.
The only effect of this option is to add -Werror to CFLAGS, and
build warnings from DPDK are not our focus when developing
Ceph in most cases, so this should be fine.
Redouane Kachach [Tue, 25 Jan 2022 17:27:56 +0000 (18:27 +0100)]
mgr/cephadm: Adding logic to cleanup several dirs after an rm-cluster
Fixes: https://tracker.ceph.com/issues/53010
https://tracker.ceph.com/issues/53815
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
J. Eric Ivancich [Wed, 12 Jan 2022 18:41:42 +0000 (13:41 -0500)]
rgw: fix bucket index list minor calculation bug
When "bucket index list" traverses the different regions in the bucket
index assembling the output, it miscalculates how many entries to ask
for at one point. This fixes that.
This fixes previous "rgw: bucket index list can produce I/O errors".
Credit for finding this bug goes to Soumya Koduri <skoduri@redhat.com>.
J. Eric Ivancich [Mon, 19 Jul 2021 18:24:11 +0000 (14:24 -0400)]
rgw: allow ordered bucket listing to work when many filtered out entries
A previous PR moved much of the filtering that's part of bucket
listing to the CLS layer. One unanticipated result was that it is now
possible for a call to return 0 entries. In such a case we want to
retry the call with the marker moved forward (i.e., advanced),
repeatedly if necessary, in order to either retrieve some entries or
to hit the end of the entries. This PR adds that functionality.
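The retry loop described above can be sketched as follows. This is a simplified model, not the RGW implementation: `filtered_chunk` is a hypothetical stand-in for the filtering list call, and the sorted key list stands in for a bucket index.

```python
import bisect


def filtered_chunk(keys, marker, chunk, keep):
    """Hypothetical stand-in for a list call that filters on the server side.

    Examines up to `chunk` keys after `marker` and returns
    (kept_keys, new_marker, more_remaining). kept_keys may be empty
    even though more keys remain.
    """
    start = bisect.bisect_right(keys, marker)
    window = keys[start:start + chunk]
    next_marker = window[-1] if window else marker
    more = start + chunk < len(keys)
    return [k for k in window if keep(k)], next_marker, more


def list_with_advancing_retry(keys, chunk, keep):
    """Retry with the marker advanced until entries arrive or the end is hit."""
    marker = ""
    while True:
        found, marker, more = filtered_chunk(keys, marker, chunk, keep)
        if found or not more:
            return found, marker
```

Because the marker moves forward on every empty result, the loop terminates once the end of the key list is reached.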
J. Eric Ivancich [Mon, 19 Jul 2021 18:23:42 +0000 (14:23 -0400)]
rgw: allow CLSRGWConcurrentIO to handle "advancing" retries
When doing an asynchronous/concurrent bucket index operation against
multiple bucket index shards, a special error code is set aside to
indicate that an "advancing" retry of a/some shard(s) is necessary. In
that case another asynchronous call is made on the indicated shard(s)
from the client (i.e., CLSRGWConcurrentIO). It is up to the subclass
of CLSRGWConcurrentIO to handle the retry such that it "advances" and
doesn't simply get stuck, looping forever.
The retry functionality only works when the "need_multiple_rounds"
functionality is not in use.
J. Eric Ivancich [Fri, 16 Jul 2021 19:31:35 +0000 (15:31 -0400)]
rgw: de-conflate shard_id and request_id in CLSRGWConcurrentIO
When using asynchronous (concurrent) IO for bucket index requests,
there are two int ids that are used that need to be kept separate --
shard id and request id. In many cases they're the same -- shard 0
gets request 0, and so forth.
But in preparation for re-requests, those ids can diverge, where
request 13 maps to shard 2. The existing code maintained the OIDs that
went with each request. This PR maintains the shard id as
well. Documentation has been beefed up to help future developers
navigate this.
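To illustrate how the two ids can diverge, here is a minimal sketch that keeps both the shard id and the OID for every request id. The class name `RequestTable`, its methods, and the OID strings are hypothetical and are not taken from CLSRGWConcurrentIO.

```python
class RequestTable:
    """Track which shard (and OID) each request id refers to."""

    def __init__(self, shard_oids):
        # shard_oids is a list indexed by shard id.
        self._requests = {}
        self._next_request_id = 0
        # Initially request i targets shard i.
        for shard_id in range(len(shard_oids)):
            self._add(shard_id, shard_oids[shard_id])
        self._shard_oids = list(shard_oids)

    def _add(self, shard_id, oid):
        request_id = self._next_request_id
        self._next_request_id += 1
        self._requests[request_id] = (shard_id, oid)
        return request_id

    def add_retry(self, shard_id):
        """Issue a new request id for a shard that needs a re-request."""
        return self._add(shard_id, self._shard_oids[shard_id])

    def lookup(self, request_id):
        """Return (shard_id, oid) for a request id."""
        return self._requests[request_id]
```

After a re-request, the new request id is larger than any shard id, so the lookup is the only reliable way to get back to the shard.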
J. Eric Ivancich [Tue, 13 Jul 2021 19:36:53 +0000 (15:36 -0400)]
rgw: bucket index list produces incorrect result when non-ascii entries
A recent PR that helped address the issue of non-ascii plain entries
didn't cover all the bases, allowing I/O errors to be produced in some
circumstances during a bucket index list (i.e., `radosgw-admin bi list
...`).
This fixes those issues and does some additional clean-up.
Casey Bodley [Tue, 15 Feb 2022 23:27:10 +0000 (18:27 -0500)]
common: replace BitVector::NoInitAllocator with wrapper struct
in c++20, the deprecated `struct std::allocator<T>::rebind` template was
removed, so `BitVector` no longer compiles. without a `rebind` to
inherit, `std::allocator_traits<NoInitAllocator>::rebind_alloc<U>` was
looking for `NoInitAllocator<U>`, but it isn't a template class
further investigation found that in c++17, `vector<__u32, NoInitAllocator>`
was rebinding this `NoInitAllocator` to `std::allocator<__u32>` and
preventing the no-init optimization from taking effect
instead of messing with the allocator to avoid zero-initialization, wrap
each __u32 in a struct whose constructor does not initialize the value
Mykola Golub [Fri, 18 Feb 2022 10:42:23 +0000 (10:42 +0000)]
rbd-mirror: make mirror properly detect pool replayer needs restart
When a PoolReplayer detects a remote pool metadata change, it
sets the "stopping" flag, expecting the Mirror to restart it.
Although setting the "stopping" flag makes the PoolReplayer::run
thread terminate, the thread's is_started function will still
return true until join is called (which resets the thread id).
This made it impossible for the Mirror to detect (by calling
PoolReplayer::is_running) that the PoolReplayer needed a restart.
Yaarit Hatuka [Tue, 22 Feb 2022 19:22:09 +0000 (19:22 +0000)]
mgr/devicehealth: skip null pages when extracting wear level
Some devices have null pages in their ata_device_statistics struct; skip
those pages in order to avoid an AttributeError when extracting the device's
wear level.
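A minimal sketch of the guard, not the devicehealth module's code. The JSON layout (`pages`, `table`, `name`, `value`) and the entry name "Percentage Used Endurance Indicator" are assumptions about the smartctl output, and `wear_level` is a hypothetical helper.

```python
# Assumed name of the statistics entry that carries the wear level.
WEAR_ENTRY = "Percentage Used Endurance Indicator"


def wear_level(smart_data):
    """Return the wear level as a fraction, or None if it is not reported."""
    stats = smart_data.get("ata_device_statistics") or {}
    for page in stats.get("pages") or []:
        if page is None:
            # Null page: calling page.get() here would raise AttributeError.
            continue
        for entry in page.get("table") or []:
            if entry.get("name") == WEAR_ENTRY:
                return int(entry["value"]) / 100.0
    return None


sample = {
    "ata_device_statistics": {
        "pages": [
            None,
            {"table": [{"name": WEAR_ENTRY, "value": 5}]},
        ]
    }
}
```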
Zac Dover [Sun, 5 Dec 2021 14:44:09 +0000 (00:44 +1000)]
doc/start: remove journal info from hardware recs
This PR removes mentions of journaling from the hardware
recommendations.
Journaling was a FileStore-related practice. BlueStore is
the default backend for Ceph OSDs and has been since
Luminous. The documentation should reflect that.
Sarthak0702 [Wed, 16 Feb 2022 12:45:35 +0000 (18:15 +0530)]
mgr/dashboard: Contact Info should be visible only when Ident channel is checked
Fixes: https://tracker.ceph.com/issues/54133
Signed-off-by: Sarthak0702 <sarthak.0702@gmail.com>
(cherry picked from commit 15211a6378a6fee9316f79ba0b27821891527c38)
Conflicts:
src/pybind/mgr/dashboard/frontend/src/app/ceph/cluster/telemetry/telemetry.component.spec.ts
- removed test for perf channel, which doesn't exist in Pacific