Yaarit Hatuka [Wed, 25 Aug 2021 02:12:08 +0000 (02:12 +0000)]
rpm, debian: move smartmontools and nvme-cli to ceph-base
We wish to be able to scrape SMART and NVMe metrics from OSD and MON
nodes. For this we require / recommend smartmontools and nvme-cli
dependencies for both the ceph-osd and ceph-mon packages. However, the
sudoers file (which is required for invoking `smartctl` by user 'ceph')
was installed only in the ceph-osd package. Since different packages
cannot own the same file, and because we want to be able to scrape from
every daemon, we move the dependencies and the sudoers installation to
ceph-base. For generalization, we rename:
sudoers.d/ceph-osd-smartctl -> sudoers.d/ceph-smartctl
Adam Kupczyk [Sat, 13 Nov 2021 10:28:18 +0000 (11:28 +0100)]
os/bluestore: Fix omap upgrade to per-pg scheme
This is fix to regression introduced by fix to omap upgrade: https://github.com/ceph/ceph/pull/43687
The problem was that we always skipped first omap entry.
This worked fine with objects having omap header key.
For objects without header key we skipped first actual omap key.
Fixes: https://tracker.ceph.com/issues/53260 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 65a3f374aa1c57c5bb9401e57dab98a643b4360a)
Those functions should return `None` if 0 or more than 1 item is returned.
The current name of these functions are confusing and can lead to thinking that
we just want the first item returned, even though it returns more than 1
item, let's rename them to `get_single_pv()`, `get_single_vg()` and
`get_single_lv()`
The current check only allows to request an OSD id that exists but
marked as 'destroyed'.
With this small fix, we can now use `--osd-id` with an id that doesn't
exist at all.
so AvlAllocator can switch from the first-first mode to best-fit mode
without walking through the whole space map tree. in the
highly-fragmented system, iterating the whole tree could hurt the
performance of fast storage system a lot.
the idea comes from openzfs's metaslab allocator.
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 5a26875049d13130ffe5954428da0e1b9750359f) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Conflicts:
src/common/options/global.yaml.in:
- Moved new option into src/common/options.cc
so AvlAllocator can switch from the first-first mode to best-fit mode
without walking through the whole space map tree. in the
highly-fragmented system, iterating the whole tree could hurt the
performance of fast storage system a lot.
the idea comes from openzfs's metaslab allocator.
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 40f05b971f5a8064cf9819f80fc3bbf21d5206da) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Conflicts:
src/common/options/global.yaml.in
- Moved new option into src/common/options.cc
Kefu Chai [Wed, 2 Jun 2021 08:31:18 +0000 (16:31 +0800)]
os/bluestore/AvlAllocator: use cbit for counting the order of alignment
no need to calculate the alignment first, cbits() would suffice. as it
counts the first set bit and the follow 0's in a number. the result
is identical to the cbit(alignment of that number).
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 573cbb796e8ba2f433caa308925735101a8161a6) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
before this change AvlAllocator::_block_picker() is used by both the
best-fit mode and first-fit mode. but since we cannot achieve the
locality by searching in the area pointed by curosr in best-fit mode,
we just pass a dummy cursor to AvlAllocator::_block_picker() when
searching in the best-fit mode.
but since the range_size_tree is already sorted by the size of ranges,
if _block_picker() fails to find one by the size, we should just give
up right away, and instead try again using a smaller size.
after this change, instead of sharing AvlAllocator::_block_picker()
across both the first-fit mode and the best-fit mode, this method
is specialize to two different variants: one for first-fit, and the
simplified one for best-fit.
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 4837166f9e7a659742d4184f021ad12260247888) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Adam Kupczyk [Wed, 19 May 2021 10:49:37 +0000 (12:49 +0200)]
os/bluestore: Improve _block_picker function
Make _block_picker function scan (*cursor, end) + (begin, *cursor) instead of (*cursor, end) + (begin, end).
The second run over range (*cursor, end) could never yield any results.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit c732060d3e3ef96c6da06c9dde3ed8c064a50965) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Igor Fedotov [Tue, 9 Feb 2021 15:29:01 +0000 (18:29 +0300)]
os/bluestore: cap omap naming scheme upgrade transactoin.
We shouldn't use single per-onode transaction for such an upgrade when onode's omap list is huge. This results in similarly sized WAL/SST files which are inefficient, might cause high memory usage and sometimes error-prone.
Fixes: https://tracker.ceph.com/issues/49170 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit e897fa243c1dd38329733b452872616023f14ac8)
Conflicts (caused by lack of per-pg omap scheme):
src/os/bluestore/BlueStore.cc
src/os/bluestore/BlueStore.h
src/os/bluestore/bluestore_types.h
test/rgw: fix use of poll() with timers in unittest_rgw_dmclock_scheduler
the AsyncScheduler uses an asio timer to dispatch work to its executor
with an optional delay. when no delay is requested, it waits on the
timer with an expiration time in the past (crimson::dmclock::TimeZero)
tests are failing here because poll() is returning without executing the
handlers of those expired timers
asio implements these timers with timerfd and epoll. debugging with
strace, i see that these timers armed with timerfd_settime() are not
always immediately ready according to epoll_wait():
Kefu Chai [Sat, 30 Oct 2021 03:18:17 +0000 (11:18 +0800)]
admin/doc-requirements.txt: pin Sphinx at 3.5.4
* pin Sphinx at 3.5.4
* pin docutils at 0.18
at least the combination of these two versions
is known to compile.
to address the bug reported at
https://sourceforge.net/p/docutils/bugs/431/
the backtrace looks like:
/home/jenkins-build/build/workspace/ceph-pr-docs/build-doc/virtualenv/lib/python3.8/site-packages/sphinx/util/docutils.py:285:
RemovedInSphinx30Warning: function based directive support is now
deprecated. Use class based directive instead.
warnings.warn('function based directive support is now deprecated. '
Exception occurred:
File
"/home/jenkins-build/build/workspace/ceph-pr-docs/build-doc/virtualenv/lib/python3.8/site-packages/docutils/writers/html5_polyglot/__init__.py",
line 445, in section_title_tags
if (ids and self.settings.section_self_link
AttributeError: 'Values' object has no attribute 'section_self_link'
Kefu Chai [Tue, 17 Aug 2021 07:53:51 +0000 (15:53 +0800)]
mgr/dashboard/api: set a UTF-8 locale when running pip
ansible-core started to include files whose filenames are encoded in
non-ascii characters, so we have to use a more capable encoding for the
locale in order to install this package. otherwise we'd have following
error:
Collecting ansible-core<2.12,>=2.11.3
Using cached ansible-core-2.11.4.tar.gz (6.8 MB)
ERROR: Exception:
Traceback (most recent call last):
File "/tmp/tmp.fX76ASIrch/venv/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 173, in _main
status = self.run(options, args)
...
File "/tmp/tmp.fX76ASIrch/venv/lib/python3.8/site-packages/pip/_internal/utils/unpacking.py", line 226, in untar_file
with open(path, "wb") as destfp:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 137-140: ordinal not in range(256)
The ragodgw-admin bucket limit check command has a bug in
octopus. Since we do not clear the bucket list before
list_buckets() returns the next max_entries, they are appended
to the existing list and we end up counting the first ones again.
This bug is triggered if bucket count exceeds max_entries and
causes duplicates in the output of radosgw-admin bucket limit check.
The fix clears the buckets structure before the list_buckets()
populates it again with the next lot of buckets to iterate through.
Mykola Golub [Fri, 20 Mar 2020 16:19:25 +0000 (16:19 +0000)]
ceph-erasure-code-tool: new tool to encode/decode files
E.g. it may be useful as a last resort when recovering an object from
a damaged PG: extract the encoded object chunks from the PG shards
with ceph-objectstore-tool and then decode with ceph-erasure-code-tool.
It also has functionality similar to what ceph_erasure_code test provides.
Kefu Chai [Thu, 10 Jun 2021 12:19:09 +0000 (20:19 +0800)]
tasks/ceph_manager: ignore EACCES when waiting for quorum
mon_tick_interval is 5 seconds by default. monitors update their
rotating keys every mon_tick_interval. before monitors forms a
quorum, the auth requests from clients are put into the wait list.
these requests are re-enqueued once the monitors form a quorum. but
there is a small window of mon_tick_interval, before they are able
to serve the auth requests even after their claim to be able to
server requests. if these re-enqueued requests happen to be served
in this window, and if authx is enabled, they will be greeted with
errors like
handle_auth_bad_method server allowed_methods [2] but i only support [2]
in the case of ceph cli, the error would look like:
[errno 13] RADOS permission denied (error connecting to the cluster)
so, to address this issue, the EACCES error is ignored when waiting
for a quorum.
ceph-monstore-tool: use a large enough paxos/{first,last}_committed
so the rebuild paxos transaction won't be overwritten by the ones
created before recovery completes.
when the quorum is recovering, the leader will collect the paxos
transactions from peons. if the quorum accept the proposal for setting
the fingerprint, the peon will update the monitor with the paxos
transaction with a newer "last_committed" than the one created using
update_paxos() in ceph_monstore_tool.cc. the latter "last_committed" is
always 0.
so, to avoid this extra paxos proposal obsoleting the "rebuilding" paxos
transaction, we use a large enough number for {first,last}_committed.
Snapshot replayer needs the remote's mirror peer uuid to find its
snapshots in the remote image. It is obtained by listing remote's
mirror peers but RemotePoolPoller::handle_mirror_peer_list() skips
tx-only (MIRROR_PEER_DIRECTION_TX) peers. In effect only rx-tx
(MIRROR_PEER_DIRECTION_RX_TX) peers are considered for matching
and snapshot replayer always fails with "failed to retrieve mirror
peer uuid from remote pool" error.
Instead, skip rx-only (MIRROR_PEER_DIRECTION_RX) peers as we are
definitely not interested in anything having to do with mirroring
_to_ the remote cluster.
auth,mon: don't log "unable to find a keyring" error when key is given
This error is logged even if --key or --keyring are specified and
confuses users because the command actually does its job and exits
with success. This primarily affects "rbd mirror pool peer bootstrap
import" command and rbd-mirror and cephfs-mirror daemons which connect
to the remote cluster with just mon_host and key:
$ rbd mirror pool peer bootstrap import mypool tokenfile
... -1 auth: unable to find a keyring on /etc/ceph/..keyring,/etc/ceph/.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/..keyring,/etc/ceph/.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/..keyring,/etc/ceph/.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Local cluster commands are affected too:
$ rados --no-config-file --mon-host $MON_HOST --key $KEY lspools
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
... -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
device_health_metrics
rbd
This was introduced in commit 98a2e5c59daa ("rados: translate errno to
str in CLI").
The reaping is asynchronous, so allow for a delay in the count change.
Fixes: https://tracker.ceph.com/issues/50622 Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit c8c5071dcd4b0b788f5e924a678095ce5dc1d7f8) Signed-off-by: Gerald Yang <gerald.yang.tw@gmail.com>
Sage Weil [Wed, 19 May 2021 19:23:26 +0000 (15:23 -0400)]
msg/async: configurable threshold for reaping dead connections
It is helpful to set this to 1 for tests.
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 8129d6bb953015cc05db458afa6aa9b8f5f62614) Signed-off-by: Gerald Yang <gerald.yang.tw@gmail.com>
Sage Weil [Mon, 19 Apr 2021 14:26:30 +0000 (09:26 -0500)]
msgr/async: fix unsafe access in unregister_conn()
We were looking at anon_conns and accepting_conns without holding
the lock (deleted_lock is not sufficient).
Drop this test, and move the decrements:
- inc when we add to conns or anon_conns (no changes there)
- dec when we remove from deleted_conns (several different paths!)
Fixes: https://tracker.ceph.com/issues/49237 Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit d51d80b3234e17690061f65dc7e1515f4244a5a3) Signed-off-by: Gerald Yang <gerald.yang.tw@gmail.com>
Cherry-pick notes:
- Options defined in src/common/options.cc in Octopus vs src/common/options/rgw.yaml.in
- RGWQuotaCache::get_stats does not take optional_yeild or DoutPrefixProvider arguments in Octopus
J. Eric Ivancich [Tue, 15 Jun 2021 19:20:33 +0000 (15:20 -0400)]
rgw: when deleted obj removed in versioned bucket, extra del-marker added
After initial checks are complete, this will read the OLH earlier than
previously to check the delete-marker flag and under the bug's
conditions will return -ENOENT rather than create a spurious delete
marker.
Cherry-pick notes:
- RGWRados::apply_olh_log does not take DoutPrefixProvider in Octopus
- change to use some namespace-qualified names in cls_rgw_types
Jeegn Chen [Wed, 25 Nov 2020 09:15:25 +0000 (17:15 +0800)]
rgw: avoid infinite loop when deleting a bucket
When deleting a bucket with an incomplete multipart upload that
has about 2000 parts uploaded, we noticed an infinite loop, which
stopped s3cmd from deleting the bucket forever.
Per check, when the bucket index was sharded (for example 128
shards), the original logic in
RGWRados::cls_bucket_list_unordered() did not calculate
the bucket shard ID correctly when the index key of a data
part was taken as the marker.
The issue is not necessarily reproduced each time. It will depend
on the key of the object. To reproduce it in 128-shard bucket,
we use 334 as the key for the incomplete multipart upload,
which will be located in Shard 127 (known by experiment). In this
setup, the original logic will usually come out a shard ID smaller
than 127 (since 127 is the largest one) from the marker and
thus a circle is constructed, which results in an infinite loop.
PS: Some times the bucket ID calculation may incorrectly going forward
instead of backward. Thus, the check logic may skip some shards,
which may have regular keys. In such scenarios, some non-empty buckets may
be deleted by accident.
Dimitri Savineau [Tue, 24 Aug 2021 21:17:45 +0000 (17:17 -0400)]
ceph-volume: fix lvm activate --all --no-systemd
When using a system without systemd then the `lvm activate --all --no-systemd`
subcommand still calls systemd.
We already allow users to activate a single OSD without systemd so there's
no reason to not do the same with --all (because activate_all calls activate).
In /etc/sysconfig/ceph we allow operators to define if ceph daemons
should be restarted on upgrade: CEPH_AUTO_RESTART_ON_UPGRADE.
But the post selinux scripts will stop ceph.target regardless if this
is set to `no`, leading to operators adding various hacks to prevent
these unexpected or inconvenient daemon restarts. By now, if users
are using rpms directly, they are likely orchestrating their own
daemon restarts so should not rely on the rpm itself to do this.
Fixes: https://tracker.ceph.com/issues/21672 Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 092a6e3e83e9ef8e37cb6f1033c345dcb5224cfc)