acting.size() >= pool.info.min_size is meant to check min_size against
acting set participants, but acting is a vector with placeholders.
actingset is the representation with placeholders removed.
The upshot of this bug is that the activation process will basically
ignore min_size for an ec pool, allowing writes in cases where it
shouldn't. PastIntervals::check_new_interval, however, performs
the check correctly, and will therefore discount intervals in which
we really did serve writes as not writeable. This can trigger many
different problem conditions including but not limited to:
- Unfound objects due to accepting a last_update with insufficient
osds
- Lost writes
- Crashes due to peering rules being violated
This bug was originally introduced with recovery below min_size in e5a96fd, and then preserved through refactors in 749a13d and 95bec9.
7cb818a exposed it with the expansion of recovery below min_size
to include ec pools (acting.size() is sufficient for replicated
pools).
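
The difference between the two checks can be illustrated with a minimal Python sketch. This is not Ceph code: the NONE placeholder value, both helper names, and the example pool parameters are assumptions made for illustration only.

```python
# Illustrative model only; NONE and both helpers are hypothetical names,
# not Ceph identifiers.
NONE = -1  # stand-in for the placeholder marking an unfilled shard slot


def buggy_check(acting, min_size):
    # Counts placeholder slots too, so an EC acting vector always looks full.
    return len(acting) >= min_size


def correct_check(acting, min_size):
    # Counts only real participants, i.e. the set with placeholders removed.
    actingset = {osd for osd in acting if osd != NONE}
    return len(actingset) >= min_size


# Assumed EC pool with 6 shard slots and min_size 5; only 4 OSDs are present.
acting = [3, NONE, 7, 1, NONE, 9]
print(buggy_check(acting, 5))    # True
print(correct_check(acting, 5))  # False
```

For a replicated pool the acting vector carries no placeholders, so both checks agree, which matches the note above that acting.size() is sufficient there.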
Fixes: https://tracker.ceph.com/issues/48613
Fixes: https://tracker.ceph.com/issues/48417
Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 642a1c165499bcbd4cfdf907af313ac7ffe44ff4)
Conflicts:
src/osd/PeeringState.h
Fixes the callers rather than also backporting 95bec9873.
mgr/dashboard: allow getting fresh inventory data from the orchestrator
When there is a device change, a `ceph orch device ls --refresh` command
needs to be called so the orchestrator can invalidate its cache and
refresh all devices on all nodes. Currently, the call is asynchronous and
there is no way to determine whether a refresh is done or not.
To allow doing a refresh in the Dashboard:
- The inventory device list is periodically updated with cached data.
- If the user clicks the refresh button, a refresh call is sent to the
orchestrator. Any device changes will then show up soon through the
periodic update.
Mykola Golub [Tue, 11 May 2021 06:53:08 +0000 (07:53 +0100)]
osd: don't assert in-flight backfill is always in recovery list
In PrimaryLogPG::on_failed_pull, we unconditionally remove soid
from recovering list, but remove it from backfills_in_flight only
when the backfill source is the primary osd.
Kefu Chai [Fri, 16 Oct 2020 17:10:24 +0000 (01:10 +0800)]
pybind/mgr/dashboard: use setUpClass for initializing class
Instead of relying on __init__(), use setUpClass() to initialize the class
for testing. It turns out that in pytest > 4, __init__() is called for the
test class, but the attributes of the instantiated class are in turn overridden.
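
A minimal sketch of the pattern, assuming a hypothetical test class and attribute (neither is taken from the dashboard test suite):

```python
import unittest


class ControllerTest(unittest.TestCase):  # hypothetical test class
    @classmethod
    def setUpClass(cls):
        # Runs once for the class before any test method, so the attribute
        # does not depend on how the runner instantiates the class.
        cls.shared_config = {"endpoint": "/api/example"}

    def test_config_is_available(self):
        self.assertEqual(self.shared_config["endpoint"], "/api/example")
```

Because setUpClass() sets attributes on the class itself, they stay visible to every test method under both unittest and pytest.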
Kefu Chai [Thu, 8 Oct 2020 07:13:36 +0000 (15:13 +0800)]
tools/setup-virtualenv.sh: pass --use-feature=2020-resolver to pip
as long as pip supports this option, pass it to `pip install`
to silence warnings and errors like:
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
autopep8 1.5.4 requires pycodestyle>=2.6.0, but you'll have pycodestyle 2.5.0 which is incompatible.
pytest-cov 2.10.1 requires pytest>=4.6, but you'll have pytest 3.10.1 which is incompatible.
pybind/mgr/dashboard: move pytest into requirements.txt
before this change, pytest is included by both requirements-lint.txt
and requirements-test.txt. this fails the install-deps.sh script when
collecting the python package wheels:
Dan van der Ster [Thu, 29 Apr 2021 23:06:17 +0000 (01:06 +0200)]
mgr/progress: ensure progress stays between [0,1]
If _original_pg_count is 0 then progress can be negative.
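
One way to keep a value inside [0,1] is to clamp it. The sketch below is an assumption about the general technique, not the code from this commit; clamp_progress is a hypothetical name.

```python
def clamp_progress(value):
    """Force a progress value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


print(clamp_progress(-0.5))  # 0.0
print(clamp_progress(0.25))  # 0.25
print(clamp_progress(1.7))   # 1.0
```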
Fixes: https://tracker.ceph.com/issues/50591
Related-to: https://tracker.ceph.com/issues/50587
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 20990a94598d0249745e2ec25c9197d842119d92)
librbd/mirror/snapshot: avoid UnlinkPeerRequest with an unlinked peer
CreatePrimaryRequest could create some UnlinkPeerRequest with an already
unlinked peer in a scenario where you have multiple peers. This request
will not remove the peer (as it's already not linked to the requested
peer) and will skip deletion of the mirror snapshot if another peer
remains. Eventually the code will go through an infinite recursive loop
between CreatePrimaryRequest and UnlinkPeerRequest and segfault.
This commit adds an extra condition to make sure that an
UnlinkPeerRequest is not submitted if the peer is not linked to the
current snapshot. If there is already no peer in the list, it will
submit an UnlinkPeerRequest to remove the snapshot.
Fixes: https://tracker.ceph.com/issues/50439
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit c6e2953fdb9e29cfb5fb4e04fd633862160cdb13)
Ilya Dryomov [Tue, 4 May 2021 13:50:05 +0000 (15:50 +0200)]
common/buffer: adjust align before calling posix_memalign()
posix_memalign() requires alignment argument to be a multiple of
sizeof(void *). Since it is an implementation detail of buffer,
it needs to be adjusted there -- buffer consumers have no way of
knowing that passing e.g. align == 4 is incorrect.
One place already does the adjustment, but only for align == 0.
The other just asserts. Fix both and remove the "power of two"
assertion. Let posix_memalign() return EINVAL and handle that
by throwing buffer::bad_alloc, as expected by the consumers.
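
The adjust-then-validate idea can be modelled in a few lines of Python. This is a sketch under assumptions, not the buffer code: the helper names are hypothetical, raising the alignment to at least one pointer size is only one plausible adjustment, and MemoryError stands in for buffer::bad_alloc.

```python
import ctypes

POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)  # sizeof(void *) on this platform


def adjust_align(align):
    # Assumed adjustment: raise small alignments (including 0) to one
    # pointer size.
    return max(align, POINTER_SIZE)


def is_valid_posix_memalign_alignment(align):
    # POSIX requires a power of two that is a multiple of sizeof(void *).
    is_power_of_two = align > 0 and (align & (align - 1)) == 0
    return is_power_of_two and align % POINTER_SIZE == 0


def checked_align(align):
    adjusted = adjust_align(align)
    if not is_valid_posix_memalign_alignment(adjusted):
        # Stand-in for posix_memalign() returning EINVAL and the caller
        # turning that into buffer::bad_alloc.
        raise MemoryError(f"invalid alignment {align}")
    return adjusted
```

With this model, checked_align(4) yields one pointer size, checked_align(4096) is passed through unchanged, and a non-power-of-two value such as 24 is rejected instead of tripping an assertion.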
os/FileStore: fix to handle readdir error correctly
Currently the filestore code does not handle readdir errors.
As man readdir(3) says, we need to check errno after readdir
returns NULL to determine whether an error occurred.
This patch fixes all readdir() calls to check errno and
handle it appropriately:
- FileStore.cc ... abort if an EIO error happens
- BtrfsFileStoreBackend.cc/LFNIndex.cc
... return the error to the upper layer
Without this fix, the primary PG could fail to perform the backfill
operation correctly, which could lead to the data loss propagation
described in #50558.
J. Eric Ivancich [Wed, 14 Apr 2021 17:55:22 +0000 (13:55 -0400)]
rgw: during reshard lock contention, adjust logging
When RGW fails to get a lock on a reshard log, we log it in such a way
that it looks like an error. Instead we'll make sure that the log
message is informational.
Xiubo Li [Wed, 21 Apr 2021 13:00:19 +0000 (21:00 +0800)]
mds: do not trim the inodes from the lru list in standby_replay
In standby_replay, some dentries may have just been added/linked without
yet having had a chance to replay the EOpen journal entries that follow.
If upkeep_main() is executed at that point, it may trim them out
immediately, and replaying the EOpen journal entries later will then fail.
In standby_replay, let's skip trimming them if dentry's linkage inode
is not nullptr.
Patrick Donnelly [Tue, 30 Mar 2021 03:09:30 +0000 (20:09 -0700)]
mds: trim cache regularly for standby-replay
This change is slightly awkward because standby-replay MDS do not do all
the kinds of upkeep a normal active MDS does. In particular, it is not
going to recall client state from clients.
This diff also merges the extra recall_client_state in
MDCache::check_memory_usage into its only caller (the upkeep thread)
where it was also doing a recall. That's just a matter of merging the
recall flags. This has the added benefit of making
MDCache::check_memory_usage callable for all MDS daemons regardless of
state.
Fixes: https://tracker.ceph.com/issues/50048
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 19293d9b9d19c32af4de655cd59e206056b2417d)
While displaying the host pattern in the OSDs placement tab, it gets split at semicolons. Also adjusted the column sizes of the Container Image ID and Placement columns.
mgr/dashboard: filesystem pool size should use stored stat
Fixes: https://tracker.ceph.com/issues/50195
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Replaces 'bytes_used' with 'stored' stat to see the correct results
of CephFS pool stats.
mon/MonClient: reset authenticate_err in _reopen_session()
Otherwise, if "mon host" list has at least one unqualified IP address
without a port and both msgr1 and msgr2 are turned on, there is a race
affecting MonClient::authenticate().
For backwards compatibility reasons such an address is expanded into
two entries, each being treated as a separate monitor. For example,
"mon host = 1.2.3.4" generates the following initial monmap:
0: v1:1.2.3.4:6789/0
1: v2:1.2.3.4:3300/0
See MonMap::_add_ambiguous_addr() for details.
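
As a rough illustration of that expansion (a hypothetical helper, not the MonMap::_add_ambiguous_addr() implementation; the ports and entry format are copied from the example above):

```python
def expand_ambiguous_addr(ip):
    # One unqualified IP becomes a msgr1 entry and a msgr2 entry, each of
    # which is then treated as a separate monitor.
    return [f"v1:{ip}:6789/0", f"v2:{ip}:3300/0"]


for rank, addr in enumerate(expand_ambiguous_addr("1.2.3.4")):
    print(f"{rank}: {addr}")
```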
Then, the following can happen:
1. we connect to both endpoints and attempt to authenticate
2. authenticate() sets authenticate_err to 1 and sleeps on auth_cond
3. msgr1 authenticates first (i.e. it gets the final MAuth message
before msgr2 gets the monmap)
4. active_con is set to msgr1 connection, msgr2 connection is closed
as redundant
5. _finish_auth() sets authenticate_err to 0 and signals auth_cond,
but before either the monmap is received or authenticate() wakes
up, msgr1 connection is closed due to a network hiccup
6. ms_handle_reset() calls _reopen_session() which clears active_con
and again connects to both endpoints and attempts to authenticate
7. authenticate() wakes up, sees that there is no active_con and goes
back to sleep, but this time with authenticate_err == 0
8. msgr2 authenticates first but doesn't call _finish_auth() because
it is called only if authenticate_err == 1
9. active_con is set to msgr2 connection, msgr1 connection is closed
as redundant
10. authenticate() hangs on auth_cond until the timeout, which defaults
to 5 minutes
The discrepancy between msgr1 and msgr2 plays a key role. For msgr1,
authentication is considered to be complete as soon as the final MAuth
message is received -- the monmap is not waited for. For msgr2,
authentication is considered to be complete only after the monmap is
received.
Avoid the race by setting authenticate_err to 1 in _reopen_session(),
so that _finish_auth() is called on/after every authentication attempt
instead of just the first one.
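
The sequence above can be replayed with a small deterministic model. It is a simplification under assumptions, not MonClient code: the class, its methods, and the pending_wakeups counter (standing in for signals on auth_cond) are hypothetical, and threads are replaced by explicit calls in the order of the numbered steps.

```python
class MonClientModel:
    """Toy model of the authenticate_err / auth_cond interaction."""

    def __init__(self, reset_err_on_reopen):
        self.reset_err_on_reopen = reset_err_on_reopen
        self.authenticate_err = 0
        self.active_con = None
        self.pending_wakeups = 0  # signals sent on auth_cond, not yet consumed

    def start_authenticate(self):
        self.authenticate_err = 1  # step 2

    def connection_authenticated(self, name):
        self.active_con = name
        if self.authenticate_err == 1:  # _finish_auth() only runs in this case
            self.authenticate_err = 0
            self.pending_wakeups += 1

    def reopen_session(self):
        self.active_con = None
        if self.reset_err_on_reopen:  # the fix described above
            self.authenticate_err = 1

    def waiter_wakes(self):
        # Returns True if authenticate() would return, False if it sleeps on.
        if self.pending_wakeups == 0:
            return False
        self.pending_wakeups -= 1
        return self.active_con is not None


def run_scenario(reset_err_on_reopen):
    m = MonClientModel(reset_err_on_reopen)
    m.start_authenticate()               # step 2
    m.connection_authenticated("msgr1")  # steps 3-5
    m.reopen_session()                   # steps 5-6: msgr1 connection drops
    m.waiter_wakes()                     # step 7: no active_con, sleeps again
    m.connection_authenticated("msgr2")  # steps 8-9
    return m.waiter_wakes()              # step 10


print(run_scenario(reset_err_on_reopen=False))  # False: waiter is never woken
print(run_scenario(reset_err_on_reopen=True))   # True: waiter returns
```

In the model, the only difference between the two runs is whether reopen_session() puts authenticate_err back to 1, which is what allows the second successful authentication to signal the waiter.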