Neha Ojha [Mon, 7 Jan 2019 23:26:27 +0000 (15:26 -0800)]
mon/OSDMonitor.cc: make a note about reusing jewel feature bit
For OSD_PGLOG_HARDLIMIT, we have reused a jewel feature bit that was retired
in luminous. Therefore, we need to check that the release version is
>= CEPH_RELEASE_LUMINOUS before using it.
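A minimal sketch of the gating described above, assuming a reused feature bit is only trusted when the peer's release is new enough. The constants and the helper name are hypothetical stand-ins for illustration, not the definitions from the Ceph headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the real values live in the Ceph headers.
constexpr int RELEASE_JEWEL = 10;
constexpr int RELEASE_LUMINOUS = 12;
constexpr std::uint64_t FEATURE_PGLOG_HARDLIMIT = 1ull << 5;  // illustrative bit

// A reused bit is only meaningful once the release is new enough that the
// old jewel meaning of the same bit has been retired.
bool supports_pglog_hardlimit(std::uint64_t features, int release) {
  assert(release >= 0);
  return release >= RELEASE_LUMINOUS &&
         (features & FEATURE_PGLOG_HARDLIMIT) != 0;
}
```

Checking the release first keeps a jewel peer that still advertises the old meaning of the bit from being mistaken for one that supports the hard limit.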
Conflicts:
src/include/rados.h
src/mon/MonCommands.h
src/mon/OSDMonitor.cc
src/osd/OSDMap.cc
Luminous does not have CEPH_OSDMAP_NOSNAPTRIM flag.
In nautilus, CEPH_OSDMAP_PGLOG_HARDLIMIT is set by default,
which is not the case in luminous.
In https://github.com/ceph/ceph/pull/21580 I set a trap to catch some weird
and random segmentation faults, and in a recent QA run I observed that it was
successfully triggered by one of the test cases, see:
The root cause is that there may be holes in the log versions, so the
approx_size() method will (almost) always overestimate the actual number of log entries.
As a result, we risk overtrimming log entries.
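The overestimate can be illustrated with a simplified model. This sketch assumes the size estimate is the distance between the newest version and the tail version; the flat list of version numbers and both helper names are hypothetical and much simpler than the real pg log.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: a log is a sorted list of entry versions plus a tail.
// The estimate is cheap, but it also counts versions that have no entry.
std::uint64_t approx_size(const std::vector<std::uint64_t>& versions,
                          std::uint64_t tail) {
  if (versions.empty()) return 0;
  assert(versions.back() >= tail);
  return versions.back() - tail;
}

// Number of entries a trim would remove in order to keep at most `keep`.
std::uint64_t entries_to_trim(std::uint64_t size, std::uint64_t keep) {
  return size > keep ? size - keep : 0;
}
```

With versions {11, 12, 15, 20} and a tail of 10, the estimate is 10 although only 4 entries exist. A trim that keeps 5 entries would remove 5 based on the estimate, whereas the exact count calls for removing none.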
https://github.com/ceph/ceph/pull/18338 reveals a probably easier way
to fix the above problem, but unfortunately it can also cause a big performance
regression, hence this PR.
For the auto-repair (EIO-caused) case, we will not reinitialize
**complete_to** (because last_complete is equal to last_update!),
and hence there is a chance that **complete_to** already
points to **log.end()** before we call recover_got.
We could simply drop it here as we (already) logged the **complete_to**
iterator change in a more compatible way a few lines below.
src/osd/PG.cc: remove redundant call to trim_log()
This change is motivated by the failure tracked in
https://tracker.ceph.com/issues/25198. The failure highlights a case where a
call to trim_log() after the PG has recovered races with the previous op
on a replica OSD. Since the previous operation has not completed, the
last_complete value for that OSD is not valid when we try to trim the
log. It is also worth noting that the race is due to MOSDPGTrim going through
the strict queue as a peering message, whereas regular ops go through the
non-strict queue.
During the investigation of this bug, we noticed that, with
https://tracker.ceph.com/issues/23979, we allow pg log trimming to
happen on the primary and replicas, whenever we cross the upper bound of
the pg log. This also ensures that pg log trimming happens while processing
any new op.
Therefore, the function trim_log(), which earlier served the purpose of
trimming logs on the primary and replicas just before the PG went into
the Recovered state, is no longer required. It acted as a last line of
defense to trim logs when we did not need the logs any more. This call
seems redundant now, because we are limiting the pg log length at all times.
Conflicts:
src/osd/PGLog.cc: Now it is possible to have complete_to version
less than or equal to trim version, because the pg log length upper
limit is a hard limit, and trim can proceed even when there is
pending recovery/backfill. So do not complain when this happens.
Remove async recovery components: The async recovery feature is not present
in luminous. We do not need commit 22d17fb5aad6ab9d7525d9492c0e96a36d02879e,
which adds a flag to remember async recovery. We have also removed async
recovery requirements from this commit and modified the commit message to
only reflect backfill.
Conflicts:
src/osd/PrimaryLogPG.cc: min_last_complete_ondisk and
pg_log.get_can_rollback_to() are no longer the limit of the pg log.
Make the head of the pg log the new limit for pg log trimming.
https://github.com/ceph/ceph/pull/25824 adds slow request to OSD logs.
To deal with it, whitelist 'slow request' instead of 'slow requests'.
This PR is specific to luminous because later versions whitelist it correctly.
IvanGuan [Fri, 4 Jan 2019 04:22:27 +0000 (12:22 +0800)]
client: fix fuse client hang because its pipe to mds is not ok
If a fuse client session has been killed by the mds, and an mds daemon restart
or hot-standby switch happens right away, the client may not receive
any message from the monitor (due to network or other problems) until
the mds becomes active again. The client then never calls closed_mds_session,
so the session remains in STATE_OPEN, but the client cannot send any message to
the mds because its pipe is not ok. So we should close the stale session so that
it can be reopened.
As the largeish g_conf() change from master isn't in mimic yet, use the g_conf
global structure; also make rgw_op use the value from the req_info ceph context,
as we do for all the requests.
The patch to enforce bounds on max-keys/max-uploads/max-parts had a few
issues that would prevent us from compiling it. Instead of changing the
code provided by the submitter, we're addressing them in a separate
commit to maintain the DCO.
Robin H. Johnson [Fri, 21 Sep 2018 21:49:34 +0000 (14:49 -0700)]
rgw: enforce bounds on max-keys/max-uploads/max-parts
RGW S3 listing operations provided a way for authenticated users to
cause a denial of service against OMAPs holding bucket indices.
Bound the min & max values that a user could pass into the max-X
parameters, to keep the system safe. The default of 1000 is chosen to
match AWS S3 behavior.
Affected operations:
- ListBucket, via max-keys
- ListBucketVersions, via max-keys
- ListBucketMultiPartUploads, via max-uploads
- ListMultipartUploadParts, via max-parts
The Swift bucket listing codepath already enforced a limit, so is
unaffected by this issue.
Prior to this commit, the effective limit is the lower of
osd_max_omap_entries_per_request or osd_max_omap_bytes_per_request.
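The bounding described above can be sketched as follows. This assumes out-of-range values are clamped rather than rejected, and that the lower bound is 0; only the default of 1000 comes from the commit message. The helper name is hypothetical and is not the RGW implementation.

```cpp
#include <algorithm>
#include <cassert>

// Default taken from the commit message; the lower bound of 0 is an assumption.
constexpr long kDefaultMaxEntries = 1000;

// Hypothetical helper: clamp a user-supplied max-keys / max-uploads /
// max-parts value into the range [0, upper].
long bound_max_entries(long requested, long upper = kDefaultMaxEntries) {
  assert(upper >= 0);
  return std::max(0L, std::min(requested, upper));
}
```

Under these assumptions a request for 1000000 entries is served as if it asked for 1000, and a negative value is treated as 0.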
Backport: luminous, mimic
Fixes: http://tracker.ceph.com/issues/35994
Signed-off-by: Robin H. Johnson <rjohnson@digitalocean.com>
(cherry picked from commit d79f68a1e31f4bc917eec1b6bbc8e8446377dc6b)
Conflicts:
src/common/options.cc:
Conflicts due to options from master
mon/config-key: limit caps allowed to access the store
Henceforth, we'll require explicit `allow` caps for commands, or for the
config-key service. Blanket caps are no longer allowed for the
config-key service, except for 'allow *'.
(for luminous and mimic, we're also ensuring MonCap's parser is able to
understand forward slashes '/' when parsing prefixes)
Patrick Donnelly [Mon, 17 Dec 2018 16:34:00 +0000 (08:34 -0800)]
mds: create heartbeat grace config option
Currently the MDS uses the mds_beacon_grace for the heartbeat timeout. If we
need to increase the beacon grace because the MDS is missing beacon replies for
some reason, we still want to see the warnings when the MDS is missing
heartbeats.
Fixes: http://tracker.ceph.com/issues/37674
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 5c143f3039c1967ca83d8a0cce35bf2a12509aef)
Conflicts:
src/mds/MDSRank.cc : Resolved in heartbeat_reset
ningtao [Thu, 3 Jan 2019 15:20:12 +0000 (23:20 +0800)]
mon: shutdown messenger early to avoid accessing deleted logger
In the monitor shutdown process, the MSG thread exits after the logger is released,
which leads to a null pointer access. So release the logger only after the MSG thread has exited.
Fixes: http://tracker.ceph.com/issues/37780
Signed-off-by: ningtao <ningtao@sangfor.com.cn>
(cherry picked from commit 47da5a0caa7edec17ff4253e363571b78372506a)
xie xingguo [Fri, 4 Jan 2019 00:39:01 +0000 (08:39 +0800)]
mon/OSDMonitor: do not populate void pg_temp into nextmap
Due to commit ea723fb, pg_temps with a clean acting set are added to the inc map.
The original intent was to clear out pg_temps during priming, but as
written it would set a new_pg_temp item clearing the pg_temp even if one
didn't already exist. Adding the up != acting condition in there makes us
only take that path if there is an existing pg_temp entry to remove.
Fixes: https://tracker.ceph.com/issues/37784
Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
(cherry picked from commit b1d3ca5e78eaee509c923f06e9024c23cc6ce31a)
Kefu Chai [Sun, 30 Dec 2018 13:57:04 +0000 (21:57 +0800)]
osd: unlock osd_lock when tweaking osd settings
unlock osd_lock when serving "debug kick_recovery_wq" command
we need to unlock osd_lock temporarily when updating the osd settings,
otherwise we will run into an assert failure, because
OSD::handle_conf_change() acquires the osd_lock, which is not a recursive
lock.
Kefu Chai [Sun, 30 Dec 2018 13:46:55 +0000 (21:46 +0800)]
osd: use unlock_guard for unlock osd temporarily
when OSD::do_command() gets called, osd_lock is acquired. but when
serving some of these commands, we need to call methods which also
acquire the osd_lock by themselves. for instance,
OSD::handle_conf_change() gets called by cct->_conf.apply_changes().
to allow them to do so, we unlock osd_lock before calling those methods,
and re-lock it after done with them.
unlock_guard is introduced to unlock and re-lock the lock in a RAII style.
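A minimal sketch of the RAII pattern described above: a guard that unlocks on construction and re-locks on destruction. This is not the Ceph implementation. `checked_mutex`, and the `handle_conf_change` and `do_command` functions shown here, are hypothetical single-threaded stand-ins for illustration.

```cpp
#include <cassert>

// Unlocks the mutex for the lifetime of the guard, then re-locks it.
template <typename Mutex>
class unlock_guard {
public:
  explicit unlock_guard(Mutex& m) : mutex_(m) { mutex_.unlock(); }
  ~unlock_guard() { mutex_.lock(); }
  unlock_guard(const unlock_guard&) = delete;
  unlock_guard& operator=(const unlock_guard&) = delete;

private:
  Mutex& mutex_;
};

// Hypothetical single-threaded stand-in for a non-recursive lock: locking it
// twice trips an assert, mimicking the failure described above.
struct checked_mutex {
  bool locked = false;
  void lock() { assert(!locked); locked = true; }
  void unlock() { assert(locked); locked = false; }
};

// Stand-in for a method that acquires the lock by itself.
int handle_conf_change(checked_mutex& osd_lock) {
  osd_lock.lock();
  int applied = 1;
  osd_lock.unlock();
  return applied;
}

// Stand-in for a command handler that is entered with the lock held.
int do_command(checked_mutex& osd_lock) {
  assert(osd_lock.locked);
  int applied = 0;
  {
    unlock_guard<checked_mutex> unlock(osd_lock);  // release for nested call
    applied = handle_conf_change(osd_lock);
  }  // lock is re-acquired here, even if the nested call throws
  return applied;
}
```

Tying the re-lock to scope exit means the caller cannot forget it on an early return or exception path, which is the main reason to prefer a guard over paired unlock() and lock() calls.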