Sage Weil [Tue, 18 Sep 2012 21:38:47 +0000 (14:38 -0700)]
mon: refactor osd failure report tracking
- use structs to track allegedly failed nodes, and reports against them.
- use methods to handle reports and the failure threshold logic.
- calculate failed_since based on the OSD's reported failed_for duration
This will make it simpler to extend the logic when we add dynamic
grace periods.
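A minimal sketch of that shape, with hypothetical names (failure_info_t, failure_reporter_t, add_failure_report, should_mark_down) standing in for whatever the monitor actually uses:

    #include <ctime>
    #include <map>

    // One report filed against an allegedly failed osd.
    struct failure_reporter_t {
      time_t reported_at = 0;       // when this reporter complained
    };

    // All reports against a single allegedly failed osd.
    struct failure_info_t {
      std::map<int, failure_reporter_t> reporters;  // reporter osd -> report
      time_t failed_since = 0;      // earliest claimed failure time
    };

    std::map<int, failure_info_t> failure_info;     // target osd -> reports

    // Handle one report.  failed_for is how long the reporter says the
    // target has been unresponsive, so failed_since = now - failed_for.
    void add_failure_report(int target, int reporter,
                            time_t now, time_t failed_for) {
      failure_info_t& fi = failure_info[target];
      fi.reporters[reporter].reported_at = now;
      time_t since = now - failed_for;
      if (fi.failed_since == 0 || since < fi.failed_since)
        fi.failed_since = since;
    }

    // Failure-threshold logic: mark down once enough distinct reporters agree.
    bool should_mark_down(const failure_info_t& fi, size_t min_reporters) {
      return fi.reporters.size() >= min_reporters;
    }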
Sage Weil [Tue, 28 Aug 2012 03:02:12 +0000 (20:02 -0700)]
mon: adjust or decay laggy probabilities on osd boot
On each osd boot, determine whether the osd was laggy (wrongly marked down)
or newly booted. Either update the laggy probability and interval or
decay the values, as appropriate.
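A rough sketch of the boot-time choice; the moving-average update, the halflife decay, and the weight/halflife parameter names are illustrative assumptions, not the real config options:

    #include <cmath>

    struct laggy_state {
      double laggy_probability = 0.0;  // P(the down event was only lagginess)
      double laggy_interval    = 0.0;  // expected seconds until it comes back
    };

    // Called when an osd boots.  was_laggy: it had been marked down but came
    // back without restarting.  down_for: how long it was down.  dt: seconds
    // since these values were last touched.
    void on_osd_boot(laggy_state& s, bool was_laggy, double down_for,
                     double dt, double weight = 0.3, double halflife = 3600.0) {
      if (was_laggy) {
        // fold the new observation in with an exponential moving average
        s.laggy_probability = (1.0 - weight) * s.laggy_probability + weight;
        s.laggy_interval    = (1.0 - weight) * s.laggy_interval + weight * down_for;
      } else {
        // freshly booted (not laggy): decay both values toward zero
        double decay = std::exp(std::log(0.5) * dt / halflife);
        s.laggy_probability *= decay;
        s.laggy_interval    *= decay;
      }
    }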
Sage Weil [Tue, 28 Aug 2012 02:57:48 +0000 (19:57 -0700)]
osdmap: include osd_xinfo_t to track laggy probabilities, timestamps
Track information about laggy probabilities for each OSD. That is, the
probability that if it is marked down it is because it is laggy, and
the expected interval it will take to recover if it is laggy.
We store this in the OSDMap because it is not convenient to keep it
elsewhere in the monitor. Yet. When the new mon infrastructure is in
place, there is a bunch of stuff that can be moved out of the OSDMap
'extended' section into other mon data structures.
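Roughly the shape of the per-osd extended info; the field names below are illustrative, not the exact encoding:

    #include <ctime>

    struct osd_xinfo_t {
      time_t   down_stamp = 0;          // when the osd was last marked down
      double   laggy_probability = 0.0; // chance a down mark was only lagginess
      unsigned laggy_interval = 0;      // expected seconds to recover if laggy
    };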
Sage Weil [Tue, 11 Sep 2012 21:50:53 +0000 (14:50 -0700)]
obsync: if OrdinaryCallingFormat fails, try SubdomainCallingFormat
This blindly tries the Subdomain calling format if the ordinary method
fails. In particular, this works around buckets that present a
PermanentRedirect message.
See bug #3128.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Matthew Wodrich <matthew.wodrich@dreamhost.com>
Samuel Just [Tue, 11 Sep 2012 18:05:40 +0000 (11:05 -0700)]
ReplicatedPG: do not start_recovery_op if we are already pushing
Should fix bug #2761.
If we are already pushing soid, recovery_ops will only be decremented once for
all current pushes, so only increment recovery_ops if we are not currently
pushing it.
This bug causes us to leak a recovery op and get stuck in backfill.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
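A toy model of the guard, with the containers reduced to standard types and the real PG state omitted:

    #include <map>
    #include <set>
    #include <string>

    // 'pushing' tracks objects with in-flight pushes; 'recovery_ops' is the
    // counter that must see exactly one start per object being recovered.
    std::map<std::string, std::set<int>> pushing;   // soid -> replicas being pushed
    int recovery_ops = 0;

    void start_push(const std::string& soid, int replica) {
      bool already_pushing = pushing.count(soid) > 0;
      pushing[soid].insert(replica);
      if (!already_pushing)
        ++recovery_ops;        // start_recovery_op: count soid only once
    }

    void finish_pushes(const std::string& soid) {
      pushing.erase(soid);
      --recovery_ops;          // decremented once per soid, not once per push
    }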
Sage Weil [Tue, 4 Sep 2012 22:25:20 +0000 (15:25 -0700)]
osd: fill in user log entry last after snapdir tran
Reorder the snapdir logic and ctx->at_version adjustments prior to filling
in the object_info_t, user_versions, and related fields. Adjust at_version
after appending each log entry, so that it points to the next
position/version we will write at, culminating in the actual user event.
The user log entry contains the request id, which will be used
by replay ops to put themselves in the correct place in the
waiting_for_commit/ack maps. Thus, the repop needs to be tagged
with the same version as the log entry with the request id.
Thus, the request id bearing log entry should be the last in
the log entry vector.
This should fix #3072, wherein a replay which should wait on
the repop tagged as version '36 will instead wait on '35.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
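Illustrative only, not the actual ReplicatedPG code; a toy of the ordering constraint described above:

    #include <string>
    #include <vector>

    struct log_entry { unsigned version; std::string reqid; };

    // Bookkeeping entries (e.g. the snapdir change) take earlier versions;
    // the entry carrying the client's request id is appended last, and the
    // repop is tagged with that same version.
    unsigned append_log_entries(std::vector<log_entry>& log,
                                unsigned at_version,
                                const std::string& client_reqid) {
      log.push_back({at_version++, ""});           // snapdir bookkeeping entry
      log.push_back({at_version, client_reqid});   // user event goes last
      return at_version;                           // version to tag the repop with
    }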
Sage Weil [Tue, 11 Sep 2012 15:48:34 +0000 (08:48 -0700)]
mon: make redundant osd.NNN argument optional
Instead of 'osd crush set NNN osd.NNN weight loc...', make the second
osd.NNN option optional, and allow either NNN or osd.NNN to specify the
osd id. This makes the usage much more sane, but maintains backward
compatibility.
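A sketch of id parsing that accepts either spelling; parse_osd_id is a made-up helper, not the monitor's actual parser:

    #include <cstdlib>
    #include <cstring>
    #include <string>

    // Accept either "NNN" or "osd.NNN" as the osd id.
    bool parse_osd_id(const std::string& arg, int* id) {
      const char* p = arg.c_str();
      if (std::strncmp(p, "osd.", 4) == 0)
        p += 4;                                 // allow the osd. prefix
      char* end = nullptr;
      long v = std::strtol(p, &end, 10);
      if (end == p || *end != '\0' || v < 0)
        return false;                           // not a valid id either way
      *id = static_cast<int>(v);
      return true;
    }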
Instead of just keeping flat usage info per bucket, we now maintain a list
of categories in which request usage is aggregated. Ops are put in
categories based on their names.
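A sketch of the per-category aggregation; the type and field names are invented for illustration:

    #include <cstdint>
    #include <map>
    #include <string>

    struct usage_counters {
      uint64_t ops = 0, successful_ops = 0;
      uint64_t bytes_sent = 0, bytes_received = 0;
    };

    // bucket -> category (derived from the op name, e.g. "put_obj") -> counters
    std::map<std::string, std::map<std::string, usage_counters>> usage;

    void account(const std::string& bucket, const std::string& category,
                 uint64_t sent, uint64_t received, bool ok) {
      usage_counters& c = usage[bucket][category];
      c.ops++;
      if (ok)
        c.successful_ops++;
      c.bytes_sent += sent;
      c.bytes_received += received;
    }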
Samuel Just [Wed, 5 Sep 2012 22:56:25 +0000 (15:56 -0700)]
PG: clear want_acting in choose_acting if want == acting
Otherwise, a pg_temp from a previous peering sequence
(but not a different peering_interval) might leak through
into Active and incorrectly trip the
Active::react(AdvMap&) asserts regarding want_acting.
Those asserts assume that want_acting is either empty or is a result of
recovery completion. In the latter case, the want_acting set must consist
only of elements of up and acting.
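A simplified sketch of the check, with the container types reduced to plain vectors:

    #include <vector>

    std::vector<int> acting;       // current acting set
    std::vector<int> want_acting;  // pending pg_temp request, if any

    bool choose_acting(const std::vector<int>& want) {
      if (want == acting) {
        want_acting.clear();       // nothing to request; drop any stale value
        return true;               // acting already matches what we want
      }
      want_acting = want;          // otherwise request a pg_temp mapping
      return false;
    }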
Mike Ryan [Mon, 27 Aug 2012 18:16:17 +0000 (11:16 -0700)]
osd: deep scrub, read file contents from disk and compare digest
Deep scrub reads the contents of every file from the store and computes
a crc32 digest. The primary compares the digests from all replicas and will
mark the PG inconsistent if any don't match.
OSDs that do not support deep scrub simply perform an ordinary chunky
scrub. Any subset of OSDs that do support deep scrub will have their
digests compared.
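A sketch of the digest computation and comparison, using zlib's crc32 for illustration rather than the OSD's own checksumming code:

    #include <cstdint>
    #include <map>
    #include <vector>
    #include <zlib.h>

    // Each replica computes a CRC-32 over the object's bytes.
    uint32_t object_digest(const std::vector<unsigned char>& data) {
      uLong crc = crc32(0L, Z_NULL, 0);
      crc = crc32(crc, data.data(), data.size());
      return static_cast<uint32_t>(crc);
    }

    // The primary compares digests; any mismatch marks the PG inconsistent.
    bool digests_match(const std::map<int, uint32_t>& digest_by_osd) {
      if (digest_by_osd.empty())
        return true;
      uint32_t first = digest_by_osd.begin()->second;
      for (const auto& p : digest_by_osd)
        if (p.second != first)
          return false;
      return true;
    }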
Mike Ryan [Mon, 16 Jul 2012 22:58:26 +0000 (15:58 -0700)]
osd: chunky scrub, scrub PGs a chunk of objects at a time
Chunky scrub is a more efficient scrub. It blocks writes on a subset of
objects and scrubs those, allowing writes through to the rest of the PG.
The scrub takes longer to complete than a classic scrub, but improves
overall write throughput.
This feature is backward-compatible with classic scrub. If the primary
detects that any replica does not have the chunky scrub feature, it
falls back to the less efficient classic scrub.
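A sketch of the chunking loop; the write-blocking and map-comparison steps are left as comments since they depend on PG internals:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Walk the PG's objects a chunk at a time; only the current chunk has
    // writes blocked, so the rest of the PG keeps taking writes.
    void chunky_scrub(const std::vector<std::string>& pg_objects,
                      size_t chunk_size) {
      for (size_t i = 0; i < pg_objects.size(); i += chunk_size) {
        size_t end = std::min(i + chunk_size, pg_objects.size());
        std::vector<std::string> chunk(pg_objects.begin() + i,
                                       pg_objects.begin() + end);
        // 1. block writes to the objects in 'chunk'
        // 2. request scrub maps for just this chunk from each replica
        // 3. compare the maps and record any inconsistencies
        // 4. unblock writes and move on to the next chunk
      }
    }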