Sage Weil [Fri, 5 Dec 2008 19:46:55 +0000 (11:46 -0800)]
osd: fix merge_log divergent item detection
An item in our log isn't divergent if it is below the bottom of
olog. Using the last_kept item isn't helpful here because
last_kept is in olog, and may be below that log's bottom.
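A minimal sketch of the check described above, not the actual merge_log code. Versions are simplified to plain integers, and `LogView`, `is_divergent`, and the "absent from olog" criterion are hypothetical names and assumptions made for illustration. The only point shown is that olog's bottom gates the test: olog says nothing about entries at or below it.

```cpp
#include <cassert>  // used by the usage checks
#include <cstdint>
#include <set>

// Assumption: versions are simplified to a single integer.
using Version = std::uint64_t;

// Hypothetical view of the other log: it covers versions in (bottom, top].
struct LogView {
  Version bottom;
  std::set<Version> entries;
};

// Returns true if one of our entries should be treated as divergent.
bool is_divergent(Version ours, const LogView& olog) {
  // At or below olog's bottom: olog does not cover this range, so the
  // entry's absence from olog tells us nothing.
  if (ours <= olog.bottom) return false;
  // Assumption for illustration: within olog's range, an entry we have
  // that olog lacks counts as divergent.
  return olog.entries.count(ours) == 0;
}
```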
Sage Weil [Fri, 5 Dec 2008 18:01:28 +0000 (10:01 -0800)]
osd: generate_backlog fixes
Generate backlog records even if the object appears in the log if
the existing entry's prior_version is non-zero and isn't also
in the log. This allows us to accurately generate the .have field
when we are building the missing map.
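The condition above can be sketched roughly as follows. This is an illustration only, not the generate_backlog implementation: `needs_backlog_record` and its parameters are hypothetical, and versions are simplified to integers with 0 meaning "no prior version".

```cpp
#include <cassert>  // used by the usage checks
#include <cstdint>
#include <set>

// Assumption: versions are simplified to a single integer; 0 = none.
using Version = std::uint64_t;

// Hypothetical helper: decide whether an object needs a backlog record.
//   object_in_log  - does the log already contain an entry for the object?
//   prior_version  - prior_version of that existing entry (if any)
//   log_versions   - versions of all entries currently in the log
bool needs_backlog_record(bool object_in_log,
                          Version prior_version,
                          const std::set<Version>& log_versions) {
  // Objects the log does not mention always get a backlog record.
  if (!object_in_log) return true;
  // The object is in the log: a record is still needed when the entry
  // points at a prior version that the log itself does not contain.
  return prior_version != 0 && log_versions.count(prior_version) == 0;
}
```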
Sage Weil [Thu, 4 Dec 2008 21:46:19 +0000 (13:46 -0800)]
mon: keep pgmap consistent
We were cutting corners and updating the live map before it
committed to paxos, since pg stats aren't system critical. This
can lead to problems due to the way "latest" is saved out, though,
and it can be confusing to see things jump backward in time.
Sage Weil [Thu, 4 Dec 2008 19:17:58 +0000 (11:17 -0800)]
osd: drop lock during most of scrub; only disallow concurrent writes
Make the PG go read-only during a scrub. Only take the pg lock
when absolutely necessary. Wait for any pending writes to
complete before starting the scrub.
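A rough, single-threaded sketch of the gating described above. `ScrubGate` and its methods are hypothetical names, and all locking is deliberately left out; the sketch only shows the ordering: new writes are refused once a scrub is requested, and the scrub starts only after in-flight writes drain.

```cpp
#include <cassert>

// Hypothetical per-PG gate; not the actual PG or OSD code.
class ScrubGate {
 public:
  // Returns true if the write may proceed, false if the PG is
  // read-only because a scrub is requested or running.
  bool start_write() {
    if (scrub_requested_) return false;
    ++pending_writes_;
    return true;
  }

  // Called when an in-flight write completes.
  void finish_write() {
    assert(pending_writes_ > 0);
    --pending_writes_;
  }

  // Marks the PG read-only; the scrub itself starts once can_scrub().
  void request_scrub() { scrub_requested_ = true; }

  // The scrub may begin only after all pending writes have completed.
  bool can_scrub() const { return scrub_requested_ && pending_writes_ == 0; }

  // Re-enables writes after the scrub finishes.
  void finish_scrub() { scrub_requested_ = false; }

 private:
  int pending_writes_ = 0;
  bool scrub_requested_ = false;
};
```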
Sage Weil [Thu, 4 Dec 2008 03:46:08 +0000 (19:46 -0800)]
osd: keep projected info on in-progress object modifications in memory
Since the primary delays its writes until after replicas ack, we need to
keep projected object info in memory for the duration, because the
semantics very much depend on whether the object exists and what its size
is (well, mainly the pg_stats do).
This can avoid re-parsing SnapSet et al for certain workloads hitting
the same objects repeatedly (e.g., mds journal objects).
Sage Weil [Wed, 3 Dec 2008 21:55:48 +0000 (13:55 -0800)]
osd: fix small quirk in read_log missing generation
The missing entry .have field was probably wrong due to the use
of missing.add_event (which assumes missing is up to date wrt
the previous log entry). Use the prior version we just pulled off
disk instead.
Sage Weil [Wed, 3 Dec 2008 21:41:12 +0000 (13:41 -0800)]
mon: always discard pending on election completion
Previously we tried to save the pending if we were still the
leader. The problem is that while we were not leader, we may have
missed out on some updates, in which case the pending may no longer
be based on the current state.
In the future, we could make the commit waiters smart about callback
return codes so that they try to reapply. For now, don't worry
about it.
Sage Weil [Tue, 2 Dec 2008 19:45:37 +0000 (11:45 -0800)]
mds: avoid overlapping release attempts
If the first release attempt is waiting for the log to flush, we
should avoid sending any RELEASED ack until all releases have
flushed. That is, only the last release will ack. Keep a counter
in Capability to do this.
Otherwise, we may close out a capability from under a release
that is flushing, and our seq # will be meaningless later in
_finish_release_cap() when we're trying to decide what to do.
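The counter idea can be sketched as below. This is an assumption-laden illustration, not the real Capability class: `CapabilitySketch`, `begin_release`, and `finish_release` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical stand-in for the counter kept in Capability.
class CapabilitySketch {
 public:
  // A release attempt starts and is waiting for the log to flush.
  void begin_release() { ++releasing_; }

  // One release attempt has flushed. Returns true only for the last
  // outstanding attempt, which is the one that should send the ack.
  bool finish_release() {
    assert(releasing_ > 0);
    --releasing_;
    return releasing_ == 0;
  }

 private:
  int releasing_ = 0;  // number of release attempts still flushing
};
```

With two overlapping attempts, the first flush completion returns false and only the second returns true, so the capability is not closed out while an earlier release is still flushing.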
Sage Weil [Tue, 2 Dec 2008 05:13:34 +0000 (21:13 -0800)]
mds: suspend instead of suicide on beacon timeout
If we don't hear from the monitor, suspend doing any useful work instead
of just committing suicide. If the monitor comes back and hasn't killed
us off, then we're fine. If we've been marked as failed, we will shut
down as before.
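The behavior above amounts to a small state machine, sketched here under stated assumptions: the `MdsState` values and both transition functions are hypothetical names, not the actual MDS code.

```cpp
#include <cassert>  // used by the usage checks

// Hypothetical states for illustration.
enum class MdsState { Active, Suspended, ShutDown };

// No beacon ack from the monitor in time: stop doing useful work,
// but stay alive instead of exiting.
MdsState on_beacon_timeout(MdsState s) {
  return s == MdsState::Active ? MdsState::Suspended : s;
}

// The monitor is reachable again. If it marked us failed in the
// meantime, shut down as before; otherwise resume where we left off.
MdsState on_monitor_contact(MdsState s, bool marked_failed) {
  if (marked_failed) return MdsState::ShutDown;
  return s == MdsState::Suspended ? MdsState::Active : s;
}
```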
Sage Weil [Mon, 1 Dec 2008 19:03:52 +0000 (11:03 -0800)]
osd: send and process heartbeats in separate thread, channel
Use a separate dispatch thread to process heartbeats. Use a
separate thread to send them. This ensures something slow
(e.g. a map update) does not make an osd appear to be down.
This also means a separate entity_addr for heartbeats, which puts
them over a separate TCP stream.
Sage Weil [Mon, 1 Dec 2008 15:13:30 +0000 (07:13 -0800)]
osd: optionally avoid zeroing trimmed log on disk
This is a half-hearted attempt to keep old PG log content around. It'll
still be lost if a PG moves to another node or the entire log is written
to disk for some other reason.
Sage Weil [Mon, 1 Dec 2008 04:42:44 +0000 (20:42 -0800)]
osd: skip peer_info on down osds
We don't clean old/down OSDs out of the peer_info map, since we may not
restart peering when strays go up/down. That's fine; just make sure
we ignore them later.
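A minimal sketch of "ignore them later", assuming OSD ids are plain ints and the set of up OSDs is available separately. `PeerInfo`, `usable_peers`, and `up_osds` are hypothetical names for illustration, not the real data structures.

```cpp
#include <cassert>  // used by the usage checks
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical placeholder for whatever is stored per peer.
struct PeerInfo {
  std::uint64_t last_update;
};

// Returns the ids of peers worth consulting: stale entries for down
// OSDs stay in the map but are skipped here.
std::vector<int> usable_peers(const std::map<int, PeerInfo>& peer_info,
                              const std::set<int>& up_osds) {
  std::vector<int> out;
  for (const auto& kv : peer_info) {
    if (up_osds.count(kv.first) == 0) continue;  // down OSD: ignore
    out.push_back(kv.first);
  }
  return out;
}
```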