Samuel Just [Tue, 7 May 2013 23:41:22 +0000 (16:41 -0700)]
OSD: handle stray snap collections from upgrade bug
Previously, we failed to clear snap_collections, which caused split to
spawn a bunch of stray snap collections. In load_pgs, we now clear any such
snap collections and then clear the snap_collections field on the PG itself.
Related: #4927 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8e89db89cb36a217fd97cbc1f24fd643b62400dc)
Samuel Just [Tue, 7 May 2013 23:35:57 +0000 (16:35 -0700)]
PG: clear snap_collections on upgrade
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 252d71a81ef4536830a74897c84a7015ae6ec9fe)
Samuel Just [Tue, 7 May 2013 23:34:57 +0000 (16:34 -0700)]
OSD: snap collections can be ignored on split
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 438d9aa152e546b2008ec355b481df71aa1c51a5)
Sage Weil [Mon, 6 May 2013 18:40:52 +0000 (11:40 -0700)]
ceph-disk: use separate lock files for prepare, activate
Use a separate lock file for prepare and activate to avoid deadlock. This
didn't seem to trigger on all machines, but in many cases the prepare
process would take the file lock and later trigger a udev event, and the
resulting activate would then block on the same lock, either when we
explicitly call 'udevadm settle --timeout=10' or when partprobe does it on
our behalf (without a timeout!). Avoid this by using separate locks for
prepare and activate. We only care if multiple activates race; it is okay
for a prepare to be in progress while an activate is kicked off.
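A minimal sketch of the idea, written in C++ rather than ceph-disk's own Python and with hypothetical lock-file paths: as long as prepare and activate grab different files, a udev-triggered activate can proceed while a prepare still holds its lock, and only concurrent activates serialize against each other.

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>
    #include <stdexcept>
    #include <string>

    // RAII wrapper around an exclusive flock() on a dedicated lock file.
    class FileLock {
      int fd;
    public:
      explicit FileLock(const std::string &path)
        : fd(open(path.c_str(), O_CREAT | O_RDWR, 0600)) {
        if (fd < 0)
          throw std::runtime_error("cannot open " + path);
        if (flock(fd, LOCK_EX) < 0) {
          close(fd);
          throw std::runtime_error("cannot lock " + path);
        }
      }
      ~FileLock() { flock(fd, LOCK_UN); close(fd); }
    };

    void prepare_disk() {
      FileLock lock("/var/lib/ceph/tmp/ceph-disk.prepare.lock");   // hypothetical path
      // partition the device, mkfs, write the fsid ... may fire a udev event
    }

    void activate_disk() {
      FileLock lock("/var/lib/ceph/tmp/ceph-disk.activate.lock");  // hypothetical path
      // mount the OSD data dir, mark it active, start the daemon
    }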
Sage Weil [Fri, 3 May 2013 23:20:26 +0000 (16:20 -0700)]
mon: fix init sequence when not daemonizing
We made the common_init_finish and chdir calls conditional on daemonize in commit 2e0dd5ae6c8751e33d456b2b06c1204b63db959a, breaking init (asok, at least)
when -f is specified (as with upstart).
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Fri, 3 May 2013 18:29:24 +0000 (11:29 -0700)]
mon: fork early to avoid leveldb static env state
leveldb has static state that prevents it from recreating its worker thread
after our fork(), even when we close and reopen the database (tsk tsk!).
Avoid this by forking early, before we touch leveldb.
Hide the details in a Preforker class. This is modeled after what
ceph-fuse already does; we should convert it later.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
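A rough sketch of the fork-early shape (illustrative only, not the Preforker interface; in the real code the parent also waits for the child to signal readiness before exiting): the point is simply that fork() happens before anything opens leveldb, so the child never inherits stale leveldb worker-thread state.

    #include <sys/types.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      bool daemonize = true;          // normally decided by -f/-d style flags
      if (daemonize) {
        pid_t pid = fork();           // fork before leveldb spawns any threads
        if (pid < 0) {
          perror("fork");
          return 1;
        }
        if (pid > 0)
          return 0;                   // parent exits (the real code first waits
                                      // for the child to report success)
        setsid();                     // child continues as the daemon
      }
      // Only now touch leveldb: open the store, join the cluster, etc.
      return 0;
    }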
Samuel Just [Wed, 1 May 2013 21:59:08 +0000 (14:59 -0700)]
OSD: load_pgs() should fill in start_split honestly
In load_pgs(), we previously considered all children created between a
loaded pg's stored epoch and the current osdmap to have that pg as their
parent. This is not correct: some of the children may have been split in
subsequent epochs from children split off in earlier epochs. Instead,
process each map individually.
Greg Farnum [Wed, 1 May 2013 21:10:31 +0000 (14:10 -0700)]
dumper: fix Objecter locking
Locking expectations changed at some point, and the Dumper wasn't
updated to comply:
1) We need to take the lock for the Objecter, as it no longer does so on
its own.
2) We need to drop the lock in several places so that the Objecter can
take delivery of messages.
Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
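A small sketch of the locking pattern (std::mutex and std::condition_variable standing in for the Mutex/Cond used in the tree; the reply path is simulated): the waiter holds the lock while issuing the operation, then releases it inside the wait so the messenger thread can deliver the reply.

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::mutex client_lock;               // stands in for the Dumper/Client lock
    std::condition_variable cond;
    bool reply_ready = false;

    // Simulates the messenger delivering the Objecter reply on another thread;
    // it needs client_lock, so the waiting thread must have dropped it.
    void deliver_reply() {
      std::lock_guard<std::mutex> l(client_lock);
      reply_ready = true;
      cond.notify_all();
    }

    int main() {
      std::unique_lock<std::mutex> l(client_lock);  // (1) lock held around the Objecter call
      std::thread messenger(deliver_reply);
      // (2) wait() releases client_lock so the reply can be processed, then reacquires it
      cond.wait(l, [] { return reply_ready; });
      messenger.join();
      return 0;
    }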
Sage Weil [Wed, 1 May 2013 17:57:35 +0000 (10:57 -0700)]
mon/Paxos: update first_committed when we trim
The Paxos::trim() -> ::trim_to() path trims old states but does not
update first_committed. This misinforms later paxos rounds: peers think
they can participate, end up with COMMIT messages following the
COLLECT/LAST exchange for future commits they can't do anything with, and
then crash out when they get the BEGIN.
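A toy illustration of the missing step (a std::map stands in for the on-disk paxos store; only the last line is the point): trimming states and advancing first_committed have to happen together, or peers are offered a range that no longer exists.

    #include <cstdint>
    #include <map>
    #include <string>

    typedef uint64_t version_t;

    struct PaxosStore {
      std::map<version_t, std::string> states;   // committed values by version
      version_t first_committed = 1;
      version_t last_committed = 0;

      // Drop everything below 'first'.
      void trim_to(version_t first) {
        states.erase(states.begin(), states.lower_bound(first));
        first_committed = first;   // the step that was missing: advertise the new lower bound
      }
    };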
Sage Weil [Wed, 1 May 2013 04:16:16 +0000 (21:16 -0700)]
mon/Paxos: don't ignore peer first_committed
We go to the effort of keeping a map of the peer's first/last committed
so that we can send the right commits during the first phase of paxos,
but we forgot to record the first value. This appears to simply be an
oversight. It is mostly harmless; it just means we send extra states
that the peer already has.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
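In sketch form (illustrative names, not the Paxos class itself), the fix amounts to recording both bounds when a peer's LAST arrives, instead of only the upper one:

    #include <cstdint>
    #include <map>

    typedef uint64_t version_t;

    std::map<int, version_t> peer_first_committed;   // rank -> peer's oldest state
    std::map<int, version_t> peer_last_committed;    // rank -> peer's newest state

    void note_peer_bounds(int from, version_t first, version_t last) {
      peer_first_committed[from] = first;   // the previously forgotten value
      peer_last_committed[from]  = last;
      // With both bounds recorded, later sharing decisions can use the peer's
      // real range instead of resending states the peer already has.
    }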
Samuel Just [Tue, 30 Apr 2013 22:48:10 +0000 (15:48 -0700)]
OSD: clean up in progress split state on pg removal
There are two cases: 1) the parent pg has not yet initiated the split;
2) the parent pg has initiated the split.
Previously, in case 1), _remove_pg left the entries for its children in the
in_progress_splits map, blocking subsequent peering attempts.
In case 1), we need to unblock requests on the child pgs when the parent is
removed. We don't need to bother waking requests, since any requests
received prior to the remove_pg request are necessarily obsolete.
In case 2), we don't need to do anything: the child will complete the split on
its own anyway.
Thus, we now track pending_splits vs in_progress_splits. Children in
pending_splits are in state 1), in_progress_splits in state 2). split_pgs
bumps pgs from pending_splits to in_progress_splits atomically with respect to
_remove_pg since the parent pg lock is held in both places.
Fixes: #4813 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
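A condensed sketch of the bookkeeping described above (plain std containers and a single mutex as stand-ins; pg ids are just integers): children wait in pending_splits until the parent actually starts the split, so removing the parent only has to clear the pending stage.

    #include <map>
    #include <mutex>
    #include <set>

    typedef unsigned pg_id;                    // stand-in for pg_t

    std::mutex osd_lock;                       // stands in for the lock protecting the maps
    std::map<pg_id, pg_id> pending_splits;     // child -> parent, split not yet started
    std::set<pg_id> in_progress_splits;        // parent has already started the split

    // Called by the parent when it starts the split (parent pg lock held).
    void split_pgs(const std::set<pg_id> &children) {
      std::lock_guard<std::mutex> l(osd_lock);
      for (pg_id c : children) {
        pending_splits.erase(c);
        in_progress_splits.insert(c);          // case 2: the child finishes on its own
      }
    }

    // Called from _remove_pg (parent pg lock held): only the pending stage needs cleanup.
    void cancel_pending_splits(pg_id parent) {
      std::lock_guard<std::mutex> l(osd_lock);
      for (auto it = pending_splits.begin(); it != pending_splits.end(); ) {
        if (it->second == parent)
          it = pending_splits.erase(it);       // case 1: unblock requests on the child pgs
        else
          ++it;
      }
    }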
mon: if we get our own sync_start back, drop it on the floor.
We have timeouts that will clean everything up, and this can happen
in some cases that we've decided are legitimate. Hopefully we'll
be able to do something else later.
Revert "mon: when electing, be sure acked leaders have new enough stores to lead"
This was somehow broken -- out-of-date leaders were being elected -- and
we've decided smaller band-aids are more appropriate. We don't completely
revert the MMonElection changes, though -- there have been user clusters
running the code which includes these messages so we can't pretend it
never happened. We can make them clearly unused in the code, though.
ObjectCacher: wait for all reads when stopping flusher
Stopping the flusher is essentially the shutdown step for the
ObjectCacher - the next thing is actually destroying it.
If we leave any reads outstanding, when they complete they will
attempt to use the now-destroyed ObjectCacher. This is particularly a
problem with rbd images, since an -ENOENT can instantly complete many
readers, so the upper layers don't wait for the other rados-level
reads of that object to finish before trying to shutdown the cache.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
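The shutdown ordering, reduced to a sketch (names are illustrative, not the ObjectCacher interface): completions decrement an in-flight counter, and stopping the flusher blocks until that counter hits zero, so nothing can call back into a destroyed cache.

    #include <condition_variable>
    #include <mutex>

    struct CacheShutdown {
      std::mutex lock;
      std::condition_variable cond;
      int reads_outstanding = 0;

      void start_read() {
        std::lock_guard<std::mutex> l(lock);
        ++reads_outstanding;
      }
      void finish_read() {
        std::lock_guard<std::mutex> l(lock);
        if (--reads_outstanding == 0)
          cond.notify_all();
      }
      // Block until every in-flight read completion has run; only after this
      // returns is it safe for the caller to destroy the cache.
      void stop_flusher() {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return reads_outstanding == 0; });
      }
    };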
elector: trigger a mon reset whenever we bump the epoch
We need to call reset during every election cycle; luckily we
can call it more than once. bump_epoch is (by definition!) only called
once per cycle, and it's called at the beginning, so we put it there.
Fixes #4858.
Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 30 Apr 2013 17:26:24 +0000 (10:26 -0700)]
mon: change leveldb block size to 64K
#leveldb on freenode says a block size > 2MB is nonsense (it might explain
the weird behavior we saw). The Riak tuning guide suggests 256KB for large
data block environments. The default is 8KB. 64KB seems sane for us.
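For reference, the knob involved is leveldb's per-table block size, set through leveldb::Options when the store is opened; a minimal standalone example (the store path is illustrative):

    #include <leveldb/db.h>

    int main() {
      leveldb::Options options;
      options.create_if_missing = true;
      options.block_size = 64 * 1024;            // 64K, the value chosen above
      leveldb::DB *db = nullptr;
      leveldb::Status s = leveldb::DB::Open(options, "/tmp/mon-store-example", &db);
      if (!s.ok())
        return 1;
      delete db;
      return 0;
    }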
David Zafman [Sat, 27 Apr 2013 01:05:18 +0000 (18:05 -0700)]
osd: read kb stats not tracked?
In read cases, track stats in PG::unstable_stats.
Include unstable_stats in write_info() and publish_stats_to_osd().
For now this information may not get persisted.
Fixes: #2209
Signed-off-by: David Zafman <david.zafman@inktank.com>
David Zafman [Mon, 29 Apr 2013 21:36:18 +0000 (14:36 -0700)]
osd: Rename members and methods related to stat publish
pg_stats_lock to pg_stats_publish_lock
pg_stats_valid to pg_stats_publish_valid
pg_stats_stable to pg_stats_publish
update_stats() to publish_stats_to_osd()
clear_stats() to clear_publish_stats()
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Tue, 30 Apr 2013 00:20:39 +0000 (17:20 -0700)]
mon: enable 'mon compact on trim' by default; trim in larger increments
This resolves the leveldb growth-without-bound problem observed by
mikedawson, and all the badness that stems from it. Enable this by
default until we figure out why leveldb is not behaving better.
While we are at it, trim more states at a time. This will make
compaction less frequent, which should help given that there is some
overhead unrelated to the amount of deleted data.
Fixes: #4815 Signed-off-by: Sage Weil <sage@inktank.com>
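A sketch of the compact-on-trim idea against raw leveldb (the key scheme and helper are made up; only the CompactRange call after the batched deletes is the point). Trimming in larger increments means fewer, bigger compactions, which amortizes the per-compaction overhead mentioned above.

    #include <cstdint>
    #include <iomanip>
    #include <sstream>
    #include <string>
    #include <leveldb/db.h>
    #include <leveldb/write_batch.h>

    // Zero-padded keys so lexicographic order matches version order (illustrative).
    static std::string paxos_key(uint64_t v) {
      std::ostringstream os;
      os << "paxos:" << std::setw(20) << std::setfill('0') << v;
      return os.str();
    }

    // Delete paxos states [first, last) in one batch, then compact the freed range.
    void trim_and_compact(leveldb::DB *db, uint64_t first, uint64_t last) {
      leveldb::WriteBatch batch;
      for (uint64_t v = first; v < last; ++v)
        batch.Delete(paxos_key(v));
      db->Write(leveldb::WriteOptions(), &batch);

      std::string begin_key = paxos_key(first), end_key = paxos_key(last);
      leveldb::Slice begin(begin_key), end(end_key);
      db->CompactRange(&begin, &end);   // reclaim the space right away
    }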
Sage Weil [Mon, 29 Apr 2013 17:51:00 +0000 (10:51 -0700)]
mon: compact leveldb on bootstrap
This is an opportunistic time to optimize our local data since we are
out of quorum. It serves as a safety net for cases where leveldb's
automatic compaction doesn't work quite right and lets things get out
of hand.
Anecdotally we have seen stores in excess of 30GB compact down to a few
hundred KB. And a 9GB store compact down to 900MB in only 1 minute.
Sage Weil [Mon, 29 Apr 2013 18:06:36 +0000 (11:06 -0700)]
mon: remap creating pgs on startup
After Monitor::init_paxos() has loaded all of the PaxosService state,
we should then map creating pgs to osds. This ensures we do so after the
osdmap has been loaded and the pgs actually map somewhere meaningful.
Fixes: #4675 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 29 Apr 2013 18:11:24 +0000 (11:11 -0700)]
mon: only map/send pg creations if osdmap is defined
This avoids calculating new pg creation mappings if the osdmap isn't
loaded yet, which currently happens during Monitor::init_paxos() on
startup. Assuming the osdmap epoch is nonzero, it should always be
safe to do this (although possibly unnecessary).
More cleanup here is certainly possible, but this is one step toward fixing
the bad behavior for #4675.
Sage Weil [Mon, 29 Apr 2013 17:45:31 +0000 (10:45 -0700)]
client: make dup reply a louder error
If we get a dup reply something is probably wrong! We should make sure
it appears more loudly in the log. In particular, it can lead to out
of sync cap state; see #4853.
Sage Weil [Mon, 29 Apr 2013 17:44:28 +0000 (10:44 -0700)]
client: fix session open vs mdsmap race with request kicking
A sequence like:
- ceph-fuse starts, make_request on getattr
- waits for mds to be active
- tries to open a session
- mds restarts, recovers
- eventually gets session open reply
- sends first getattr (even though the mds is in the reconnect state)
- gets mdsmap update that mds is now active
- kicks request, resends getattr
- get first reply
- ignore second reply, caps get out of sync
The bug is that we send the first request when the MDS is still in
the reconnect state. The fix is to loop in make_request so that we
ensure all conditions are satisfied before sending the request. Any
time we wait, we loop, so that we know all conditions (still) pass if
we make it to the end.
Fixes: #4853 Signed-off-by: Sage Weil <sage@inktank.com>
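The resulting control flow, reduced to a sketch (the predicates are trivial stand-ins for the real Client checks of the mdsmap and session state): every wait loops back to the top, so the request is only sent once all preconditions hold at the same moment.

    #include <condition_variable>
    #include <mutex>

    std::mutex client_lock;
    std::condition_variable cond;             // signalled on mdsmap/session changes

    // Trivial stand-ins for the real state checks:
    bool mds_is_active()   { return true; }   // mdsmap shows the target mds up:active
    bool session_is_open() { return true; }   // we hold an open session with it
    void send_request()    { /* hand the message to the messenger */ }

    void make_request() {
      std::unique_lock<std::mutex> l(client_lock);
      // Re-check everything after any wait: waiting for one condition (session
      // open) can invalidate another (the mds restarted in the meantime).
      while (!(mds_is_active() && session_is_open()))
        cond.wait(l);
      send_request();
    }

    int main() {
      make_request();
      return 0;
    }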