Sage Weil [Mon, 3 Jun 2013 04:21:51 +0000 (21:21 -0700)]
ceph-fuse: create finisher threads after fork()
The ObjectCacher and MonClient classes both instantiate Finisher
threads. We need to make sure they are created *after* the fork(2)
or else the process will fail to join() them on shutdown, and the
threads will not exist while fuse is doing useful work.
Put CephFuse on the heap and move all this initalization into the child
block, and make sure errors are passed back to the parent.
Fix-proposed-by: Alexandre Marangone <alexandre.maragone@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 1 Jun 2013 00:09:19 +0000 (17:09 -0700)]
mon: start lease timer from peon_init()
In the scenario:
- leader wins, peons lose
- leader sees it is too far behind on paxos and bootstraps
- leader tries to sync with someone, waits for a quorum of the others
- peons sit around forever waiting
The problem is that they never time out because paxos never issues a lease,
which is the normal timeout that lets them detect a leader failure.
Avoid this by starting the lease timeout as soon as we lose the election.
The timeout callback just does a bootstrap and does not rely on any other
state.
I see one possible danger here: there may be some "normal" cases where the
leader takes a long time to issue its first lease that we currently
tolerate, but won't with this new check in place. I hope that raising
the lease interval/timeout or reducing the allowed paxos drift will make
that a non-issue. If it is problematic, we will need a separate explicit
"i am alive" from the leader while it is getting ready to issue the lease
to prevent a live-lock.
Sage Weil [Fri, 31 May 2013 05:52:21 +0000 (22:52 -0700)]
mon: discard messages from disconnected clients
If the client is not connected, discard the message. They will
reconnect and resend anyway, so there is no point in processing it
twice (now and later).
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
- trim more at a time (by an order of magnitude)
- rename fields to paxos_trim_{min,max}; only trim when there are min items
that are trimmable, and trim at most max items at a time.
- adjust the paxos_service_trim_{min,max} values up by a factor of 2.
Since we are compacting every time we trim, adjusting these up mean less
frequent compactions and less overall work for the monitor.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Yehuda Sadeh [Thu, 30 May 2013 19:58:11 +0000 (12:58 -0700)]
rgw: only append prefetched data if reading from head
Fixes: #5209
Backport: bobtail, cuttlefish
If the head object wrongfully contains data, but according to the
manifest we don't read from the head, we shouldn't copy the prefetched
data. Also fix the length calculation for that data.
Yehuda Sadeh [Thu, 30 May 2013 16:34:21 +0000 (09:34 -0700)]
rgw: don't copy object idtag when copying object
Fixes: #5204
When copying object we ended up also copying the original
object idtag which overrode the newly generated one. When
refcount put is called with the wrong idtag the count
does't go down.
Sage Weil [Thu, 30 May 2013 21:36:41 +0000 (14:36 -0700)]
mon: make compaction bounds overlap
When we trim items N to M, compact over range (N-1) to M so that the
items in the queue will share bounds and get merged. There is no harm in
compacting over a larger range here when the lower bound is a key that
doesn't exist anyway.
mon: Monitor: backup monmap using all ceph features instead of quorum's
When a monitor is freshly created and for some reason its initial sync is
aborted, it will end up with an incorrect backup monmap. This monmap is
incorrect in the sense that it will not contain the monitor's names as
it will expect on the next run.
This results from us being using the quorum features to encode the monmap
when backing it up, instead of CEPH_FEATURES_ALL.
Fixes: #5203 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Wed, 29 May 2013 16:49:11 +0000 (09:49 -0700)]
osd: do not assume head obc object exists when getting snapdir
For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists. This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffItersateStress
test when thrashing OSDs.
Fixes: #5183
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Wed, 29 May 2013 15:35:44 +0000 (08:35 -0700)]
mon/MonitorDBStore: allow compaction of ranges
Allow a transaction to describe the compaction of a range of keys. Do this
in a backward compatible say, such that older code will interpret the
compaction of a prefix + range as compaction of the entire prefix. This
allows us to avoid introducing any new feature bits.
Sage Weil [Tue, 28 May 2013 23:35:55 +0000 (16:35 -0700)]
os/LevelDBStore: do compact_prefix() work asynchronously
We generally do not want to block while compacting a range of leveldb.
Push the blocking+waiting off to a separate thread. (leveldb will do what
it can to avoid blocking internally; no reason for us to wait explicitly.)
- check against both front and back cons; either one may have failed.
- close *both* front and back before reopening either. this is
overkill, but slightly simpler code.
- fix leak of con when marking down
- handle race against osdmap update and note_down_osd
Fixes: #5172 Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Tue, 28 May 2013 18:10:05 +0000 (11:10 -0700)]
HashIndex: sync top directory during start_split,merge,col_split
Otherwise, the links might be ordered after the in progress
operation tag write. We need the in progress operation tag to
correctly recover from an interrupted merge, split, or col_split.
Fixes: #5180
Backport: cuttlefish, bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Thu, 23 May 2013 04:34:52 +0000 (21:34 -0700)]
rgw: iterate usage entries from correct entry
Fixes: #5152
When iterating through usage entries, and when user id was
provided, we started at the user's first entry and not from
the entry indexed by the request start time.
This commit fixes the issue.