Sage Weil [Thu, 30 May 2013 21:36:41 +0000 (14:36 -0700)]
mon: make compaction bounds overlap
When we trim items N to M, compact over range (N-1) to M so that the
items in the queue will share bounds and get merged. There is no harm in
compacting over a larger range here when the lower bound is a key that
doesn't exist anyway.
Sage Weil [Wed, 29 May 2013 15:35:44 +0000 (08:35 -0700)]
mon/MonitorDBStore: allow compaction of ranges
Allow a transaction to describe the compaction of a range of keys. Do this
in a backward compatible say, such that older code will interpret the
compaction of a prefix + range as compaction of the entire prefix. This
allows us to avoid introducing any new feature bits.
Sage Weil [Tue, 28 May 2013 23:35:55 +0000 (16:35 -0700)]
os/LevelDBStore: do compact_prefix() work asynchronously
We generally do not want to block while compacting a range of leveldb.
Push the blocking+waiting off to a separate thread. (leveldb will do what
it can to avoid blocking internally; no reason for us to wait explicitly.)
Samuel Just [Mon, 15 Apr 2013 23:33:48 +0000 (16:33 -0700)]
PG: don't write out pg map epoch every handle_activate_map
We don't actually need to write out the pg map epoch on every
activate_map as long as:
a) the osd does not trim past the oldest pg map persisted
b) the pg does update the persisted map epoch from time
to time.
To that end, we now keep a reference to the last map persisted.
The OSD already does not trim past the oldest live OSDMapRef.
Second, handle_activate_map will trim if the difference between
the current map and the last_persisted_map is large enough.
Fixes: #4731 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 2c5a9f0e178843e7ed514708bab137def840ab89)
Conflicts:
src/common/config_opts.h
src/osd/PG.cc
- last_persisted_osdmap_ref gets set in the non-static
PG::write_info
Samuel Just [Tue, 21 May 2013 22:22:56 +0000 (15:22 -0700)]
OSDMonitor: skip new pools in update_pools_status() and get_pools_health()
New pools won't be full. mon->pgmon()->pg_map.pg_pool_sum[poolid] will
implicitly create an entry for poolid causing register_new_pgs() to assume that
the newly created pgs in the new pool are in fact a result of a split
preventing MOSDPGCreate messages from being sent out.
Fixes: #4813
Backport: cuttlefish Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0289c445be0269157fa46bbf187c92639a13db46)
Yehuda Sadeh [Thu, 30 May 2013 19:58:11 +0000 (12:58 -0700)]
rgw: only append prefetched data if reading from head
Fixes: #5209
Backport: bobtail, cuttlefish
If the head object wrongfully contains data, but according to the
manifest we don't read from the head, we shouldn't copy the prefetched
data. Also fix the length calculation for that data.
Yehuda Sadeh [Thu, 30 May 2013 16:34:21 +0000 (09:34 -0700)]
rgw: don't copy object idtag when copying object
Fixes: #5204
When copying object we ended up also copying the original
object idtag which overrode the newly generated one. When
refcount put is called with the wrong idtag the count
does't go down.
mon: Monitor: backup monmap using all ceph features instead of quorum's
When a monitor is freshly created and for some reason its initial sync is
aborted, it will end up with an incorrect backup monmap. This monmap is
incorrect in the sense that it will not contain the monitor's names as
it will expect on the next run.
This results from us being using the quorum features to encode the monmap
when backing it up, instead of CEPH_FEATURES_ALL.
Sage Weil [Wed, 29 May 2013 16:49:11 +0000 (09:49 -0700)]
osd: do not assume head obc object exists when getting snapdir
For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists. This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffItersateStress
test when thrashing OSDs.
Fixes: #5183
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 29e4e7e316fe3f3028e6930bb5987cfe3a5e59ab)
Samuel Just [Tue, 28 May 2013 18:10:05 +0000 (11:10 -0700)]
HashIndex: sync top directory during start_split,merge,col_split
Otherwise, the links might be ordered after the in progress
operation tag write. We need the in progress operation tag to
correctly recover from an interrupted merge, split, or col_split.
Fixes: #5180
Backport: cuttlefish, bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bca9c38ef5187c7a97916970a7fa73b342755ac)
mon: Paxos: get rid of the 'prepare_bootstrap()' mechanism
We don't need it after all. If we are in the middle of some proposal,
then we guarantee that said proposal is likely to be retried. If we
haven't yet proposed, then it's forever more likely that a client will
eventually retry the message that triggered this proposal.
Basically, this mechanism attempted at fixing a non-problem, and was in
fact triggering some unforeseen issues that would have required increasing
the code complexity for no good reason.
mon: Paxos: finish queued proposals instead of clearing the list
By finishing these Contexts, we make sure the Contexts they enclose (to be
called once the proposal goes through) will behave as their were initially
planned: for instance, a C_Command() may retry the command if a -EAGAIN
is passed to 'finish_contexts', while a C_Trimmed() will simply set
'going_to_trim' to false.
This aims at fixing at least a bug in which Paxos will stop trimming if an
election is triggered while a trim is queued but not yet finished. Such
happens because it is the C_Trimmed() context that is responsible for
resetting 'going_to_trim' back to false. By clearing all the contexts on
the proposal list instead of finishing them, we stay forever unable to
trim Paxos again as 'going_to_trim' will stay True till the end of time as
we know it.
Yehuda Sadeh [Thu, 23 May 2013 04:34:52 +0000 (21:34 -0700)]
rgw: iterate usage entries from correct entry
Fixes: #5152
When iterating through usage entries, and when user id was
provided, we started at the user's first entry and not from
the entry indexed by the request start time.
This commit fixes the issue.
Sage Weil [Fri, 17 May 2013 03:37:05 +0000 (20:37 -0700)]
sysvinit: fix enumeration of local daemons when specifying type only
- prepend $local to the $allconf list at the top
- remove $local special case for all case
- fix the type prefix checks to explicitly check for prefixes
Fugly bash, but works!
Backport: cuttlefish, bobtail Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit c80c6a032c8112eab4f80a01ea18e1fa2c7aa6ed)
Samuel Just [Mon, 13 May 2013 21:23:00 +0000 (14:23 -0700)]
PG: subset_last_update must be at least log.tail
Fixes: 5020
Backport: bobtail, cuttlefish Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 72bf5f4813c273210b5ced7f7793bc1bf813690c)
Samuel Just [Tue, 14 May 2013 23:35:48 +0000 (16:35 -0700)]
FileJournal: adjust write_pos prior to unlocking write_lock
In committed_thru, we use write_pos to reset the header.start value in cases
where seq is past the end of our journalq. It is therefore important that the
journalq be updated atomically with write_pos (that is, under the write_lock).
The call to align_bl() is moved into do_write in order to ensure that write_pos
is adjusted correctly prior to write_bl().
Also, we adjust pos at the end of write_bl() such that pos \in [get_top(),
header.max_size) after write_bl().
Fixes: #5020 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit eaf3abf3f9a7b13b81736aa558c9084a8f07fdbe)
Josh Durgin [Thu, 16 May 2013 22:28:40 +0000 (15:28 -0700)]
librbd: make image creation defaults configurable
Programs using older versions of the image creation functions can't
set newer parameters like image format and fancier striping.
Setting these options lets them use all the new functionality without
being patched and recompiled to use e.g. rbd_create3().
This is particularly useful for things like qemu-img, which does not
know how to create format 2 images yet.
Sage Weil [Fri, 17 May 2013 01:40:29 +0000 (18:40 -0700)]
udev: install disk/by-partuuid rules
Wheezy's udev (175-7.2) has broken rules for the /dev/disk/by-partuuid/
symlinks that ceph-disk relies on. Install parallel rules that work. On
new udev, this is harmless; old older udev, this will make life better.
Sage Weil [Wed, 8 May 2013 21:54:33 +0000 (14:54 -0700)]
ceph-create-keys: gracefully handle no data from admin socket
Old ceph-mon (prior to 393c9372f82ef37fc6497dd46fc453507a463d42) would
return an empty string and success if the command was not registered yet.
Gracefully handle that case by retrying.
If we still fail to parse, exit entirely with EINVAL.
Fixes: #4952 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@intank.com>
(cherry picked from commit e2528ae42c455c522154c9f68b5032a3362fca8e)
Samuel Just [Tue, 7 May 2013 23:41:22 +0000 (16:41 -0700)]
OSD: handle stray snap collections from upgrade bug
Previously, we failed to clear snap_collections, which causes split to
spawn a bunch of snap collections. In load_pgs, we now clear any such
snap collections and then snap_collections field on the PG itself.
Related: #4927 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8e89db89cb36a217fd97cbc1f24fd643b62400dc)
Samuel Just [Tue, 7 May 2013 23:35:57 +0000 (16:35 -0700)]
PG: clear snap_collections on upgrade
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 252d71a81ef4536830a74897c84a7015ae6ec9fe)
Samuel Just [Tue, 7 May 2013 23:34:57 +0000 (16:34 -0700)]
OSD: snap collections can be ignored on split
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 438d9aa152e546b2008ec355b481df71aa1c51a5)
Sage Weil [Mon, 6 May 2013 18:40:52 +0000 (11:40 -0700)]
ceph-disk: use separate lock files for prepare, activate
Use a separate lock file for prepare and activate to avoid deadlock. This
didn't seem to trigger on all machines, but in many cases, the prepare
process would take the file lock and later trigger a udev event and the
activate would then block on the same lock, either when we explicitly call
'udevadm settle --timeout=10' or when partprobe does it on our behalf
(without a timeout!). Avoid this by using separate locks for prepare
and activate. We only care if multiple activates race; it is
okay for a prepare to be in progress and for an activate to be kicked
off.
Sage Weil [Fri, 3 May 2013 23:20:26 +0000 (16:20 -0700)]
mon: fix init sequence when not daemonizing
We made the common_init_finish and chdir conditional on daemonize in commit 2e0dd5ae6c8751e33d456b2b06c1204b63db959a, breaking init (asok at least)
when -f is specified (as with upstart).
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Fri, 3 May 2013 18:29:24 +0000 (11:29 -0700)]
mon: fork early to avoid leveldb static env state
leveldb has static state that prevents it from recreating its worker thread
after our fork(), even when we close and reopen the database (tsk tsk!).
Avoid this by forking early, before we touch leveldb.
Hide the details in a Preforker class. This is modeled after what
ceph-fuse already does; we should convert it later.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Wed, 1 May 2013 21:59:08 +0000 (14:59 -0700)]
OSD: load_pgs() should fill in start_split honestly
In load_pgs(), we previously called assigned children starting
at the loaded pg created between its stored epoch and the current
osdmap to have that pg as their parent. This is not correct, some
of the children may have been split in subsequent epochs from children
split in earlier epochs. Instead, do each map individually.
Greg Farnum [Wed, 1 May 2013 21:10:31 +0000 (14:10 -0700)]
dumper: fix Objecter locking
Locking expectations changed at some point, and the Dumper wasn't
updated to comply:
1) We need to take the lock for Objecter, as it
doesn't do so on its own any more.
2) We need to drop the lock in several places so that Objecter
can take delivery of messages
Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 1 May 2013 17:57:35 +0000 (10:57 -0700)]
mon/Paxos: update first_committed when we trim
The Paxos::trim() -> ::trim_to() path trims old states but does not
update first_committed. This misinforms later paxos rounds such that
peers think they can participate and end up with COMMIT messages
following the COLLECT/LAST exchange that are for future commits they
can't do anything with and then crash out when they get the BEGIN:
Sage Weil [Wed, 1 May 2013 04:16:16 +0000 (21:16 -0700)]
mon/Paxos: don't ignore peer first_committed
We go to the effort of keeping a map of the peer's first/last committed
so that we can send the right commits during the first phase of paxos,
but we forgot to record the first value. This appears to simply be an
oversight. It is mostly harmless; it just means we send extra states
that the peer already has.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Tue, 30 Apr 2013 22:48:10 +0000 (15:48 -0700)]
OSD: clean up in progress split state on pg removal
There are two cases: 1) The parent pg has not yet initiated the split 2) The
parent pg has initiated the split.
Previously in case 1), _remove_pg left the entry for its children in the
in_progress_splits map blocking subsequent peering attempts.
In case 1), we need to unblock requests on the child pgs for the parent on
parent removal. We don't need to bother waking requests since any requests
received prior to the remove_pg request are necessarily obsolete.
In case 2), we don't need to do anything: the child will complete the split on
its own anyway.
Thus, we now track pending_splits vs in_progress_splits. Children in
pending_splits are in state 1), in_progress_splits in state 2). split_pgs
bumps pgs from pending_splits to in_progress_splits atomically with respect to
_remove_pg since the parent pg lock is held in both places.
Fixes: #4813 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>