Sage Weil [Thu, 30 May 2013 21:36:41 +0000 (14:36 -0700)]
mon: make compaction bounds overlap
When we trim items N to M, compact over range (N-1) to M so that the
items in the queue will share bounds and get merged. There is no harm in
compacting over a larger range here when the lower bound is a key that
doesn't exist anyway.
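A minimal sketch of why a shared bound helps, assuming a hypothetical compaction queue that merges a new range into the last queued one when the two touch or overlap. The names `Range`, `queue_compact` and `trim` are illustrative and are not taken from the monitor code, and keys are modeled as integers rather than the store's string keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model: a queued compaction covers keys [first, second].
using Range = std::pair<uint64_t, uint64_t>;

// Append a range, merging it into the last queued range when the two
// share a bound or overlap. Disjoint ranges are queued separately.
void queue_compact(std::vector<Range>& q, Range r) {
  assert(r.first <= r.second);
  if (!q.empty() && r.first <= q.back().second &&
      q.back().first <= r.second) {
    q.back().first = std::min(q.back().first, r.first);
    q.back().second = std::max(q.back().second, r.second);
    return;
  }
  q.push_back(r);
}

// Trim items n..m and queue the compaction. With overlap enabled the
// lower bound is n-1, which is the upper bound of the previous trim.
void trim(std::vector<Range>& q, uint64_t n, uint64_t m, bool overlap) {
  uint64_t lo = (overlap && n > 0) ? n - 1 : n;
  queue_compact(q, {lo, m});
}
```

In this model, trimming 10..20 and then 21..30 queues two separate ranges without the overlap, and a single merged range 9..30 with it.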
mon: Monitor: backup monmap using all ceph features instead of quorum's
When a monitor is freshly created and for some reason its initial sync is
aborted, it will end up with an incorrect backup monmap. The monmap is
incorrect in that it does not contain the monitor names that the monitor
will expect on its next run.
This happens because we use the quorum features to encode the monmap
when backing it up, instead of CEPH_FEATURES_ALL.
Fixes: #5203
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Wed, 29 May 2013 16:49:11 +0000 (09:49 -0700)]
osd: do not assume head obc object exists when getting snapdir
For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists. This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffIterateStress
test when thrashing OSDs.
Fixes: #5183
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Wed, 29 May 2013 15:35:44 +0000 (08:35 -0700)]
mon/MonitorDBStore: allow compaction of ranges
Allow a transaction to describe the compaction of a range of keys. Do this
in a backward compatible way, such that older code will interpret the
compaction of a prefix + range as compaction of the entire prefix. This
allows us to avoid introducing any new feature bits.
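The compatibility idea can be sketched as follows, assuming a hypothetical op that always carries a prefix and optionally carries range bounds. `CompactOp` and both helper functions are illustrative names, not the MonitorDBStore types, and the real on-wire encoding is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical transaction op; field names are illustrative only.
struct CompactOp {
  std::string prefix;
  std::string start;  // empty in ops from older writers
  std::string end;    // empty in ops from older writers
};

// What an older reader would do: it only knows about the prefix, so a
// prefix + range op degrades to compacting the entire prefix.
std::string describe_old(const CompactOp& op) {
  assert(!op.prefix.empty());
  return "compact prefix " + op.prefix;
}

// What a newer reader would do: honor the range when one is present.
std::string describe_new(const CompactOp& op) {
  assert(!op.prefix.empty());
  if (op.start.empty() && op.end.empty())
    return "compact prefix " + op.prefix;
  return "compact " + op.prefix + " [" + op.start + ", " + op.end + "]";
}
```

Because the older interpretation does more work than requested but never less, mixed-version peers stay correct without negotiating a feature bit.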
Sage Weil [Tue, 28 May 2013 23:35:55 +0000 (16:35 -0700)]
os/LevelDBStore: do compact_prefix() work asynchronously
We generally do not want to block while compacting a range of leveldb.
Push the blocking+waiting off to a separate thread. (leveldb will do what
it can to avoid blocking internally; no reason for us to wait explicitly.)
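A minimal sketch of handing blocking work to a separate thread, assuming a hypothetical `AsyncCompactor` class. The callback stands in for the blocking compaction call; none of these names come from LevelDBStore itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical sketch: callers enqueue a prefix and return immediately;
// a worker thread runs the (blocking) compaction callback.
class AsyncCompactor {
 public:
  explicit AsyncCompactor(std::function<void(const std::string&)> compact)
      : compact_(std::move(compact)) {
    assert(compact_);
    worker_ = std::thread([this] { run(); });
  }

  ~AsyncCompactor() {
    {
      std::lock_guard<std::mutex> l(lock_);
      stop_ = true;
    }
    cond_.notify_one();
    worker_.join();  // the worker drains queued requests before exiting
  }

  // Non-blocking: record the request and wake the worker.
  void compact_prefix_async(const std::string& prefix) {
    {
      std::lock_guard<std::mutex> l(lock_);
      queue_.push_back(prefix);
    }
    cond_.notify_one();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> l(lock_);
    while (true) {
      cond_.wait(l, [this] { return stop_ || !queue_.empty(); });
      if (queue_.empty())
        return;  // only reached once stop_ is set
      std::string prefix = std::move(queue_.front());
      queue_.pop_front();
      l.unlock();
      compact_(prefix);  // slow work runs without holding the lock
      l.lock();
    }
  }

  std::function<void(const std::string&)> compact_;
  std::mutex lock_;
  std::condition_variable cond_;
  std::deque<std::string> queue_;
  bool stop_ = false;
  std::thread worker_;
};
```

The design choice in this sketch is that the destructor waits for queued requests to complete; a real store might prefer to drop pending compactions on shutdown.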
- check against both front and back cons; either one may have failed.
- close *both* front and back before reopening either. this is
overkill, but slightly simpler code.
- fix leak of con when marking down
- handle race against osdmap update and note_down_osd
Fixes: #5172
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Tue, 28 May 2013 18:10:05 +0000 (11:10 -0700)]
HashIndex: sync top directory during start_split,merge,col_split
Otherwise, the links might be ordered after the in progress
operation tag write. We need the in progress operation tag to
correctly recover from an interrupted merge, split, or col_split.
Fixes: #5180
Backport: cuttlefish, bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Thu, 23 May 2013 04:34:52 +0000 (21:34 -0700)]
rgw: iterate usage entries from correct entry
Fixes: #5152
When iterating through usage entries, and when user id was
provided, we started at the user's first entry and not from
the entry indexed by the request start time.
This commit fixes the issue.
Sage Weil [Wed, 22 May 2013 15:44:52 +0000 (08:44 -0700)]
osd: ping both front and back interfaces
Send ping requests to both the front and back hb addrs for peer osds. If
the front hb addr is not present, do not send it and interpret a reply
as coming from both. This handles the transition from old to new OSDs
seamlessly.
Note both the front and back rx times. Both need to be up to date in order
for the peer to be healthy.
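The health rule described above can be sketched like this, assuming a hypothetical `HeartbeatPeer` record with one receive timestamp per interface. Field and function names are illustrative and do not mirror the OSD's actual heartbeat structures.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical per-peer heartbeat state; field names are illustrative.
struct HeartbeatPeer {
  bool has_front = true;  // false for older peers with no front hb addr
  std::time_t last_rx_front = 0;
  std::time_t last_rx_back = 0;
};

// Record a ping reply. A peer without a front address only replies on
// the back interface, so that reply is credited to both timestamps.
void note_reply(HeartbeatPeer& p, bool from_front, std::time_t now) {
  assert(p.has_front || !from_front);
  if (!p.has_front) {
    p.last_rx_front = p.last_rx_back = now;
  } else if (from_front) {
    p.last_rx_front = now;
  } else {
    p.last_rx_back = now;
  }
}

// Healthy only if both interfaces were heard from after the cutoff.
bool is_healthy(const HeartbeatPeer& p, std::time_t cutoff) {
  return p.last_rx_front > cutoff && p.last_rx_back > cutoff;
}
```

With this model a new peer that answers only on the back interface is not considered healthy, while an old peer with no front address is judged on its single reply.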
Sage Weil [Wed, 22 May 2013 21:29:37 +0000 (14:29 -0700)]
messages/MOSDMarkMeDown: fix uninit field
Fixes valgrind warning:
==14803== Use of uninitialised value of size 8
==14803== at 0x12E7614: sctp_crc32c_sb8_64_bit (sctp_crc32.c:567)
==14803== by 0x12E76F8: update_crc32 (sctp_crc32.c:609)
==14803== by 0x12E7720: ceph_crc32c_le (sctp_crc32.c:733)
==14803== by 0x105085F: ceph::buffer::list::crc32c(unsigned int) (buffer.h:427)
==14803== by 0x115D7B2: Message::calc_front_crc() (Message.h:441)
==14803== by 0x1159BB0: Message::encode(unsigned long, bool) (Message.cc:170)
==14803== by 0x1323934: Pipe::writer() (Pipe.cc:1524)
==14803== by 0x13293D9: Pipe::Writer::entry() (Pipe.h:59)
==14803== by 0x120A398: Thread::_entry_func(void*) (Thread.cc:41)
==14803== by 0x503BE99: start_thread (pthread_create.c:308)
==14803== by 0x6C6E4BC: clone (clone.S:112)
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Tue, 21 May 2013 22:22:56 +0000 (15:22 -0700)]
OSDMonitor: skip new pools in update_pools_status() and get_pools_health()
New pools won't be full. mon->pgmon()->pg_map.pg_pool_sum[poolid] will
implicitly create an entry for poolid causing register_new_pgs() to assume that
the newly created pgs in the new pool are in fact a result of a split
preventing MOSDPGCreate messages from being sent out.
Fixes: #4813
Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
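The implicit entry creation mentioned in this commit is standard `std::map::operator[]` behavior. A minimal sketch, assuming a hypothetical `PoolSum` type in place of the real per-pool statistics; the commit itself skips new pools, and the `find()` variant here is only one way to read without inserting.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for per-pool statistics.
struct PoolSum {
  uint64_t bytes = 0;
};

// Reading with operator[] inserts a default entry for unknown pools.
uint64_t bytes_with_insert(std::map<int64_t, PoolSum>& sums, int64_t pool) {
  return sums[pool].bytes;
}

// Reading with find() leaves the map untouched; unknown pools report
// false so the caller can skip them.
bool bytes_if_known(const std::map<int64_t, PoolSum>& sums, int64_t pool,
                    uint64_t* out) {
  assert(out != nullptr);
  auto it = sums.find(pool);
  if (it == sums.end())
    return false;
  *out = it->second.bytes;
  return true;
}
```

Any later code that treats "an entry exists" as meaningful, as register_new_pgs() does per the message above, is affected by the first form but not the second.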
mon: Paxos: get rid of the 'prepare_bootstrap()' mechanism
We don't need it after all. If we are in the middle of some proposal,
then said proposal is likely to be retried. If we haven't yet proposed,
then it is all the more likely that a client will eventually retry the
message that triggered this proposal.
Basically, this mechanism attempted to fix a non-problem, and was in
fact triggering some unforeseen issues that would have required increasing
the code complexity for no good reason.
Fixes: #5102
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: Paxos: finish queued proposals instead of clearing the list
By finishing these Contexts, we make sure the Contexts they enclose (to be
called once the proposal goes through) will behave as they were initially
planned: for instance, a C_Command() may retry the command if a -EAGAIN
is passed to 'finish_contexts', while a C_Trimmed() will simply set
'going_to_trim' to false.
This aims at fixing at least one bug, in which Paxos will stop trimming if
an election is triggered while a trim is queued but not yet finished. This
happens because it is the C_Trimmed() context that is responsible for
resetting 'going_to_trim' back to false. By clearing all the contexts on
the proposal list instead of finishing them, we stay forever unable to
trim Paxos again, as 'going_to_trim' will stay true till the end of time as
we know it.
Fixes: #4895
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
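The difference between clearing and finishing queued callbacks can be sketched as follows, assuming a hypothetical `Callback` type in place of Ceph's Context and a hypothetical `finish_all` helper in place of 'finish_contexts'.

```cpp
#include <cassert>
#include <functional>
#include <list>

// Hypothetical stand-in for a queued completion callback.
using Callback = std::function<void(int)>;

// Run every queued callback with the given result and leave the list
// empty. Unlike clear(), each callback gets a chance to reset state.
void finish_all(std::list<Callback>& queued, int result) {
  std::list<Callback> ls;
  ls.swap(queued);  // callbacks may queue new work while we iterate
  for (auto& cb : ls) {
    assert(cb);
    cb(result);
  }
}
```

If a queued callback is the only place a flag such as 'going_to_trim' is reset, calling `clear()` on the list leaves the flag set, while `finish_all` lets the callback run and reset it.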
Danny Al-Gaaf [Wed, 22 May 2013 15:28:06 +0000 (17:28 +0200)]
mds/Migrator.cc: fix possible dereference NULL return value
CID 716997 (#1 of 1): Dereference null return value (NULL_RETURNS)
dereference: Dereferencing a pointer that might be null "in" when
calling "MDSCacheObject::is_auth() const".
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Wed, 22 May 2013 15:25:16 +0000 (17:25 +0200)]
mds/Migrator.cc: fix possible dereference NULL return value
CID 716998 (#1 of 2): Dereference null return value (NULL_RETURNS)
dereference: Dereferencing a pointer that might be null "in" when
calling "operator <<(std::ostream &, CInode &)".
CID 716998 (#2 of 2): Dereference null return value (NULL_RETURNS)
dereference: Dereferencing a pointer that might be null "in" when
calling "MDCache::add_replica_dir(ceph::buffer::list::iterator &,
CInode *, int, std::list<Context *, std::allocator<Context *> > &)".
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>