Sage Weil [Wed, 13 Aug 2014 22:05:05 +0000 (15:05 -0700)]
mds/MDSMap: fix incompat version for encoding
Back in 8f7900a09c8e490c9cd3a6f92ed1f0eb1f47f2a9 we added the new fields
before the 'extended' section, which made the encoding incompatible.
Instead, add them at the end--old clients don't care whether the enabled
flag is set or what the 'fs name' is.
Fixes: #8725
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Wed, 13 Aug 2014 17:34:53 +0000 (10:34 -0700)]
osd/ReplicatedPG: only do agent mode calculations for positive values
After a split we can get negative values here. Only do the arithmetic if
we have a valid (positive) value that won't throw the floating point
unit for a loop.
Fixes: #9082
Tested-by: Karan Singh <karan.singh@csc.fi>
Signed-off-by: Sage Weil <sage@redhat.com>
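A minimal sketch of the guard described above, not the actual ReplicatedPG code. The function name `full_micro` and its parameters are hypothetical, and integer arithmetic stands in for whatever calculation the agent really performs; the point is only that the division is skipped when either input is not positive.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how full a cache pool is, in millionths of the
// target. Stats can be transiently negative (for example after a PG
// split), so the arithmetic is skipped unless both inputs are positive.
std::uint64_t full_micro(std::int64_t num_objects,
                         std::int64_t target_objects) {
  if (num_objects <= 0 || target_objects <= 0) {
    return 0;  // no valid value to work with
  }
  // Invariant for the arithmetic below: both operands are positive.
  assert(num_objects > 0 && target_objects > 0);
  return static_cast<std::uint64_t>(num_objects) * 1000000 /
         static_cast<std::uint64_t>(target_objects);
}
```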
Sage Weil [Wed, 13 Aug 2014 15:30:25 +0000 (08:30 -0700)]
osd: fix require_same_peer_instance from fast_dispatch
The mark-down of old peers needs to take the session_dispatch_lock in order
to safely clear the Session ref cycle. However, for fast dispatch callers,
that lock is already held. Pass a flag down from the callers indicating
whether we need to take the additional lock.
Fixes: #9096
Signed-off-by: Sage Weil <sage@redhat.com>
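The flag-passing approach can be sketched roughly as follows. `SessionTable`, `mark_down_session` and `need_lock` are illustrative names, not the OSD's real interfaces; only the lock name `session_dispatch_lock` comes from the commit message.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the OSD's session bookkeeping.
struct SessionTable {
  std::mutex session_dispatch_lock;
  int live_sessions = 1;

  // Clears the session. Callers that already hold session_dispatch_lock
  // (the fast dispatch path) pass need_lock = false so that a
  // non-recursive mutex is never locked twice by the same thread.
  void mark_down_session(bool need_lock) {
    std::unique_lock<std::mutex> guard(session_dispatch_lock,
                                       std::defer_lock);
    if (need_lock) {
      guard.lock();
    }
    assert(!need_lock || guard.owns_lock());
    live_sessions = 0;
  }

  // Regular dispatch: the lock is not held on entry.
  void slow_dispatch() { mark_down_session(true); }

  // Fast dispatch: the lock is already held on entry.
  void fast_dispatch() {
    std::lock_guard<std::mutex> held(session_dispatch_lock);
    mark_down_session(false);
  }
};
```

Passing the flag keeps a single implementation of the mark-down logic instead of duplicating it per caller, at the cost of trusting each caller to report its locking state correctly.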
Samuel Just [Tue, 12 Aug 2014 19:20:28 +0000 (12:20 -0700)]
ReplicatedPG: do not pass cop into C_Copyfrom
We do not know when the objecter will finally let go of this Context. Thus, we
cannot know whether it will happen before the flush, at which point the
object_context held by the cop must have been released.
Also, we simply don't need it: process_copy_chunk already works in terms of the
tid!
Fixes: #8894
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shut down, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Sage Weil [Mon, 11 Aug 2014 03:22:23 +0000 (20:22 -0700)]
msg/Pipe: do not wait for self in Pipe::stop_and_wait()
The fast dispatch code necessitated adding a wait for the fast dispatch
to complete when taking over sockets back in commit 2d5d3097c3998add1061ce253104154d72879237. This included mark_down()
(although I am not certain mark_down was required to fix the previous set
of races).
In any case, if the fast dispatch thread itself tries to mark down its
own connection, it will deadlock in this method waiting for itself to
return and clear reader_dispatching. Skip this wait if we are in fact
the reader thread. This avoids the deadlock.
Alternatively, we could change mark_down() to not use stop_and_wait(), but
I am less clear about the potential races there, so I'm opting for the
minimal (though ugly) fix.
Fixes: #9057
Signed-off-by: Sage Weil <sage@redhat.com>
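A rough sketch of the "skip the wait if we are the reader" idea. `PipeSketch` and `reader_id` are hypothetical; the real Pipe tracks its reader thread differently, and only `stop_and_wait` and `reader_dispatching` are names taken from the commit message.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

struct PipeSketch {
  std::mutex lock;
  std::condition_variable cond;
  bool reader_dispatching = false;
  std::thread::id reader_id;  // thread currently doing fast dispatch

  // Returns true if it waited for the reader, false if the wait was
  // skipped because the caller *is* the reader thread.
  bool stop_and_wait() {
    std::unique_lock<std::mutex> l(lock);
    if (std::this_thread::get_id() == reader_id) {
      // Waiting here would mean waiting for ourselves to return and
      // clear reader_dispatching, which can never happen.
      return false;
    }
    cond.wait(l, [this] { return !reader_dispatching; });
    assert(!reader_dispatching);
    return true;
  }
};
```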
Sage Weil [Sun, 10 Aug 2014 19:15:38 +0000 (12:15 -0700)]
ceph_test_rados_api: fix cleanup of cache pool
We can't simply try to delete everything in there because some items may
be whiteouts. Instead, flush+evict everything, then remove overlay, and
*then* delete what remains.
Fixes: #9055
Signed-off-by: Sage Weil <sage@redhat.com>
OSD: introduce require_up_osd_peer() function for gating replica ops
This checks both that a Message originates from an OSD, and that the OSD
is up in the given map epoch.
We use it in handle_replica_op so that we don't inadvertently add operations
from down peers, who might or might not know it.
Sage Weil [Mon, 4 Aug 2014 21:57:28 +0000 (14:57 -0700)]
osd: reorder OSDService methods under proper dout_prefix macro
The dout_prefix for OSDService uses get_osdmap() to grab a shared_ptr for
the epoch printout. The OSD one does not, and is not safe to run in all
thread contexts.
In particular, update_osd_stat() is run by the heartbeat thread and can
race with the shared_ptr itself being updated with a new map.
Ironically, if this were simply an OSDMap*, there would be no race since
the pointer is a single word and updates atomically.
Fix this, and any similar issues, by moving the OSDService methods up in
OSD.cc so that they use the safe dout macro.
Fixes: #8998
Backport: firefly (in a minimal form, I think!)
Signed-off-by: Sage Weil <sage@redhat.com>
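To illustrate why the log prefix should go through a locked accessor, here is a small sketch. `ServiceSketch`, `MapSketch`, `publish_map` and `epoch_for_log` are illustrative names and should not be read as the real OSDService interface; the assumption is simply that readers take a copy of the shared_ptr under a lock rather than touching the member directly.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

struct MapSketch {
  unsigned epoch;
};

class ServiceSketch {
 public:
  // Safe from any thread: copies the shared_ptr under the lock.
  std::shared_ptr<const MapSketch> get_osdmap() {
    std::lock_guard<std::mutex> l(publish_lock_);
    return osdmap_;
  }

  // Replaces the current map; a published map is never null.
  void publish_map(std::shared_ptr<const MapSketch> m) {
    assert(m);
    std::lock_guard<std::mutex> l(publish_lock_);
    osdmap_ = std::move(m);
  }

  // What a log prefix would call: works on a private copy, so a
  // concurrent publish_map() cannot race with the read.
  unsigned epoch_for_log() {
    std::shared_ptr<const MapSketch> m = get_osdmap();
    return m ? m->epoch : 0;
  }

 private:
  std::mutex publish_lock_;
  std::shared_ptr<const MapSketch> osdmap_;
};
```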
rgw: call processor->handle_data() again if needed
Fixes: #8937
Following the fix to #8928 we end up accumulating pending data that
needs to be written. Beforehand it was working fine because we were
feeding it with the exact amount of bytes we were writing.
Sage Weil [Mon, 4 Aug 2014 01:26:34 +0000 (18:26 -0700)]
msg/SimpleMessenger: drop msgr lock when joining a Pipe
Avoid this deadlock:
- a fault
- delay thread entry gets a fast dispatch message
  - drops delay_lock
  - calls into fast_dispatch
- reaper tries to reap the pipe
  - pipe->join()
    - delay_thread->join()
      - blocks waiting for delay_thread to exit
- delay thread / fast dispatch blocks on msgr->lock trying to mark_down
The solution is to drop the msgr lock while joining the thread. This will
allow the join() to complete. Adjust the reaper thread to recheck the
exit condition since the lock may have been dropped. The other two callers
do not care.
Fixes: #8891
Signed-off-by: Sage Weil <sage@redhat.com>
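The fix can be sketched as below, assuming a much simplified messenger. `MessengerSketch`, `reap_queue` and `reaper` are hypothetical names; the sketch shows only the two ideas from the commit message: release the lock around join(), and re-test the loop condition afterwards because the state may have changed while the lock was dropped.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

struct MessengerSketch {
  std::mutex lock;
  std::deque<std::thread> reap_queue;  // threads waiting to be joined
  bool marked_down = false;

  // Called by a worker thread; needs the messenger lock.
  void mark_down() {
    std::lock_guard<std::mutex> l(lock);
    marked_down = true;
  }

  void reaper() {
    std::unique_lock<std::mutex> l(lock);
    // The condition is re-evaluated on every pass, since the queue may
    // have changed while the lock was released.
    while (!reap_queue.empty()) {
      std::thread t = std::move(reap_queue.front());
      reap_queue.pop_front();
      l.unlock();  // let the worker take the lock in mark_down()
      t.join();    // cannot deadlock: we hold nothing the worker needs
      l.lock();
    }
    assert(l.owns_lock());
  }
};
```

Had `reaper()` kept the lock across `join()`, a worker blocked in `mark_down()` would never exit and the join would never return.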
Sage Weil [Fri, 1 Aug 2014 03:59:49 +0000 (20:59 -0700)]
osd: do not leak Session* ref in _send_boot()
The get_priv() call returns a ref; make sure we drop it if it exists.
This doesn't happen on every run because usually it is NULL and we take
the other path; it's only after the OSD has been marked down that we reach
the second path.
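A toy illustration of the leak and its fix, using a hand-rolled reference count. `SessionSketch`, `ConnectionSketch` and `send_boot_sketch` are hypothetical; the only behaviour assumed from the commit message is that get_priv() hands back a pointer with an extra reference taken.

```cpp
#include <cassert>

struct SessionSketch {
  int nref = 1;  // one reference held by the connection itself
  void get() { ++nref; }
  void put() {
    assert(nref > 0);
    --nref;
  }
};

struct ConnectionSketch {
  SessionSketch* priv = nullptr;

  // Returns the private pointer with an extra reference, or nullptr.
  SessionSketch* get_priv() {
    if (priv) priv->get();
    return priv;
  }
};

// Returns true if an existing session was found.
bool send_boot_sketch(ConnectionSketch& con) {
  SessionSketch* s = con.get_priv();
  if (!s) {
    return false;  // usual case: no session yet, nothing to release
  }
  // ... use the existing session here ...
  s->put();  // drop the reference that get_priv() took
  return true;
}
```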
Sage Weil [Thu, 31 Jul 2014 18:02:55 +0000 (11:02 -0700)]
mon/OSDMonitor: warn when cache pools do not have hit_sets configured
Give users a clue when cache pools are enabled but the hit_set is not
configured. Note that technically this will work, but not well, so for
now let's just steer them away.
Sage Weil [Thu, 31 Jul 2014 16:13:11 +0000 (09:13 -0700)]
osd/ReplicatedPG: check agent_mode if agent is enabled but hit_sets aren't
It is probably not a good idea to try to run the tiering agent without a
hit_set to inform its actions, but it is technically possible. For
example, one could simply blindly evict when we reach the full point.
However, this doesn't work because the agent mode is guarded by a hit_set
check, even though agent_setup() is not. Fix that.
Sage Weil [Wed, 30 Jul 2014 20:40:33 +0000 (13:40 -0700)]
test/cli-integration/rbd: fix trailing space
Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.
Backport: firefly
Fixes: #8920
Signed-off-by: Sage Weil <sage@redhat.com>
John Spray [Tue, 22 Jul 2014 01:08:08 +0000 (02:08 +0100)]
mds: handle replaying old format journals
To get back to the reformatting procedure that otherwise
occurs during MDLog::open, introduce an MDLog::reopen call
that MDS can use in the standbyreplay->standby transition
for the special case where the journal is old.
Fixes: #8869
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray [Mon, 21 Jul 2014 17:50:07 +0000 (18:50 +0100)]
mds: refactor MDS boot
* Make boot_start private.
* Define boot stages in enum, replace int with type.
* Merge steps 0 and 1, since 0 always fell through to 1.
* starting_done was only ever reached by a fall through
from the previous step, so call it directly from there.
John Spray [Thu, 17 Jul 2014 23:44:38 +0000 (00:44 +0100)]
mds: separate inode recovery queue from MDCache
Refactor to:
* have somewhere to put some logic for doing
background recovery in future.
* trim a few lines from the oversized MDCache.cc
wherever we can.
John Spray [Mon, 28 Jul 2014 14:32:12 +0000 (15:32 +0100)]
tools/cephfs: fuller header in dump/undump
There were two problems here:
* write_pos was modified through an undump/dump cycle,
because it was probed during recovery.
* stream format was being forgotten.
Raise the clone probability to 8% and lower the probability of flatten
to 2%. This should give us longer parent chains (before this we would
usually have one parent, and even then only for a few ops).
Truncate base images after they have been cloned from to cover more
code paths and make sure that clients look at snapshot parent_overlap
(i.e. parent_overlap of the base image at the time the snapshot was
taken) and not that of the base image (i.e. parent_overlap of the base
image as of now).
librbd: make rbd_get_parent_info() accept NULL out params
The C++ version of rbd_get_parent_info() allows passing NULL for parent
image name, image name and snapshot name out parameters. Make the C API do
the same both for consistency and to make it easier to check whether
the image at hand has a parent or not.
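A sketch of what NULL-tolerant out parameters might look like. This is not librbd code: `ParentInfoSketch`, `copy_out` and `get_parent_info_sketch` are hypothetical, and the error codes (-ENOENT for no parent, -ERANGE for a short buffer) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>

struct ParentInfoSketch {
  std::string pool_name;
  std::string image_name;
  std::string snap_name;
};

// Copies src into dst unless the caller passed NULL, meaning "not
// interested in this value".
static int copy_out(const std::string& src, char* dst,
                    std::size_t dst_len) {
  if (dst == nullptr) return 0;
  if (src.size() + 1 > dst_len) return -ERANGE;
  std::memcpy(dst, src.c_str(), src.size() + 1);
  return 0;
}

int get_parent_info_sketch(const ParentInfoSketch& p,
                           char* pool, std::size_t pool_len,
                           char* name, std::size_t name_len,
                           char* snap, std::size_t snap_len) {
  if (p.image_name.empty()) return -ENOENT;  // no parent at all
  int r = copy_out(p.pool_name, pool, pool_len);
  if (r < 0) return r;
  r = copy_out(p.image_name, name, name_len);
  if (r < 0) return r;
  return copy_out(p.snap_name, snap, snap_len);
}
```

With this shape, a caller that only wants to know whether a parent exists can pass NULL for all three buffers and inspect the return code.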
Currently, for pools with different rules, "ceph df" cannot report the
correct available space for each of them. For a detailed assessment
of the bug, please refer to bug report #8943.
This patch fixes the bug and makes "ceph df" work correctly.