Sage Weil [Sat, 12 Jan 2013 01:23:22 +0000 (17:23 -0800)]
osdmap: spread replicas across hosts with default crush map
This is more often the case than not, and we don't have a good way to
magically know what size of cluster the user will be creating. Better to
err on the side of doing the right thing for more people.
Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Dan Mick [Fri, 11 Jan 2013 02:44:44 +0000 (18:44 -0800)]
rbd: call udevadm settle on map/unmap
When we map/unmap devices, udev gets called to manage device nodes;
this will allow the command to wait for those manipulations to complete,
particularly for test runs, so that the device tree is stable by the
time the command exits.
--no-settle is also provided to avoid this behavior if desired (say,
for a series of 'map' commands, perhaps the user wants to wait for
settling only on the last of the series).
Fixes: #3635
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Samuel Just [Thu, 10 Jan 2013 19:06:02 +0000 (11:06 -0800)]
config_opts.h: default osd_recovery_delay_start to 0
This setting was intended to prevent recovery from overwhelming peering traffic
by delaying the recovery_wq until osd_recovery_delay_start seconds after pgs
stop being added to it. This should be less necessary now that recovery
messages are sent with strictly lower priority than peering messages.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
Samuel Just [Thu, 10 Jan 2013 03:17:23 +0000 (19:17 -0800)]
ReplicatedPG: fix snapdir trimming
The previous logic was both complicated and not correct. Consequently,
we have been tending to drop snapcollection links in some cases. This
has resulted in clones incorrectly not being trimmed. This patch
replaces the logic with something less efficient but hopefully a bit
clearer.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Tue, 8 Jan 2013 19:21:22 +0000 (11:21 -0800)]
librbd: Allow get_lock_info to fail
If the lock class isn't present, EOPNOTSUPP is returned for lock calls
on newer OSDs, but sadly EIO on older; we need to treat both as
acceptable failures for RBD images. rados lock list will still fail.
Fixes #3744.
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
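The error handling described in this commit might look roughly like the sketch below. It is an illustration only, not the librbd code: `lock_info_failure_is_benign` and `finish_lock_query` are hypothetical names, and the query result is passed in as a plain negative errno.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical helper: does this negative errno mean "lock class not
// present on the OSD" rather than a genuine failure?
bool lock_info_failure_is_benign(int r) {
  // Newer OSDs report -EOPNOTSUPP; older ones report -EIO.
  return r == -EOPNOTSUPP || r == -EIO;
}

// Hypothetical caller: r is the result of the lock-info query.
// Returns 0 when the image can still be used, or the error otherwise.
int finish_lock_query(int r, bool* have_lock_info) {
  *have_lock_info = (r >= 0);
  if (r < 0 && !lock_info_failure_is_benign(r))
    return r;  // propagate genuine errors
  return 0;    // success, or lock class absent: carry on without lock info
}
```

Accepting -EIO has a cost: a real I/O error on the lock query is indistinguishable from a missing lock class on an older OSD, so it is tolerated as well.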
Noah Watkins [Mon, 7 Jan 2013 23:04:33 +0000 (15:04 -0800)]
libcephfs: clarify interface return value
Document that ceph_get_stripe_unit_granularity may return an error code
(e.g. -ENOTCONN). The interface requires a mount, but currently we
return a compile-time constant. Other error codes may be possible in the
future.
Sage Weil [Sun, 6 Jan 2013 16:38:27 +0000 (08:38 -0800)]
msg/Pipe: prepare Message data for wire under pipe_lock
We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.
Sage Weil [Sun, 6 Jan 2013 16:33:01 +0000 (08:33 -0800)]
msgr: update Message envelope in encode, not write_message
Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message(). This removes most modifications from
Pipe::write_message().
Sage Weil [Mon, 7 Jan 2013 20:58:39 +0000 (12:58 -0800)]
osdc/Objecter: fix linger_ops iterator invalidation on pool deletion
The call to check_linger_pool_dne() may unregister the linger request,
invalidating the iterator. To avoid this, increment the iterator at
the top of the loop.
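The iterator pattern described above can be sketched as follows. This is a minimal illustration, not the Objecter code: `LingerMap`, `check_pool_dne`, and `scan_lingers` are hypothetical names, and a plain `std::map` stands in for the real linger_ops container.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the linger request table: id -> pool name.
using LingerMap = std::map<int, std::string>;

// Hypothetical check: unregisters (erases) the request if its pool is gone.
// Erasing invalidates any iterator that still points at that entry.
void check_pool_dne(LingerMap& lingers, int id,
                    const std::set<std::string>& live_pools) {
  auto it = lingers.find(id);
  if (it != lingers.end() && live_pools.count(it->second) == 0)
    lingers.erase(it);
}

// Advance the iterator at the top of the loop, before the body can erase
// the current entry. std::map::erase only invalidates iterators to the
// erased element, so `p` remains valid afterwards.
int scan_lingers(LingerMap& lingers, const std::set<std::string>& live_pools) {
  int visited = 0;
  for (auto p = lingers.begin(); p != lingers.end(); ) {
    int id = p->first;
    ++p;  // step past the entry first
    check_pool_dne(lingers, id, live_pools);
    ++visited;
  }
  return visited;
}
```

Had the increment stayed in the `for` header, it would run on an iterator to an erased node whenever the check unregistered the request.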
Sage Weil [Sun, 6 Jan 2013 16:25:40 +0000 (08:25 -0800)]
msg/Pipe: encode message inside pipe_lock
This modifies bufferlists in the Message struct, and it is possible
for multiple instances of the Pipe to get references on the Message;
make sure they don't modify those bufferlists concurrently.
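A much-simplified sketch of the idea: every mutation of a shared message happens while a lock is held, and the bytes to be written are copied out before the lock is released. `Msg` and `Sender` are hypothetical, and a single shared `std::mutex` stands in for pipe_lock, which glosses over how ownership of a message actually moves between Pipes.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical message whose encode step rewrites its own buffer.
struct Msg {
  std::string payload;
  std::string wire;      // filled in by encode()
  int encode_count = 0;
  void encode() {
    wire = "hdr:" + payload;  // rebuilds the buffer in place
    ++encode_count;
  }
};

// Hypothetical sender; several of these may hold a reference to one Msg.
struct Sender {
  std::mutex& lock;
  explicit Sender(std::mutex& l) : lock(l) {}

  std::string send(Msg& m) {
    std::lock_guard<std::mutex> g(lock);
    m.encode();     // mutate only while holding the lock
    return m.wire;  // copy out stable bytes before unlocking
  }
};
```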
Sage Weil [Sat, 5 Jan 2013 18:39:08 +0000 (10:39 -0800)]
msg/Pipe: associate sending msgs to con inside lock
Associate a sending message with the connection inside the pipe_lock.
This way, if a racing thread tries to steal these messages, it will
be sure to reset the con pointer *after* we do, so that the con
pointer is valid in encode_payload() (and later).
Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit. (And fix one lingering user to use that
helper appropriately.)
Fixes: #3731
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Samuel Just [Fri, 4 Jan 2013 20:43:52 +0000 (12:43 -0800)]
ReplicatedPG: remove old-head optimization from push_to_replica
This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone. In general, using head in clone_subsets is tricky since
we might be writing to head during the push. calc_clone_subsets does not
consider head (probably for this reason). Handling the clone-from-head case
properly would require blocking writes on head in the interim, which is probably
a bad trade-off anyway.
Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.
Fixes: #3698
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 4 Jan 2013 01:15:07 +0000 (17:15 -0800)]
os/FileStore: fix non-btrfs op_seq commit order
The op_seq file is the starting point for journal replay. For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap. We normally ignore current/ contents anyway.
On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).
This fixes a serious bug that could cause data loss and corruption after
a power loss event. For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.
Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
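The non-btrfs ordering described in this commit (full sync, then write op_seq, then fsync it) can be sketched with plain POSIX calls. This is an assumption-laden illustration, not the FileStore code: `commit_op_seq` is a hypothetical helper, `sync()` stands in for whatever full-sync mechanism the store uses, and the fixed-width on-disk format is invented for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: record the committed sequence number.
// Returns 0 on success or a negative errno on failure.
int commit_op_seq(const char* path, unsigned long long seq) {
  // 1. Flush everything applied so far before touching op_seq.
  ::sync();

  // 2. Only then record the new journal replay starting point.
  int fd = ::open(path, O_WRONLY | O_CREAT, 0644);
  if (fd < 0)
    return -errno;

  char buf[32];
  // Fixed width, so a new value always fully overwrites the old one.
  int len = std::snprintf(buf, sizeof(buf), "%020llu\n", seq);

  int r = 0;
  ssize_t w = ::pwrite(fd, buf, len, 0);
  if (w < 0)
    r = -errno;
  else if (w != len)
    r = -EIO;
  // 3. Make the record durable before the caller trims the journal.
  else if (::fsync(fd) < 0)
    r = -errno;

  ::close(fd);
  return r;
}
```

The caller would trim journal entries up to `seq` only after this returns 0. If op_seq reached disk ahead of the data it describes, a power loss could leave replay starting past writes that were never persisted.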
Samuel Just [Thu, 3 Jan 2013 17:59:45 +0000 (09:59 -0800)]
OSD: for old osds, dispatch peering messages immediately
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message. However, old osds assume
that the activation message (log or info) will be _dispatched_
before the first sub_op_modify of the interval. Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 3 Jan 2013 06:20:06 +0000 (22:20 -0800)]
osd: let pgs process map advances before booting
The OSD deliberately consumes and processes most OSDMaps from the time it
was down before it marks itself up, as this can be slow. The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD. The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.
Fix this by calling activate_map() even when booting and, when booting,
draining the peering_wq on each call. This is harmless since we are
not yet processing actual ops; we only need to be async when active.
Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
Sam Lang [Wed, 2 Jan 2013 22:07:13 +0000 (16:07 -0600)]
fuse: Fix cleanup code path on init failure
With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored. Also, if an
error occurs during cfuse.init(), we need to tear down the client
mount.
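A rough sketch of the failure path described here, under stated assumptions: `MountedClient` and `start_fuse` are hypothetical, and the init result is passed in directly as 0 on success or a positive errno value on failure.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for the client whose mount must be undone.
struct MountedClient {
  bool mounted = false;
  void mount() { mounted = true; }
  void unmount() { mounted = false; }
};

// init_result: 0 on success, or a positive errno value on failure.
// Returns 0 on success or a negative errno on failure.
int start_fuse(MountedClient& client, int init_result) {
  client.mount();
  // Test for non-zero: an `init_result < 0` check would silently
  // ignore a positive errno.
  if (init_result != 0) {
    client.unmount();      // tear down the client mount on failure
    return -init_result;   // normalize to a negative errno
  }
  return 0;
}
```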
Josh Durgin [Wed, 2 Jan 2013 22:15:24 +0000 (14:15 -0800)]
librbd: establish watch before reading header
This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.
Josh Durgin [Wed, 2 Jan 2013 20:32:33 +0000 (12:32 -0800)]
Revert "librbd: ensure header is up to date after initial read"
Using assert_version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e1776809031c6dad441cfb2b9fac9612720b9083.