Sage Weil [Tue, 21 Feb 2012 05:11:46 +0000 (21:11 -0800)]
osd: refuse to return data payload if request wrote anything
Write operations aren't allowed to return a data payload because
we can't do so reliably. If the client has to resend the request
and it has already been applied, we will return 0 with no
payload. Non-deterministic behavior is no good.
See #1765.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
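A minimal sketch of the check in C++ (the per-op flags and return value are
illustrative; the real OSD derives them from the op codes in the request):

    #include <vector>

    // Illustrative per-op flags; real code inspects the op code instead.
    struct OSDOp {
      bool is_write;  // modifies the object
      bool is_read;   // wants to return a data payload
    };

    // Refuse requests that both write and return data: if the client
    // resends a write that was already applied, we reply 0 with no
    // payload, so the payload cannot be produced reliably.
    int validate_ops(const std::vector<OSDOp>& ops) {
      bool wrote = false, reads = false;
      for (const auto& op : ops) {
        wrote |= op.is_write;
        reads |= op.is_read;
      }
      return (wrote && reads) ? -22 /* -EINVAL */ : 0;
    }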
Sage Weil [Sun, 19 Feb 2012 06:17:35 +0000 (22:17 -0800)]
osd: fix up argument to PG::init()
Commit cefa55b288b40e17ade9875493dd94de52ac22bf moved PG initialization
into init(), but passed acting for both the up and acting args. This led
to confusion between primary and replica.
Also fix debug print so that the output is useful.
Fixes: #2075, #2070
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
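A minimal sketch of the fix (the signature is pared down; the real
PG::init() takes more arguments):

    #include <vector>

    struct PG {
      std::vector<int> up;      // who the OSDMap says should serve the PG
      std::vector<int> acting;  // who is actually serving it right now
      void init(const std::vector<int>& up_osds,
                const std::vector<int>& acting_osds) {
        up = up_osds;
        acting = acting_osds;
      }
    };

    void example(PG& pg, const std::vector<int>& up,
                 const std::vector<int>& acting) {
      pg.init(up, acting);        // correct: pass both sets
      // pg.init(acting, acting); // the bug: up and acting conflated,
      //                          // confusing primary and replica roles
    }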
Sage Weil [Sat, 18 Feb 2012 00:23:50 +0000 (16:23 -0800)]
osd: only complete/deregister repop once
It's now possible to send the ack and deregister the repop before the
op_applied() happens. And when that happens, we'll call eval_repop() once
more. Don't do anything in that case.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
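A minimal sketch of the guard (field and function names are illustrative):

    // Once the ack is sent and the repop deregistered, a later
    // op_applied() -> eval_repop() call must be a no-op.
    struct RepGather {
      bool done = false;  // set when the repop is completed/deregistered
    };

    void eval_repop(RepGather* repop) {
      if (repop->done)
        return;  // already completed once; do nothing the second time
      // ... normal ack/commit bookkeeping; on completion, send replies,
      // deregister, and set repop->done = true ...
    }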
Florian Haas [Fri, 17 Feb 2012 20:15:15 +0000 (21:15 +0100)]
doc: fix snapshot creation/deletion syntax in rbd man page (trivial)
Creating a snapshot requires using "rbd snap create",
as opposed to just "rbd create". Also, for clarity, note
that removing a snapshot similarly requires "rbd snap rm".
Sage Weil [Fri, 17 Feb 2012 21:48:02 +0000 (13:48 -0800)]
osd: make op_commit imply op_applied for purposes of repop completion
For repop completion, we want waitfor_ack and _commit to be empty. For
replicas, a commit reply implies ack, so ack is always a subset of commit.
But for the local write, we wait for applied separately, so a repop can
stay open, consuming memory, after we have already sent the reply to the
client, and generate 'old request' warnings in the logs (when the
filestore is taking a long time to apply to the fs).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
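A minimal sketch of the completion predicate this implies (illustrative
fields, not the exact RepGather layout):

    #include <set>

    struct RepGather {
      std::set<int> waitfor_ack;     // replicas we still need an ack from
      std::set<int> waitfor_commit;  // replicas we still need a commit from
      bool op_applied = false;       // local write applied to the fs
      bool op_committed = false;     // local write committed to the journal
    };

    bool repop_complete(const RepGather& r) {
      // A replica's commit implies its ack, so waiting on commits covers
      // acks. Treating the local commit as implying applied lets the repop
      // close when the client reply goes out, instead of lingering while
      // the filestore slowly applies to the fs.
      return r.waitfor_ack.empty() && r.waitfor_commit.empty() &&
             (r.op_committed || r.op_applied);
    }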
Sage Weil [Fri, 17 Feb 2012 21:19:57 +0000 (13:19 -0800)]
osd: refactor recovery completion
- rename is_all_uptodate() -> needs_recovery(), reverse logic.
- drop up != acting check; that has nothing to do with
recovery itself
- drop trigger in Active::react(const ActMap&)... it's nonsensical
- CompleteRecovery always leads to finish_recovery (or acting set change)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
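A minimal sketch of the reversed predicate (the body is illustrative,
assuming per-peer missing counts):

    #include <map>

    struct PGStub {
      std::map<int, unsigned> peer_missing_count;  // osd id -> objects missing

      // Replaces is_all_uptodate(): asks the opposite question, so callers
      // read naturally as "if (needs_recovery()) ...".
      bool needs_recovery() const {
        for (const auto& p : peer_missing_count)
          if (p.second > 0)
            return true;
        return false;
      }
    };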
Sage Weil [Fri, 17 Feb 2012 18:55:12 +0000 (10:55 -0800)]
osd: introduce RECOVERING pg state
Since clean now means not degraded, we need some other indication that
recovery has completed and we are "done" (given the current up/down state
of the OSDs).
Adding a 'recovering' state also makes it clearer to users that work is
being done, as opposed to the current situation, where they look for the
absence of 'clean'.
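A minimal sketch of the new state bit (values are illustrative, not
Ceph's actual assignments):

    constexpr unsigned PG_STATE_ACTIVE     = 1u << 0;
    constexpr unsigned PG_STATE_CLEAN      = 1u << 1;
    constexpr unsigned PG_STATE_DEGRADED   = 1u << 2;
    constexpr unsigned PG_STATE_RECOVERING = 1u << 3;  // new: work in progress

    // Users can now watch for RECOVERING directly instead of inferring
    // activity from the absence of CLEAN.
    inline unsigned start_recovery(unsigned s)  { return s | PG_STATE_RECOVERING; }
    inline unsigned finish_recovery(unsigned s) { return s & ~PG_STATE_RECOVERING; }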
Normally we take a fresh map reference in PG::lock(). However,
_activate_committed needs to make sure the map hasn't changed significantly
before acting. In the case of #2068, the OSD map has moved forward and
the mapping has changed, but the PG hasn't processed that yet, and thus
mis-tags the MOSDPGInfo message.
Tag the message with the right epoch, and also pass down the primary's
address so the message is sent to the right location.
Fixes: #2068
Signed-off-by: Sage Weil <sage@newdream.net>
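A minimal sketch of the idea (types pared down, names illustrative):
capture the epoch and the primary's address when the commit is queued,
rather than consulting the possibly-newer map in the callback:

    #include <cstdint>
    #include <string>

    struct PGInfoMsg {
      uint32_t epoch;         // epoch the sender was acting in
      std::string dest_addr;  // primary's address at that epoch
    };

    // Called from the commit callback: use the values captured at queue
    // time, since the OSDMap (and the PG mapping) may have moved forward.
    PGInfoMsg make_pg_info(uint32_t queued_epoch,
                           const std::string& primary_addr) {
      return PGInfoMsg{queued_epoch, primary_addr};
    }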
Sage Weil [Wed, 15 Feb 2012 23:20:35 +0000 (15:20 -0800)]
osd: do not always clear DEGRADED/set CLEAN on recovery finish
Clean means we have exactly the right number of replicas and recovery is
complete. Degraded means we do not have enough replicas, either because
recovery is in progress, or because acting is too small.
A consequence is that if we have a PG with len(up) == 1 but a pg_temp
mapping so that len(acting) == 2, it will be active and not clean.
Fixes: #2060
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
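A minimal sketch of the two predicates (signatures are illustrative):

    #include <cstddef>

    // Clean: exactly the right number of replicas AND recovery complete.
    bool is_clean(std::size_t acting_size, std::size_t pool_size,
                  bool recovery_complete) {
      return recovery_complete && acting_size == pool_size;
    }

    // Degraded: not enough replicas, whether because recovery is still
    // in progress or because the acting set is too small.
    bool is_degraded(std::size_t acting_size, std::size_t pool_size,
                     bool recovery_complete) {
      return !recovery_complete || acting_size < pool_size;
    }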
Holger Macht [Wed, 15 Feb 2012 16:29:09 +0000 (17:29 +0100)]
ceph.spec.in: Move libcls_*.so from -devel to base package
OSDs (src/osd/ClassHandler.cc) specifically look for libcls_*.so in
/usr/$libdir/rados-classes, so libcls_rbd.so and libcls_rgw.so need to
be shipped along with the base package.
Signed-off-by: Holger Macht <hmacht@suse.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Sun, 12 Feb 2012 22:35:03 +0000 (14:35 -0800)]
osd: semi-clean shutdown on signal
Make some effort to stop work in progress, remove pid file, and exit with
informative error code.
Note that this is much simpler than the shutdown() exit path; I'm not sure
whether a complete teardown is useful. It's also difficult to maintain
and get right with everything else going on, and it's not clear that it's
worth the effort right now.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
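A minimal sketch of such a handler (the pid-file path and exit code are
illustrative):

    #include <csignal>
    #include <unistd.h>

    static void handle_fatal_signal(int signum) {
      // Best-effort, async-signal-safe cleanup: drop the pid file and
      // exit with a code that records which signal killed us.
      unlink("/var/run/ceph/osd.pid");
      _exit(128 + signum);  // no full shutdown() teardown
    }

    void install_handlers() {
      std::signal(SIGINT,  handle_fatal_signal);
      std::signal(SIGTERM, handle_fatal_signal);
    }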
Samuel Just [Mon, 13 Feb 2012 19:49:42 +0000 (11:49 -0800)]
ReplicatedPG: refactor push and pull
Now, push progress is represented by ObjectRecoveryProgress. In
particular, rather than tracking data_subset_*ing, we track the furthest
offset before which the data will be consistent once cloning is complete.
sub_op_push now separates the pull response implementation from the
replica push implementation.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
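A minimal sketch of the progress structure (only the struct name comes
from the commit; the fields are illustrative):

    #include <cstdint>

    struct ObjectRecoveryProgress {
      // Everything before this offset will be consistent once cloning
      // completes -- replacing the old data_subset_*ing bookkeeping.
      uint64_t consistent_to = 0;
      bool data_complete = false;
    };

    inline void record_push(ObjectRecoveryProgress& p, uint64_t pushed_to) {
      if (pushed_to > p.consistent_to)
        p.consistent_to = pushed_to;
    }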
Sage Weil [Mon, 13 Feb 2012 19:27:11 +0000 (11:27 -0800)]
add CEPH_FEATURE_OSDENC
Require it for osd <-> osd and osd <-> mon communication.
This covers all the new encoding changes, except hobject_t, which is used
between the rados command line tool and the OSD for an object listing
position marker. We can't distinguish between specific types of clients,
though, and we don't want to introduce any incompatibility with other
clients, so we'll just have to make do here. :(
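A minimal sketch of the feature gate (the bit value shown is illustrative):

    #include <cstdint>

    constexpr uint64_t CEPH_FEATURE_OSDENC = 1ull << 13;  // illustrative

    // Applied to osd <-> osd and osd <-> mon connections: peers lacking
    // the bit can't decode the new encodings, so refuse to talk to them.
    bool peer_supported(uint64_t peer_features) {
      return (peer_features & CEPH_FEATURE_OSDENC) != 0;
    }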
Samuel Just [Sun, 12 Feb 2012 01:53:47 +0000 (17:53 -0800)]
ReplicatedPG: is_degraded may return true for backfill
If is_degraded returns true for backfill, the object may not be
in any replica's missing set. Only call start_recovery_op if
we actually started an op. This bug could cause the PG to get
stuck in backfill.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Samuel Just [Sun, 12 Feb 2012 01:53:47 +0000 (17:53 -0800)]
ReplicatedPG: is_degraded may return true for backfill
If is_degraded returns true for backfill, the object may not be
in any replica's missing set. Only call start_recovery_op if
we actually started an op. This bug could cause the PG to get
stuck in backfill.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
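A minimal sketch of the guard both patches describe (helper names are
illustrative):

    struct RecoveryStub {
      int recovery_ops_active = 0;
      void start_recovery_op() { ++recovery_ops_active; }

      void recover_object(bool in_some_missing_set) {
        if (!in_some_missing_set)
          return;             // degraded only for backfill: no op to issue
        start_recovery_op();  // count only ops we actually started, or the
                              // PG can wedge "stuck in backfill"
        // ... issue the pull/push ...
      }
    };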
Sage Weil [Mon, 13 Feb 2012 19:06:34 +0000 (11:06 -0800)]
osd: remove peer_stat from MOSDOp entirely
We haven't used this feature for years and years, and don't plan to. It
was there to facilitate "read shedding", where the primary OSD would
forward a read request to a replica. However, replicas can't reply back
to the client in that case because OSDs don't initiate connections (they
used to).
Rip this out for now, especially since osd_peer_stat_t just changed.
Sage Weil [Mon, 13 Feb 2012 02:08:34 +0000 (18:08 -0800)]
osd: protect per-pg heartbeat peers with inner lock
Currently we update the overall heartbeat peers by looking directly at
per-pg state. This is potentially problematic now (#2033), and definitely
so in the future when we push more peering operations into the work queues.
Create a per-pg set of peers, protected by an inner lock, and update it
using PG::update_heartbeat_peers() when appropriate under pg->lock. Then
aggregate it into the osd peer list in OSD::update_heartbeat_peers() under
osd_lock and the inner lock.
We could probably have re-used osd->heartbeat_lock instead of adding a
new pg->heartbeat_peer_lock, but the finer locking can't hurt.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
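A minimal sketch of the two-level locking scheme (member names are
illustrative):

    #include <mutex>
    #include <set>

    struct PGStub {
      std::mutex heartbeat_peer_lock;  // inner lock
      std::set<int> heartbeat_peers;   // updated under pg->lock when peering

      void update_heartbeat_peers(const std::set<int>& peers) {
        std::lock_guard<std::mutex> l(heartbeat_peer_lock);
        heartbeat_peers = peers;
      }
    };

    void aggregate(std::set<int>& osd_peers, PGStub& pg) {
      // caller holds osd_lock; never reaches into peering state directly
      std::lock_guard<std::mutex> l(pg.heartbeat_peer_lock);
      osd_peers.insert(pg.heartbeat_peers.begin(), pg.heartbeat_peers.end());
    }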
Sage Weil [Sun, 12 Feb 2012 05:47:42 +0000 (21:47 -0800)]
osd: flush pg on activate _after_ we queue our transaction
We recently added a flush on activate, but we are still building the
transaction (the caller queues it), so calling osr.flush() here is totally
useless.
Instead, set a flag 'need_flush', and do the flush the next time we receive
some work.
This has the added benefit of doing the flush in the worker thread, outside
of osd_lock.
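A minimal sketch of the deferred flush (names are illustrative):

    struct PGStub {
      bool need_flush = false;

      void on_activate() {
        need_flush = true;   // transaction not yet queued; flushing now
                             // would be useless
      }

      template <typename Sequencer>
      void do_request(Sequencer& osr) {
        if (need_flush) {
          osr.flush();       // transaction queued by now; runs in the
          need_flush = false; // worker thread, outside osd_lock
        }
        // ... handle the request ...
      }
    };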
Sage Weil [Sun, 12 Feb 2012 05:24:54 +0000 (21:24 -0800)]
filestore: make flush() block forever if blackholed
If we are blackholing the disk, we need to make flush() wait forever, or
else the flush() logic will return (the IO wasn't queued!) and higher
layers will continue and (eventually) misbehave.
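A minimal sketch of the behavior (the blackhole flag is illustrative):

    #include <condition_variable>
    #include <mutex>

    struct FileStoreStub {
      bool blackhole = false;  // the IO is being discarded, never queued
      std::mutex m;
      std::condition_variable cv;

      void flush() {
        std::unique_lock<std::mutex> l(m);
        if (blackhole)
          cv.wait(l, [] { return false; });  // block forever: a normal
                                             // return would let higher
                                             // layers proceed and misbehave
        // ... normal flush: wait for queued IO to complete ...
      }
    };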