Sage Weil [Thu, 20 Oct 2011 22:46:11 +0000 (15:46 -0700)]
osd: PgPriorSet: kill whoami; make PG arg strictly optional
It is only used for the debug output prefix. Make it so we can leave it
out entirely (e.g. for unit tests).
We don't want to, say, pass in the string prefix itself, or else we are
stuck with generating that string even on low debug levels where it won't
be used.
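A minimal sketch of the idea above, assuming a hypothetical FakePG stand-in and a hypothetical PriorSetSketch class (neither is the actual Ceph type): the PG pointer is optional, and the prefix string is only built once the debug level check has passed.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a PG; only the debug prefix matters here.
struct FakePG {
  int id = 0;
  int prefix_calls = 0;  // counts how often the prefix string is built
  std::string gen_prefix() {
    ++prefix_calls;
    std::ostringstream os;
    os << "pg[" << id << "] ";
    return os.str();
  }
};

// Sketch of a prior-set-like object that takes an optional PG pointer.
class PriorSetSketch {
 public:
  explicit PriorSetSketch(FakePG* pg = nullptr) : pg_(pg) {}

  // Returns the formatted line, or an empty string when the message is
  // above the configured debug level. The prefix is generated only if
  // the message will actually be emitted.
  std::string debug(int msg_level, int debug_level,
                    const std::string& msg) const {
    assert(msg_level >= 0 && debug_level >= 0);
    if (msg_level > debug_level) return std::string();
    std::string out = pg_ ? pg_->gen_prefix() : std::string();
    out += "PriorSet: " + msg;
    return out;
  }

 private:
  FakePG* pg_;  // may be null, e.g. in unit tests
};
```

Passing the PG (rather than a ready-made prefix string) is what keeps the low-debug-level path free of string formatting.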
Sage Weil [Tue, 18 Oct 2011 18:40:06 +0000 (11:40 -0700)]
osd: PgPriorSet: restructure lost checks for prior set
When we add down osds to the cur set, we block peering because there
are OSDs that may have data we need and they are not currently up.
When that happens, marking those OSDs as lost may allow peering to
proceed.
Keep an explicit map blocked_by for exactly that set of OSDs (a subset
of cur), and compare lost_by values in prior_set_affected() to that.
Any single OSD from a given interval surviving is sufficient to ensure
that an ACKed write during that interval was committed to disk.
Currently, at least.
In any case, update the prior set calculation to reflect that. Also
make the survival conditional a bit smarter, to include both last_clean
interval (from the OSD's previous interval of up-ness) as well as the
current interval [up_from..up_thru).
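The blocked_by comparison could be sketched roughly as follows. The type aliases, the function name blocked_set_changed, and the convention that a missing entry means "never marked lost" (epoch 0) are all assumptions for illustration, not the actual prior_set_affected() code.

```cpp
#include <cassert>
#include <map>

using epoch_t = unsigned;

// Hypothetical snapshot taken when the prior set is built:
// down OSD id -> the lost epoch recorded for it at that time.
using BlockedBy = std::map<int, epoch_t>;

// Returns true if any OSD we are blocked on has a different lost epoch
// now than when the prior set was computed, meaning the prior set
// should be rebuilt. OSDs outside blocked_by are ignored.
bool blocked_set_changed(const BlockedBy& blocked_by,
                         const std::map<int, epoch_t>& current_lost) {
  for (const auto& entry : blocked_by) {
    auto it = current_lost.find(entry.first);
    // Assumption: an OSD with no entry has never been marked lost.
    epoch_t now = (it == current_lost.end()) ? 0 : it->second;
    if (now != entry.second) return true;
  }
  return false;
}
```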
Sage Weil [Tue, 18 Oct 2011 00:51:53 +0000 (17:51 -0700)]
osd: do not short-cut up_thru update for new PGs
Commit e731885d2550ee985bf875ab5bb5faf28f1693eb made it possible for
a new PG to go active without forcing the OSD's up_thru to update.
This was motivated by the desire for PG creation by radosgw to go
faster. Radosgw no longer creates a pool per bucket, so this is not
useful there, and it is unclear what other application (that is not
abusing rados pools) would need it.
Since it complicates the prior set calculation for dubious reasons,
let's revert it.
Sage Weil [Tue, 18 Oct 2011 00:44:32 +0000 (17:44 -0700)]
osd: PgPriorSet: revert start_since_joining check
Commit 5b78f5db8c200edcc949033e1badae70fecd2e08 added a check to
prevent some sort of badness when osds were marked lost, but I can't
figure out what it was. Remove the check for now until we can
reproduce/observe the badness in practice, and then write a test and
a better motivated/documented fix.
Sage Weil [Tue, 18 Oct 2011 00:02:23 +0000 (17:02 -0700)]
osd: PgPriorSet: do not include UP osds in prior.cur
The up osds are not (directly) relevant since they are not necessarily
members of the PG. We only care about acting OSDs, which may have
committed writes to the PG during this past interval.
The issue: we redo the prior set calculation if the up_thru for these
OSDs changes in the current map, but the prior_set result does not
depend on the current map's up_thru values in any way; it only depends
on the up_thru in the last epoch of each past interval, and that is
fixed in the past.
Sage Weil [Wed, 12 Oct 2011 17:38:23 +0000 (10:38 -0700)]
cephtool: ability to send commands directly to osds
This makes commands beginning with 'tell <target>' magic in that they go
to the given target instead of to the monitor. This is slightly odd, but
I think it gives the most natural interface for the user, with the tool
Doing The Right Thing for you. E.g.,
ceph tell <someone> something (direct to some daemon)
ceph do something (goes to monitor to do X)
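The routing rule can be sketched like this, assuming a hypothetical Route struct and route_command() helper (not the cephtool source): a leading 'tell <target>' is stripped and the rest goes to that target; anything else goes to the monitor unchanged.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result of deciding where a command line should go.
struct Route {
  bool to_monitor;                // true when the monitor handles the command
  std::string target;             // daemon name when to_monitor is false
  std::vector<std::string> args;  // command words to deliver
};

// Commands of the form 'tell <target> ...' go to <target>;
// everything else is sent to the monitor as-is.
Route route_command(const std::vector<std::string>& words) {
  Route r{true, "", words};
  if (words.size() >= 2 && words[0] == "tell") {
    r.to_monitor = false;
    r.target = words[1];
    r.args.assign(words.begin() + 2, words.end());
  }
  return r;
}
```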
Sage Weil [Wed, 12 Oct 2011 00:51:07 +0000 (17:51 -0700)]
msg: add MCommand, MCommandReply message types
These are similar to MMonCommand[Ack], but aren't PaxosServiceMessage
children, don't include the command in the reply (useless), and have a
more generic name.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 14 Oct 2011 23:49:28 +0000 (16:49 -0700)]
filestore: assert on any unexpected error
Right now, the only errors we expect out of the underlying filesystem are
-ENOENT, -ENODATA, or (as a workaround for extN xattr suckage) -ENOSPC
for certain setxattr operations.
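A rough sketch of that policy, with hypothetical helper names and a plain boolean standing in for "certain setxattr operations" (the real filestore decides this per operation):

```cpp
#include <cassert>
#include <cerrno>

// Returns true if result code r (0 or a negative errno) is one the
// caller is prepared to handle. is_setxattr is a simplification of
// "certain setxattr operations" from the commit message.
bool result_is_acceptable(int r, bool is_setxattr) {
  if (r >= 0) return true;  // success
  if (r == -ENOENT || r == -ENODATA) return true;
  if (r == -ENOSPC && is_setxattr) return true;
  return false;
}

// Anything outside the expected set is treated as fatal.
void check_result(int r, bool is_setxattr) {
  assert(result_is_acceptable(r, is_setxattr));
}
```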
Samuel Just [Thu, 13 Oct 2011 21:35:13 +0000 (14:35 -0700)]
PG: Fix log.empty confusion
Previously, log.empty meant that the log.head was eversion_t(). However,
it was in a few places used to mean that log.head == log.tail. Now,
log.empty means log.head == log.tail and log.null() indicates that
log.head is eversion_t().
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
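The new distinction between empty and null, sketched with simplified stand-in types (Eversion and LogSketch are illustrative, not the real eversion_t or PG log):

```cpp
#include <cassert>

// Simplified stand-in for eversion_t: an (epoch, version) pair.
struct Eversion {
  unsigned epoch;
  unsigned long version;
  Eversion(unsigned e = 0, unsigned long v = 0) : epoch(e), version(v) {}
  bool operator==(const Eversion& o) const {
    return epoch == o.epoch && version == o.version;
  }
};

struct LogSketch {
  Eversion head, tail;
  // empty: the log contains no entries (head == tail).
  bool empty() const { return head == tail; }
  // null: head is still the default-constructed version.
  bool null() const { return head == Eversion(); }
};
```

A log whose head and tail are both at, say, (5, 10) is empty but not null, which is the case the old single predicate conflated.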
Interval tree is an optimized data structure for representing and
querying intervals. Elementary intervals are represented as nodes of an
avl tree and the corresponding data is stored on these nodes based on a
concept of span. This representation allows log(n) storage (where n is
the number of data items). The balanced avl tree allows a log(n) query.
The implementation is a template class that is instantiated based on
two parameters: the interval type and the data type.
Signed-off-by: Jojy George Varghese <jvarghese@scalecomputing.com>
Sage Weil [Thu, 13 Oct 2011 16:53:41 +0000 (09:53 -0700)]
osd: bound generate_past_intervals() by oldest map
The oldest osdmap we maintain is a lower bound on last_epoch_clean for the
entire system (assuming the monitor is doing its job right). We can stop
generating past intervals when we hit it.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
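As an illustration only, the bound can be thought of as clamping a backward scan over epochs. The function name, and the choice to combine the PG's own last_epoch_clean with the oldest map via max, are assumptions made for this sketch rather than the actual generate_past_intervals() logic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using epoch_t = unsigned;

// Lists the epochs a backward scan would visit, newest first, stopping
// at whichever bound is higher: the PG's last_epoch_clean or the oldest
// osdmap still available. Returns an empty list if cur is below the bound.
std::vector<epoch_t> epochs_to_scan(epoch_t cur, epoch_t last_epoch_clean,
                                    epoch_t oldest_map) {
  std::vector<epoch_t> out;
  const epoch_t stop = std::max(last_epoch_clean, oldest_map);
  if (cur < stop) return out;
  for (epoch_t e = cur;; --e) {
    out.push_back(e);
    if (e == stop) break;  // checked before decrementing, so no underflow
  }
  return out;
}
```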
Greg Farnum [Wed, 12 Oct 2011 23:37:55 +0000 (16:37 -0700)]
cls_rgw: rewrite rgw_bucket_complete_op to use update.
Unfortunately we can't do multiple writes via the interface -- the
second one will clobber the first one. So use the update functionality
and go through that pain instead.
Greg Farnum [Wed, 5 Oct 2011 20:30:59 +0000 (13:30 -0700)]
objclass: add map interfaces.
Right now, they implement the TMAP functions, plus a few obvious
extras to read/write select keys and the header. In the future it
should be easy to switch them to better mapping implementations.
Sage Weil [Tue, 11 Oct 2011 18:16:20 +0000 (11:16 -0700)]
osd: fix race between op requeueing and _dispatch
If a message is working its way through _dispatch, and another thread
requeues waiting messages under pg->lock (e.g.
osd->take_waiting(waiting_for_active)), the requeued ops are processed
after the one _dispatch() is chewing on, breaking client ordering.
Instead, add a new OSD::requeue_ops() that reinjects ops back into the
op queue by feeding them to the _handle_*() helpers. Those do last minute
checks before enqueuing the ops.
Fixes: #1490 (again)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Mon, 3 Oct 2011 20:29:47 +0000 (13:29 -0700)]
OSD,ReplicatedPG: expire and cleanup unconnected watchers
During handle_notify_timeout or ms_handle_reset, watchers are now marked
unconnected via pg->register_unconnected_watcher. A safe timer event has
been added to trigger OSD::handle_watch_timeout.
remove_watchers_and_notifies (called on role change) cleans up these
events before peering.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Josh Durgin [Thu, 6 Oct 2011 00:07:07 +0000 (17:07 -0700)]
osd, pg: ignore responses to obsolete queries
This adds a query_epoch to notify and log messages, which are
sent in response to queries from the primary during peering. To
guarantee we don't try to process old logs and notifies after
restarting peering, query_epoch is set to the epoch at which the
query was sent. If query_epoch is less than last_peering_reset,
the primary discards the message.
This caused a "bad state machine event" crash in the following
scenario:
1. Primary tells a stray to generate a backlog at epoch 199.
2. The up set changes because a stray goes up.
3. Primary restarts peering at epoch 200.
4. Stray gets new map for epoch 200, sees that acting set did not
change, and sends log to primary.
5. Primary crashes.
Related to #1403, #1449
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
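The discard rule from this commit, sketched with a hypothetical PeerReply struct standing in for the notify and log messages:

```cpp
#include <cassert>

using epoch_t = unsigned;

// Hypothetical stand-in for a notify or log message sent to the primary.
struct PeerReply {
  epoch_t query_epoch;  // epoch at which the originating query was sent
};

// The primary drops replies to queries issued before its most recent
// peering restart.
bool should_discard(const PeerReply& m, epoch_t last_peering_reset) {
  return m.query_epoch < last_peering_reset;
}
```

In the scenario above, the stray's log answers a query sent at epoch 199, while the primary restarted peering at epoch 200, so the reply is discarded instead of reaching the state machine.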