Samuel Just [Fri, 4 Jan 2013 20:43:52 +0000 (12:43 -0800)]
ReplicatedPG: remove old-head optimization from push_to_replica
This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone. In general, using head in clone_subsets is tricky since
we might be writing to head during the push. calc_clone_subsets does not
consider head (probably for this reason). Handling the clone-from-head case
properly would require blocking writes on head in the interim, which is probably
a bad trade-off anyway.
Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.
Fixes: #3698 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e89b6ade63cdad315ab754789de24008cfe42b37)
Sage Weil [Fri, 4 Jan 2013 01:15:07 +0000 (17:15 -0800)]
os/FileStore: fix non-btrfs op_seq commit order
The op_seq file is the starting point for journal replay. For stable btrfs
commit mode, which uses a snapshot as a reference, we should write this
file before we take the snap. We normally ignore current/ contents anyway.
On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).
This fixes a serious bug that could cause data loss and corruption after
a power loss event. For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.
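As a rough, Linux-specific sketch of that ordering (the helper name and file descriptors are hypothetical, not the actual FileStore code):

    #include <cinttypes>
    #include <cstdio>
    #include <unistd.h>

    // Sketch only: on non-btrfs, sync the data first, then persist and fsync
    // the op_seq marker, and only then treat the journal as trimmable.
    void commit_marker_sketch(int basedir_fd, int op_seq_fd, uint64_t committed_seq) {
      ::syncfs(basedir_fd);                                 // 1. full sync of current/
      char buf[40];
      int len = snprintf(buf, sizeof(buf), "%" PRIu64 "\n", committed_seq);
      ::pwrite(op_seq_fd, buf, len, 0);                     // 2. record the committed seq
      ::fsync(op_seq_fd);                                   // 3. make the marker durable
      // 4. only now is it safe to trim journal entries <= committed_seq
    }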
Fixes: #3721 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 28d59d374b28629a230d36b93e60a8474c902aa5)
Samuel Just [Thu, 3 Jan 2013 17:59:45 +0000 (09:59 -0800)]
OSD: for old osds, dispatch peering messages immediately
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message. However, old osds assume
that the activation message (log or info) will be dispatched
before the first sub_op_modify of the interval. Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.
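A toy sketch of that ordering, with made-up types and a made-up feature flag (the real OSD code and feature bits differ):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Message { std::string what; };
    struct Peer    { bool has_new_peering_feature; };

    // While the pg lock is still held: flush queued peering messages to old
    // peers immediately; new peers keep batching until the end of
    // process_peering_events.
    void flush_for_old_peers(std::map<int, Peer>& peers,
                             std::map<int, std::vector<Message>>& pending) {
      for (auto& [osd, msgs] : pending) {
        if (peers[osd].has_new_peering_feature)
          continue;                                  // safe to keep batching
        for (auto& m : msgs)
          std::cout << "send to osd." << osd << ": " << m.what << "\n";
        msgs.clear();                                // sent before the pg lock drops
      }
    }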
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4ae4dce5c5bb547c1ff54d07c8b70d287490cae9)
Sage Weil [Thu, 3 Jan 2013 06:20:06 +0000 (22:20 -0800)]
osd: let pgs process map advances before booting
The OSD deliberately consumes and processes most of the OSDMaps from while it
was down before it marks itself up, as this can be slow. The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD. The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.
Fix this by calling activate_map() even when booting, and when booting
draining the peering_wq on each call. This is harmless since we are
not yet processing actual ops; we only need to be async when active.
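A minimal sketch of that behavior, using stand-in types (not the real OSD or work-queue classes):

    #include <deque>
    #include <functional>

    struct WorkQueueSketch {
      std::deque<std::function<void()>> items;
      void queue(std::function<void()> f) { items.push_back(std::move(f)); }
      void drain() { while (!items.empty()) { items.front()(); items.pop_front(); } }
    };

    struct OsdSketch {
      WorkQueueSketch peering_wq;
      bool booting = true;

      void activate_map() {
        // queue per-PG work to advance to the new map (stubbed out here)
        peering_wq.queue([] { /* pg->handle_advance_map(...) */ });
      }

      void consume_map() {
        activate_map();          // always queue the advances, even while booting
        if (booting)
          peering_wq.drain();    // catch up synchronously before marking ourselves up
      }
    };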
Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]
log: broadcast cond signals
We were using a single cond, and only signalling one waiter. That means
that if the flusher and several logging threads are waiting, and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.
Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.
Instead, break the single cond into two: one for loggers, and one for the
flusher. Always signal the (one) flusher, and always broadcast to all
loggers.
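A simplified sketch of the two-condition scheme, written with std::condition_variable rather than the actual log code:

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <string>

    struct LogSketch {
      std::mutex lock;
      std::condition_variable cond_loggers;   // loggers wait here when the queue is full
      std::condition_variable cond_flusher;   // the single flusher waits here
      std::deque<std::string> q;
      size_t max = 100;

      void submit_entry(std::string e) {
        std::unique_lock<std::mutex> l(lock);
        cond_loggers.wait(l, [&] { return q.size() < max; });
        q.push_back(std::move(e));
        cond_flusher.notify_one();            // always signal the (one) flusher
      }

      void flush() {
        std::unique_lock<std::mutex> l(lock);
        cond_flusher.wait(l, [&] { return !q.empty(); });
        q.clear();                            // flush everything
        cond_loggers.notify_all();            // always broadcast to all loggers
      }
    };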
Backport: bobtail, argonaut Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 813787af3dbb99e42f481af670c4bb0e254e4432)
Josh Durgin [Sun, 30 Dec 2012 04:35:15 +0000 (20:35 -0800)]
librbd: fix race between unprotect and clone
Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.
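The key point, sketched with a placeholder enum (the real librbd protection-status values and header refresh path differ):

    enum ProtectionStatusSketch { UNPROTECTED, UNPROTECTING, PROTECTED };

    // After re-reading the parent header, only a status of PROTECTED counts;
    // treating UNPROTECTING as protected would let a racing unprotect finish
    // and leave the new clone with an unprotected parent.
    bool parent_usable_for_clone(ProtectionStatusSketch refreshed_status) {
      return refreshed_status == PROTECTED;
    }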
Josh Durgin [Sat, 29 Dec 2012 00:54:51 +0000 (16:54 -0800)]
librbd: add {rbd_}open_read_only()
Since 58890cfad5f7bee933baa599a68e6c65993379d4, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.
For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.
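A hedged usage example of the new call (the helper below is illustrative; check librbd.h in your version for the exact signatures):

    #include <cstdio>
    #include <rados/librados.h>
    #include <rbd/librbd.h>

    // Open an image without establishing a header watch, so read-only pool
    // access is sufficient, then print its size.
    int print_image_size(rados_ioctx_t ioctx, const char *image_name) {
      rbd_image_t image;
      int r = rbd_open_read_only(ioctx, image_name, &image, NULL /* no snap */);
      if (r < 0)
        return r;
      rbd_image_info_t info;
      r = rbd_stat(image, &info, sizeof(info));
      if (r == 0)
        printf("%s: %llu bytes\n", image_name, (unsigned long long)info.size);
      rbd_close(image);
      return r;
    }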
Josh Durgin [Sat, 29 Dec 2012 03:47:09 +0000 (19:47 -0800)]
OSD: remove RD flag from CALL ops
20496b8d2b2c3779a771695c6f778abbdb66d92a forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.
Josh Durgin [Sat, 29 Dec 2012 03:44:36 +0000 (19:44 -0800)]
cls_rbd: get_children does not need write permission
This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.
Remove the special-case check, which does not inform the peer what
protocol features are missing. It also enforces this requirement even
when we negotiate auth none.
Sage Weil [Fri, 28 Dec 2012 19:34:47 +0000 (11:34 -0800)]
osd: allow RecoveryDone self-transition in RepNotRecovering
In a mixed cluster where some OSDs support the recovery reservations and
some don't, a replica running the new code may be in RepNotRecovering when it
completes a backfill. In that case, we want to just stay in
RepNotRecovering.
It may also be possible to make it infer what the primary is doing even
though it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.
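For reference, a self-transition like this can be expressed in boost::statechart (which the PG peering machinery is built on) roughly as below; the states and event are toy stand-ins named after the commit message, not the real machine:

    #include <boost/mpl/list.hpp>
    #include <boost/statechart/event.hpp>
    #include <boost/statechart/simple_state.hpp>
    #include <boost/statechart/state_machine.hpp>
    #include <boost/statechart/transition.hpp>

    namespace sc = boost::statechart;

    struct RecoveryDone : sc::event<RecoveryDone> {};

    struct RepNotRecovering;
    struct ReplicaActive : sc::state_machine<ReplicaActive, RepNotRecovering> {};

    struct RepNotRecovering : sc::simple_state<RepNotRecovering, ReplicaActive> {
      // Accepting RecoveryDone here is a self-transition: a replica that just
      // finished a backfill for an old primary simply stays in this state.
      typedef boost::mpl::list<
        sc::transition<RecoveryDone, RepNotRecovering>
      > reactions;
    };

    int main() {
      ReplicaActive machine;
      machine.initiate();
      machine.process_event(RecoveryDone());   // no error; we stay put
    }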
Fixes: #3689 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]
osd: drop 'osd recovery max active' back to previous default (5)
Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high. In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).
Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.
Samuel Just [Fri, 21 Dec 2012 23:39:50 +0000 (15:39 -0800)]
PG: Handle repair once in scrub_finish
We don't want to change missing sets during a chunky
scrub since it would cause !is_clean() and derail
the rest of the scrub. Instead, move the missing,
inconsistent, and authoritative sets into the scrubber
and add to them during scrub_compare_maps(). Then,
handle repairing objects all at once in scrub_finish().
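A rough sketch of the accumulate-then-repair shape, with toy types standing in for hobject_t and the real scrubber state:

    #include <map>
    #include <set>
    #include <string>

    using object_id = std::string;   // stand-in for hobject_t
    using osd_id    = int;

    struct ScrubberSketch {
      std::map<object_id, std::set<osd_id>> missing;        // appended per chunk
      std::map<object_id, std::set<osd_id>> inconsistent;   // appended per chunk
      std::map<object_id, osd_id>           authoritative;  // appended per chunk

      void scrub_compare_maps() {
        // compare replica scrub maps for the current chunk and record any
        // problems in the sets above; the PG's missing set is left alone here
      }

      void scrub_finish(bool repair) {
        if (!repair)
          return;
        for (auto& [oid, auth_osd] : authoritative) {
          // only now, once per scrub, mark bad copies missing and queue recovery
          (void)oid; (void)auth_osd;
        }
      }
    };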
Dan Mick [Fri, 21 Dec 2012 03:53:07 +0000 (19:53 -0800)]
import_export.sh: sparse import export
Add tests for:
- sparse import makes expected sparse images
- sparse export makes expected sparse files
- sparse import from stdin also creates sparse images
- import from partially-sparse file leads to partially-sparse image
- import from stdin with zeros leads to sparse
- export from zeros-image to file leads to sparse file
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Dan Mick [Sat, 8 Dec 2012 06:57:06 +0000 (22:57 -0800)]
rbd: harder-working sparse import from stdin
Try to accumulate image-sized blocks when importing from stdin, even if
each read is shorter than requested; if we get a full block, and it's
all zeroes, we can seek and make a sparse output file.
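A sketch of the approach under those assumptions (short reads, block-sized accumulation, zero detection); error handling is trimmed, and this is not the actual rbd import code:

    #include <unistd.h>
    #include <algorithm>
    #include <vector>

    // Returns true if a full block was read (so more may follow).
    bool copy_block_sketch(int in_fd, int out_fd, size_t block_size) {
      std::vector<char> buf(block_size);
      size_t have = 0;
      while (have < block_size) {
        ssize_t r = ::read(in_fd, buf.data() + have, block_size - have);
        if (r <= 0)
          break;                                  // EOF or error: partial block
        have += static_cast<size_t>(r);
      }
      if (have == 0)
        return false;
      bool all_zero = std::all_of(buf.begin(), buf.begin() + have,
                                  [](char c) { return c == 0; });
      if (all_zero && have == block_size)
        ::lseek(out_fd, have, SEEK_CUR);          // leave a hole: sparse output
      else
        ::write(out_fd, buf.data(), have);
      return have == block_size;
    }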
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Sat, 22 Dec 2012 00:47:50 +0000 (16:47 -0800)]
osd: fix pg stat msgs vs timeout
We can get a pattern like so:
- new mon session
- after say 120 seconds, we decide to send a stats msg
- outstanding_pg_stats is finally true, we immediately time out (30 second
grace), and reconnect to a new mon
-> repeat
The problem is that we don't reset the last_sent timestamp when we send,
and that we do this check after sending instead of before. Fix both.
This should resolve issue #3661, where osds that don't have pgs
updating are not sending stats messages to the mon to check in, and are
eventually getting marked down as a result.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 21 Dec 2012 21:44:19 +0000 (13:44 -0800)]
monc: only warn about missing keyring if we fail to authenticate
This avoids the situation where a librados or other user with the default
of 'cephx,none' and no keyring is authenticating against a cluster where the
required auth is 'none', and an annoying warning is generated every time. Now
we only print a helpful message if we actually failed to authenticate.
Sage Weil [Fri, 21 Dec 2012 06:01:34 +0000 (22:01 -0800)]
osd: clear scrub state if queued scrub doesn't start
We set SCRUBBING when we queue a pg for scrub. If we dequeue and
call scrub() but abort for some reason (!active, degraded, etc.), clear
that state bit.
Bug is easily reproduced with 'ceph osd scrub N' during cluster startup
when PGs are peering; some PGs can get left in the scrubbing state.
Add 'ceph osd ls' to help; make the help for 'ceph osd tell N bench' look
more like injectargs, which says <osd-id or *>, to make it clear you
can benchmark all osds simultaneously.
Sage Weil [Thu, 20 Dec 2012 21:48:06 +0000 (13:48 -0800)]
log: fix flush/signal race
We need to signal the cond in the same interval where we hold the lock
*and* modify the queue. Otherwise, we can have a race like:
queue has 1 item, max is 1.
A: enter submit_entry, signal cond, wait on condition
B: enter submit_entry, signal cond, wait on condition
C: flush wakes up, flushes 1 previous item
A: retakes lock, enqueues something, exits
B: retakes lock, condition fails, waits
-> C is never woken up as there are 2 items waiting
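The rule the fix enforces, sketched with std::condition_variable rather than the actual log implementation: the enqueue and the signal happen inside the same critical section, so a waiter can never observe a stale queue after being signalled.

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <string>

    struct FlushSignalSketch {
      std::mutex lock;
      std::condition_variable flusher_cond;
      std::deque<std::string> q;

      void submit_entry(std::string e) {
        std::lock_guard<std::mutex> l(lock);
        q.push_back(std::move(e));      // modify the queue...
        flusher_cond.notify_all();      // ...and signal in the same locked interval
      }
    };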
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Samuel Just [Thu, 20 Dec 2012 21:23:27 +0000 (13:23 -0800)]
OSD,ReplicatedPG: do not track notifies on the session
handle_notify_timeout and remove_notify currently do not clean up this
state, leaving dangling Notification*. Further, we only use this mapping
in unwatch in order to determine which notifies to update. We can
accomplish the same thing by iterating through the obc->notifs mapping
since all notifications relevant for a given watch would have been for
the same obc as the watch.
Yehuda Sadeh [Wed, 19 Dec 2012 18:21:57 +0000 (10:21 -0800)]
rgw: don't try to assign content type if not found
Fixes: #3648
We cannot assign a NULL pointer to an STL string. This is only
relevant to swift, when uploading an object without specifying a
content type and when the suffix cannot be determined.
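Illustration of the guard (names are placeholders): assigning a NULL const char* to a std::string is undefined behavior, so the suffix lookup result must be checked first.

    #include <string>

    void set_content_type(std::string& content_type, const char *type_from_suffix) {
      if (type_from_suffix)                // only assign when a type was found
        content_type = type_from_suffix;   // otherwise leave it unset
    }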
Yehuda Sadeh [Thu, 20 Dec 2012 00:59:43 +0000 (16:59 -0800)]
rgw: don't initialize keystone if not set up
Fixes: #3653
There is no need to initialize keystone, including the keystone
revocation thread, which was verbose if keystone was
not set up. This removes some unhelpful errors from the
log.
Yehuda Sadeh [Wed, 19 Dec 2012 22:34:53 +0000 (14:34 -0800)]
rgw: remove useless configurable, fix swift auth error handling
Fixes: #3649
There is no need for an extra configurable to enable keystone; use keystone
whenever a keystone url has been specified. Also, fix bad error
handling that turned a failure to authenticate into successfully
authenticating a bad user.
Sage Weil [Wed, 19 Dec 2012 03:21:24 +0000 (19:21 -0800)]
ceph: report error string to stderr, not stdout
If we return an error, send the message to stderr. This makes things
more easily scriptable because error messages won't take the place of
expected output.
mon: OSDMonitor: add option 'mon_max_pool_pg_num' and limit 'pg_num' accordingly
Instead of having a hardcoded default, use a configurable one. It is
limited to 65536 until future testing guarantees there are no side effects
of increasing it past this value, but by being adjustable the user still
has the freedom to specify whatever maximum value they want.
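A hypothetical check mirroring that behavior (the option name is taken from the subject line; the error code and call site are assumptions):

    #include <cerrno>
    #include <cstdint>

    int check_pg_num(int64_t requested, int64_t mon_max_pool_pg_num = 65536) {
      if (requested > mon_max_pool_pg_num)
        return -ERANGE;     // caller reports the rejected pg_num to the user
      return 0;
    }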
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Gary Lowell [Sun, 16 Dec 2012 06:38:58 +0000 (22:38 -0800)]
Makefiles: Two new packages needed in the debian build dependencies.
The ceph test programs that are now being built by default require the junit
and libboost-program-options packages. These have been added to the build
dependencies in the debian control file.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
James Page [Wed, 12 Dec 2012 22:06:16 +0000 (22:06 +0000)]
Refactor rules file to separate arch/indep builds.
Prior to the ceph fs java bindings, all packages were
architecture dependent, so the packaging rules file
worked OK; this fixes up the binary-indep/arch targets
to split the builds of architecture dependent and
independent files.
Sage Weil [Sun, 16 Dec 2012 01:45:25 +0000 (17:45 -0800)]
osdc/Objecter: prevent pool dne check from invalidating scan_requests iterator
We iterate over ops and, if the pool dne and other conditions are true,
we will immediately return ENOENT and cancel an op. Increment the
iterator at the top of the loop to avoid invalidating it.
We also need to switch to a map<>, because hash_map<> mutations may
invalidate any/all iterators.
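The general pattern the fix relies on, sketched generically (not the Objecter's actual types): advance the iterator before the loop body can erase the current element.

    #include <map>

    template <typename K, typename V, typename Pred>
    void cancel_matching_ops(std::map<K, V>& ops, Pred should_cancel) {
      for (auto p = ops.begin(); p != ops.end(); ) {
        auto cur = p++;                 // increment at the top of the loop
        if (should_cancel(cur->second))
          ops.erase(cur);               // safe: 'p' already points past 'cur'
      }
    }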
Fixes: #3613 Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Fri, 14 Dec 2012 22:34:35 +0000 (14:34 -0800)]
qa: add a workunit for fsync-tester
It turns out that our suites don't exercise fsync, at least not very much
(I couldn't find it in all the places I looked for it). This tester
was written by Ted Ts'o and updated by Chris Mason; I just made it
work on a smaller dataset (256MB) because 8GB against a small cluster takes
more time than we want to wait.