Samuel Just [Thu, 7 Feb 2013 18:38:00 +0000 (10:38 -0800)]
PG: dirty_info on handle_activate_map
We need to make sure the pg epoch is persisted during
activate_map.
Backport: bobtail Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit dbce1d0dc919e221523bd44e1d0834711da1577d)
Sage Weil [Thu, 7 Feb 2013 18:21:49 +0000 (10:21 -0800)]
osd: flush peering queue (consume maps) prior to boot
If the osd itself is behind on many maps during boot, it will get more and
(as part of that) flush the peering wq to ensure the pgs consume them.
However, it is possible for the OSD to have the latest/recent maps but for
the pgs to be behind, and for the OSD to jump directly to boot and join. The
OSD is then laggy and unresponsive because the peering wq is way behind.
To avoid this, call consume_map() (kick the peering wq) at the end of
init and flush it to ensure we are *internally* all caught up before we
consider joining the cluster.
I'm pretty sure this is the root cause of #3905 and possibly #3995.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit af95d934b039d65d3667fc022e2ecaebba107b01)
Yehuda Sadeh [Thu, 7 Feb 2013 00:43:48 +0000 (16:43 -0800)]
rgw: bucket recreation should not clobber bucket info
Fixes: #4039
User's list of buckets is getting modified even if bucket already
exists. This fix removes the newly created directory object, and
makes sure that user info's data points at the correct bucket.
Samuel Just [Fri, 11 Jan 2013 18:44:04 +0000 (10:44 -0800)]
OSD: check for empty command in do_command
Fixes: #3878 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 8cf79f252a1bcea5713065390180a36f31d66dfd)
Danny Al-Gaaf [Wed, 30 Jan 2013 17:52:24 +0000 (18:52 +0100)]
PGMap: fix -Wsign-compare warning
Fix -Wsign-compare compiler warning:
mon/PGMap.cc: In member function 'void PGMap::apply_incremental
(CephContext*, const PGMap::Incremental&)':
mon/PGMap.cc:247:30: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Dan Mick [Wed, 30 Jan 2013 07:05:49 +0000 (23:05 -0800)]
cls_rbd, cls_rgw: use PRI*64 when printing/logging 64-bit values
caused segfaults in 32-bit build
Fixes: #3961 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e253830abac76af03c63239302691f7fac1af381)
Dan Mick [Tue, 29 Jan 2013 23:18:53 +0000 (15:18 -0800)]
init-ceph: make ulimit -n be part of daemon command
ulimit -n from 'max open files' was being set only on the machine
running /etc/init.d/ceph. It needs to be added to the commands to
start the daemons, and run both locally and remotely.
Verified by examining /proc/<pid>/limits on local and remote hosts
Fixes: #3900 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Loïc Dachary <loic@dachary.org> Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit 84a024b647c0ac2ee5a91bacdd4b8c966e44175c)
Try to share the map with a randomly picked OSD; if the picked OSD is
not 'up', then try to find the nearest 'up' OSD in the map by doing a
backward and a forward linear search on the map -- this would be O(n) in
the worst case scenario, as we only do a single iteration starting on the
picked position, incrementing and decrementing two different iterators
until we find an appropriate OSD or we exhaust the map.
Fixes: #3629
Backport: bobtail
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3610e72e4f9117af712f34a2e12c5e9537a5746f)
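The search described in the commit above (random pick, then one backward and one forward iterator moving away from the picked position) could be sketched roughly as follows. This is an illustration in Python, not the monitor's actual code; the function name, the list standing in for the OSD map, and the is_up callback are all assumptions made for the example.

```python
import random


def pick_up_osd(osds, is_up, rng=random):
    """Return an 'up' OSD near a randomly picked position, or None.

    osds  -- list of OSD ids in map order (hypothetical stand-in for the map)
    is_up -- callable returning True when the given OSD id is 'up'
    """
    if not osds:
        return None
    start = rng.randrange(len(osds))
    if is_up(osds[start]):
        return osds[start]
    # Walk outward from the picked position with two independent cursors.
    # Each entry is inspected at most once, so the worst case is O(n).
    back, fwd = start - 1, start + 1
    while back >= 0 or fwd < len(osds):
        if fwd < len(osds):
            if is_up(osds[fwd]):
                return osds[fwd]
            fwd += 1
        if back >= 0:
            if is_up(osds[back]):
                return osds[back]
            back -= 1
    return None  # map exhausted without finding an 'up' OSD
```

When two 'up' OSDs are equally distant from the picked position, this sketch prefers the forward one; the commit message does not say which direction wins a tie.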
Danny Al-Gaaf [Sun, 27 Jan 2013 20:57:31 +0000 (21:57 +0100)]
utime: fix narrowing conversion compiler warning in sleep()
Fix compiler warning:
./include/utime.h: In member function 'void utime_t::sleep()':
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_sec' from
'__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is
ill-formed in C++11 [-Wnarrowing]
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_nsec' from
'__u32 {aka unsigned int}' to 'long int' inside { } is
ill-formed in C++11 [-Wnarrowing]
Samuel Just [Fri, 11 Jan 2013 23:00:02 +0000 (15:00 -0800)]
ReplicatedPG: make_snap_collection when moving snap link in snap_trimmer
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 88956e3186798058a1170803f8abfc0f3cf77a07)
Samuel Just [Sat, 12 Jan 2013 00:43:14 +0000 (16:43 -0800)]
ReplicatedPG: correctly handle new snap collections on replica
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9e44fca13bf1ba39dbcad29111b29f46c49d59f7)
mon: Elector: reset the acked leader when the election finishes and we lost
Failure to do so will mean that we will always ack the same leader during
an election started by another monitor. This had been working so far
because we were still acking the existing leader if he was supposed to
still be the leader; or we were acking a new potential leader; or we
would eventually fall behind on an election and start a new election
ourselves, thus resetting the previously acked leader. While this wasn't
something that mattered much until now, the timechecks code stumbled into
this tiny issue and was failing hard at completing a round because there
wouldn't be a reset before the election started -- timechecks are bound
to election epochs.
Josh Durgin [Wed, 26 Dec 2012 22:24:22 +0000 (14:24 -0800)]
rbd: fix bench-write infinite loop
I/O was continuously submitted as long as there were few enough ops in
flight. If the number of 'threads' was high, or caching was turned on,
there would never be that many ops in flight, so the loop would continue
indefinitely. Instead, submit at most io_threads ops per offset.
Fixes: #3413 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit d81ac8418f9e6bbc9adcc69b2e7cb98dd4db6abb)
Josh Durgin [Wed, 2 Jan 2013 22:15:24 +0000 (14:15 -0800)]
librbd: establish watch before reading header
This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.
Josh Durgin [Wed, 2 Jan 2013 20:32:33 +0000 (12:32 -0800)]
Revert "librbd: ensure header is up to date after initial read"
Using assert_version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e1776809031c6dad441cfb2b9fac9612720b9083.
Sage Weil [Thu, 24 Jan 2013 06:16:49 +0000 (22:16 -0800)]
os/FileStore: only adjust up op queue for btrfs
We only need to adjust up the op queue limits during commit for btrfs,
because the snapshot initiation (async create) is currently
high-latency and the op queue is quiesced during that period.
This lets us revert 44dca5c, which disabled the extra allowance because
it is generally bad for non-btrfs writeahead mode.
For some reason, the lookup() retry loop (for when we happened to
race with a removal and grab an invalid WeakPtr) locked
the lock again. This causes the #3836 crash since the lock
is already locked. It's rare since it requires a lookup between
invalidation of the WeakPtr and removal of the WeakPtr entry.
Samuel Just [Thu, 24 Jan 2013 19:07:37 +0000 (11:07 -0800)]
OSD: use TPHandle in peering_wq
Implement _process overload with TPHandle argument and use
that to ping the hb map between pgs and between map epochs
when advancing a pg. The thread will still time out if
genuinely stuck at any point.
Samuel Just [Wed, 23 Jan 2013 20:15:10 +0000 (12:15 -0800)]
ReplicatedPG: start scanning omap at omap_recovered_to
Previously, we started scanning omap after omap_recovered_to.
This is a problem since the break in the loop implies that
omap_recovered_to is the first key not recovered.
Sage Weil [Wed, 23 Jan 2013 02:08:22 +0000 (18:08 -0800)]
os/FileStore: add stall injection into filestore op queue
Allow admin to artificially induce a stall in the op queue. Forces the
thread(s) to sleep for N seconds. We pause for 1 second increments and
recheck the value so that a previously stalled thread can be unwedged by
reinjecting a lower value (or 0). To stall indefinitely, just inject a
very large number.
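A rough model of the "sleep in 1 second increments and recheck" behaviour described above, written in Python for illustration only. The function and parameter names are hypothetical, and the loop condition (keep sleeping while the seconds already slept are below the currently injected value) is one plausible reading of the message, not a copy of the FileStore code.

```python
import time


def injected_stall(get_stall_seconds, sleep=time.sleep):
    """Sleep in 1-second steps while an injected stall is in effect.

    get_stall_seconds -- callable returning the currently injected value N
    sleep             -- sleep function, replaceable for testing
    Returns the number of seconds actually slept.
    """
    slept = 0
    # Re-read the injected value on every iteration so that lowering it
    # (for example to 0) releases a thread that is already stalled.
    while slept < get_stall_seconds():
        sleep(1)
        slept += 1
    return slept
```

Because the value is re-read every second, a stalled thread wakes at most about one second after a lower value is injected.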
Sage Weil [Wed, 23 Jan 2013 02:01:07 +0000 (18:01 -0800)]
osd: hold lock while calling start_boot on startup
This probably doesn't strictly matter because start_boot doesn't need the
lock (currently) and few other threads should be running, but it is
better to be consistent.
Sage Weil [Wed, 23 Jan 2013 01:56:32 +0000 (17:56 -0800)]
osd: do not reply to ping if internal heartbeat is not healthy
If we find that our internal threads are stalled, do not reply to ping
requests. If we do this long enough, peers will mark us down. If we are
only transiently unhealthy, we will reply to the next ping and they will
be satisfied. If we are unhealthy and marked down, and eventually recover,
we will mark ourselves back up.
Sage Weil [Mon, 21 Jan 2013 23:29:28 +0000 (15:29 -0800)]
common/PrioritizedQueue: add min cost, max tokens per bucket
Two problems.
First, we need to cap the tokens per bucket. Otherwise, a stream of
items at one priority over time will indefinitely inflate the tokens
available at another priority. The cap should represent how "bursty"
we allow a given bucket to be. Start with 4MB for now.
Second, set a floor on the item cost. Otherwise, we can have an
infinite queue of 0 cost items that starve other queues. More
realistically, we need to balance the overhead of processing small items
with the cost of large items. I.e., a 4 KB item is not 1/1000th as
expensive as a 4MB item.
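The two limits described above can be illustrated with a small token-bucket model in Python. This is a sketch under assumptions, not the PrioritizedQueue implementation: the class and method names are invented, the 4MB cap comes from the message, and the default cost floor of 4096 is a placeholder because the message does not state a value.

```python
class TokenBucket:
    """Budget for one priority level (illustrative names, not Ceph's)."""

    def __init__(self, max_tokens=4 * 1024 * 1024, min_cost=4096):
        self.max_tokens = max_tokens  # burst cap; 4MB per the commit message
        self.min_cost = min_cost      # hypothetical floor on item cost
        self.tokens = 0

    def put_tokens(self, amount):
        # Cap the balance so idle time cannot inflate it without bound.
        self.tokens = min(self.tokens + amount, self.max_tokens)

    def effective_cost(self, cost):
        # Floor the cost so zero-cost items still consume budget.
        return max(cost, self.min_cost)

    def try_take(self, cost):
        """Charge the floored cost if the bucket can afford it."""
        charge = self.effective_cost(cost)
        if charge > self.tokens:
            return False
        self.tokens -= charge
        return True
```

With the floor in place, a stream of zero-cost items drains the bucket like any other traffic, and with the cap a long-idle priority cannot accumulate an unbounded burst.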
Sage Weil [Mon, 21 Jan 2013 22:14:25 +0000 (14:14 -0800)]
osd: add OpRequest flag point when commit is sent
With writeahead journaling in particular, we can get requests that
stay in the queue for a long time even after the commit is sent to the
client while we are waiting for the transaction to apply to the fs.
Instead of showing up as 'waiting for subops', make it clear that the
client has gotten its reply and it is local state that is slow.
Sage Weil [Tue, 22 Jan 2013 04:00:26 +0000 (20:00 -0800)]
osd: target transaction size 300 -> 30
Small transactions make pg removal nicer to the op queue. It also slows
down PG deletion a bit, which may exacerbate the PG resurrection case
until #3884 is addressed.
At least one user reported this fixed an osd that kept failing due to
an internal heartbeat failure.
Dan Mick [Tue, 8 Jan 2013 19:21:22 +0000 (11:21 -0800)]
librbd: Allow get_lock_info to fail
If the lock class isn't present, EOPNOTSUPP is returned for lock calls
on newer OSDs, but sadly EIO on older; we need to treat both as
acceptable failures for RBD images. rados lock list will still fail.
Sage Weil [Fri, 4 Jan 2013 21:00:56 +0000 (13:00 -0800)]
osd: drop newlines from event descriptions
These produce extra newlines in the log.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 9a1f574283804faa6dbba9165a40558e1a6a1f13)
Samuel Just [Fri, 18 Jan 2013 22:35:51 +0000 (14:35 -0800)]
OSD: do deep_scrub for repair
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 0cb760f31b0cb26f022fe8b9341e41cd5351afac)
Samuel Just [Thu, 10 Jan 2013 00:41:40 +0000 (16:41 -0800)]
ReplicatedPG: compare nlinks to snapcolls
nlinks gives us the number of hardlinks to the object.
nlinks should be 1 + snapcolls.size(). This will allow
us to detect links which remain in an erroneous snap
collection.
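The invariant stated above amounts to a one-line check. The Python below is only an illustration; the function name and the use of a plain list for the snap collections are assumptions.

```python
def nlinks_consistent(nlinks, snapcolls):
    """True when the hardlink count matches 1 + number of snap collections.

    nlinks    -- hardlink count reported for the object
    snapcolls -- snap collections the object is expected to be linked into
    """
    return nlinks == 1 + len(snapcolls)
```

A count above the expected value would point at a link left behind in a snap collection that should no longer hold the object.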
Sage Weil [Tue, 15 Jan 2013 02:31:06 +0000 (18:31 -0800)]
osd: fix rescrub after repair
We were rescrubbing if INCONSISTENT is set, but that is now persistent.
Add a new scrub_after_recovery flag that is reset on each peering interval
and set that when repair encounters errors.
Sage Weil [Mon, 14 Jan 2013 06:04:58 +0000 (22:04 -0800)]
osd: change scrub min/max thresholds
The previous 'osd scrub min interval' was mostly meaningless and useless.
Meanwhile, the 'osd scrub max interval' would only trigger a scrub if the
load was sufficiently low; if it was high, the PG might *never* scrub.
Instead, make the 'min' what the max used to be. If it has been more than
this many seconds, and the load is low, scrub. And add an additional
condition that if it has been more than the max threshold, scrub the PG
no matter what--regardless of the load.
Note that this does not change the default scrub interval for less-loaded
clusters, but it *does* change the meaning of existing config options.
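The new meaning of the two thresholds can be summarised as a small decision function. This is a Python sketch of the rule as described in the message, with hypothetical parameter names; it is not the OSD's scheduling code.

```python
def should_scrub(seconds_since_scrub, load_is_low, min_interval, max_interval):
    """Decide whether a PG should be scrubbed now.

    Past max_interval: scrub regardless of load.
    Past min_interval: scrub only when the load is low.
    """
    if seconds_since_scrub > max_interval:
        return True
    return seconds_since_scrub > min_interval and load_is_low
```

For example, with a min of one day and a max of one week (values chosen only for illustration), a PG last scrubbed two days ago is scrubbed only on a lightly loaded OSD, while one last scrubbed eight days ago is scrubbed unconditionally.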
This was already a no-op: we don't call PG::scrub_sched() unless it has
been osd_scrub_max_interval seconds since we last scrubbed. Unless we
explicitly requested it, in which case we don't want this check anyway.