Sage Weil [Mon, 11 Feb 2013 01:02:45 +0000 (17:02 -0800)]
osd: include snaps in pg_log_entry_t::dump()
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 715d8717a0e8a08fbe97a3e7d3ffd33aa9529d90)
Sage Weil [Mon, 11 Feb 2013 00:59:48 +0000 (16:59 -0800)]
osd: unconditionally encode snaps buffer
Previously we would only encode the updated snaps vector for CLONE ops.
This doesn't work for MODIFY ops generated by the snap trimmer, which
may also adjust the clone collections. It is also possible that other
operations may need to populate this field in the future (e.g.,
LOST_REVERT may, although it currently does not).
Fixes: #4071, and possibly #4051. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 54b6dd924fea3af982f3d729150b6449f318daf2)
Sage Weil [Sun, 10 Feb 2013 18:57:12 +0000 (10:57 -0800)]
osd: improve debug output on snap collections
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 8b05492ca5f1479589bb19c1ce058b0d0988b74f)
Samuel Just [Thu, 7 Mar 2013 20:53:51 +0000 (12:53 -0800)]
PG: check_recovery_sources must happen even if not active
missing_loc/missing_loc_sources also must be cleaned up
if a peer goes down during peering:
1) pg is in GetInfo, acting is [3,1]
2) we find object A on osd [0] in GetInfo
3) 0 goes down, no new peering interval since it is neither up nor
acting, but peer_missing[0] is removed.
4) pg goes active and try to pull A from 0 since missing_loc did not get
cleaned up.
Backport: bobtail Fixes: #4371 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit de22b186c497ce151217aecf17a8d35cdbf549bb)
Samuel Just [Sun, 10 Mar 2013 19:50:01 +0000 (12:50 -0700)]
ReplicatedPG: don't leak reservation on removal
Fixes: 4431 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 32bf131e0141faf407b5ff993f75f97516b27c12)
Yehuda Sadeh [Tue, 12 Mar 2013 19:56:01 +0000 (12:56 -0700)]
rgw: set up curl with CURL_NOSIGNAL
Fixes: #4425
Backport: bobtail
Apparently, libcurl needs that in order to be thread safe. Side
effect is that if libcurl is not compiled with c-ares support,
domain name lookups are not going to time out.
Issue affected keystone.
Sage Weil [Fri, 8 Mar 2013 16:56:44 +0000 (08:56 -0800)]
osd: mark down connections from old peers
Close out any connection with an old peer. This avoids a race like:
- peer marked down
- we get map, mark down the con
- they reconnect and try to send us some stuff
- we share our map to tell them they are old and dead, but leave the con
open
...
- peer marks itself up a few times, eventually reuses the same port
- sends messages on their fresh con
- we discard because of our old con
This could cause a tight reconnect loop, but it is better than wrong
behavior.
Other possible fixes:
- make addr nonce truly unique (augment pid in nonce)
- make a smarter 'disposable' msgr state (bleh)
Josh Durgin [Thu, 7 Mar 2013 01:42:03 +0000 (17:42 -0800)]
common: reduce default in-memory logs for non-daemons
The default of 100000 can result in hundreds of MBs of extra memory
used. This was most obvious when using librbd with caching enabled,
since there was a dout(0) accidentally left in the ObjectCacher.
Sage Weil [Sat, 23 Feb 2013 01:01:53 +0000 (17:01 -0800)]
osd: allow (some) log trim when degraded, but not during recovery
We allow some trim during degraded, although we keep more entries around to
improve our chances of a restarting OSD of doing log-based recovery.
Still disallow during recovery...
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 6d89b34e5608c71b49ef33ab58340e90bd8da6e4)
Sage Weil [Mon, 25 Feb 2013 23:33:35 +0000 (15:33 -0800)]
osd: restructure calc_trim
No functional change, except that we log more debug, yay!
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 86df164d04f6e31a0f20bbb94dbce0599c0e8b3d)
Sage Weil [Sat, 23 Feb 2013 00:48:02 +0000 (16:48 -0800)]
osd: allow pg log trim during (non-classic) scrub
Chunky (and deep) scrub do not care about PG log trimming. Classic scrub
still does.
Deep scrub can take a long time, so not trimming the log during that period
may eat lots of RAM; avoid that!
Might fix: #4179 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 0ba8db6b664205348d5499937759916eac0997bf)
Sage Weil [Thu, 28 Feb 2013 20:46:00 +0000 (12:46 -0800)]
msgr: drop messages on cons with CLOSED Pipes
Back in commit 6339c5d43974f4b495f15d199e01a141e74235f5, we tried to make
this deal with a race between a faulting pipe and new messages being
queued. The sequence is
- fault starts on pipe
- fault drops pipe_lock to unregister the pipe
- user (objecter) queues new message on the con
- submit_message reopens a Pipe (due to this bug)
- the message managed to make it out over the wire
- fault finishes faulting, calls ms_reset
- user (objecter) closes the con
- user (objecter) resends everything
It appears as though the previous patch *meant* to drop *m on the floor in
this case, which is what this patch does. And that fixes the crash I am
hitting; see #4271.
Sam Lang [Tue, 12 Feb 2013 17:32:29 +0000 (11:32 -0600)]
deb: Add ceph-coverage to ceph-test deb package
Teuthology uses the ceph-coverage script extensively
and expects it to be installed by the ceph task. Add
the script to the ceph-test debian package so that it
gets installed for that use case.
Yehuda Sadeh [Thu, 7 Mar 2013 03:32:21 +0000 (19:32 -0800)]
rgw: don't iterate through all objects when in namespace
Fixes: #4363
Backport: argonaut, bobtail
When listing objects in namespace don't iterate through all the
objects, only go though the ones that starts with the namespace
prefix
Samuel Just [Thu, 28 Feb 2013 00:58:45 +0000 (16:58 -0800)]
FileJournal::wrap_read_bl: adjust pos before returning
Otherwise, we may feed an offset past the end of the journal to
check_header in read_entry and incorrectly determine that the entry is
corrupt.
Fixes: 4296
Backport: bobtail
Backport: argonaut Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 5d54ab154ca790688a6a1a2ad5f869c17a23980a)
Sage Weil [Mon, 18 Feb 2013 06:35:50 +0000 (22:35 -0800)]
osd: requeue pg waiters at the front of the finished queue
We could have a sequence like:
- op1
- notify
- op2
in the finished queue. Op1 gets put on waiting_for_pg, the notify
creates the pg and requeues op1 (and the end), op2 is handled, and
finally op1 is handled. That breaks ordering; see #2947.
Instead, when we wake up a pg, queue the waiting messages at the front
of the dispatch queue.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 56c5a07708d52de1699585c9560cff8b4e993d0a)
Sage Weil [Mon, 18 Feb 2013 04:49:52 +0000 (20:49 -0800)]
osd: pull requeued requests off one at a time
Pull items off the finished queue on at a time. In certain cases, an
event may result in new items betting added to the finished queue that
will be put at the *front* instead of the back. See latest incarnation
of #2947.
Note that this is a significant changed in behavior in that we can
theoretically starve if an event keeps resulting in new events getting
generated. Beware!
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit f1841e4189fce70ef5722d508289e516faa9af6a)
Sage Weil [Fri, 18 Jan 2013 06:00:42 +0000 (22:00 -0800)]
mds: open mydir after replay
In certain cases, we may replay the journal and not end up with the
dirfrag for mydir open. This is fine--we just need to open it up and
fetch it below.
Greg Farnum [Thu, 21 Feb 2013 17:21:01 +0000 (09:21 -0800)]
mds: use inode_t::layout for dir layout policy
Remove the default_file_layout struct, which was just a ceph_file_layout,
and store it in the inode_t. Rip out all the annoying code that put this
on the heap.
To aid in this usage, add a clear_layout() function to inode_t.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com> Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 21 Jan 2013 05:53:37 +0000 (21:53 -0800)]
mds: parse ceph.*.layout vxattr key/value content
Use qi to parse a strictly formatted set of key/value pairs. Be picky
about whitespace. Any subset of recognized keys is allowed. Parse the
same set of keys as the ceph.*.layout.* vxattrs.
Samuel Just [Tue, 19 Feb 2013 18:49:33 +0000 (10:49 -0800)]
PG: remove weirdness log for last_complete < log.tail
In the case of a divergent object prior to log.tail,
last_complete may end up before log.tail.
Backport: bobtail
Fixes #4174 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit dbadb3e2921297882c5836c67ca32bb8ecdc75db)
James Page [Mon, 18 Feb 2013 16:24:54 +0000 (16:24 +0000)]
Strip any trailing whitespace from rbd showmapped
More recent versions of ceph append a bit of whitespace to the line
after the name of the /dev/rbdX device; this causes the monitor check
to fail as it can't find the device name due to the whitespace.
Sage Weil [Thu, 14 Feb 2013 23:39:43 +0000 (15:39 -0800)]
osd/OSDCap: tweak unquoted_word parsing in osd caps
Newer versions of spirit (1.49.0-3.1ubuntu1.1 in quantal, in particular)
dislike the construct with alnum and replace the - and _ with '\0' in the
resulting string.
Fixes: #4122
Backport: bobtail Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 6c504d96c1e4fbb67578fba0666ca453b939c218)
Yehuda Sadeh [Fri, 8 Feb 2013 21:14:49 +0000 (13:14 -0800)]
rgw: change json formatting for swift list container
Fixes: #4048
There is some difference in the way swift formats the
xml output and the json output for list container. In
xml the entity is named 'name' and in json it is named
'subdir'.
Sage Weil [Sat, 9 Feb 2013 05:36:13 +0000 (21:36 -0800)]
java: make CephMountTest use user.* xattr names
Changes to the xattr code in Ceph require
a few tweaks to existing test cases.
Specifically, there is now a ceph.file.layout
xattr by default and user defined xattrs
are prepended with "user."
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joe Buck <jbbuck@gmail.com> Reviewed-by: Noah Watkins <noahwatkins@gmail.com>
Sage Weil [Fri, 8 Feb 2013 06:06:14 +0000 (22:06 -0800)]
mon: handle -EAGAIN in completion contexts
We can get ECANCELED, EAGAIN, or success out of the completion contexts,
but in the EAGAIN case (meaning there was an election) we were sending
a success to the client. This resulted in client hangs and all-around
confusion when the monitor cluster was thrashing.
Backport: bobtail Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 17827769f1fe6d7c4838253fcec3b3a4ad288f41)
Sage Weil [Tue, 12 Feb 2013 22:11:09 +0000 (14:11 -0800)]
osd: only share maps on hb connection of OSD_HBMSGS feature is set
Back in 1bc419a7affb056540ba8f9b332b6ff9380b37af we started sharing maps
with dead osds via the heartbeat connection, but old code will crash on an
unexpected message. Only do this if the OSD_HBMSGS feature is present.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 302b26ff70ee5539da3dcb2e5614e2b7e83b9dcd)
Sage Weil [Tue, 12 Feb 2013 22:10:51 +0000 (14:10 -0800)]
osd: tolerate unexpected messages on the heartbeat interface
We should note but not crash on unexpected messages. Announce this awesome
new "capability" via a feature bit.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit afda30aeaae0a65f83c6886658354ad2b57c4c43)
Samuel Just [Thu, 7 Feb 2013 18:38:00 +0000 (10:38 -0800)]
PG: dirty_info on handle_activate_map
We need to make sure the pg epoch is persisted during
activate_map.
Backport: bobtail Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit dbce1d0dc919e221523bd44e1d0834711da1577d)
Sage Weil [Thu, 7 Feb 2013 18:21:49 +0000 (10:21 -0800)]
osd: flush peering queue (consume maps) prior to boot
If the osd itself is behind on many maps during boot, it will get more and
(as part of that) flush the peering wq to ensure the pgs consume them.
However, it is possible for OSD to have latest/recnet maps, but pgs to be
behind, and to jump directly to boot and join. The OSD is then laggy and
unresponsive because the peering wq is way behind.
To avoid this, call consume_map() (kick the peering wq) at the end of
init and flush it to ensure we are *internally* all caught up before we
consider joining the cluster.
I'm pretty sure this is the root cause of #3905 and possibly #3995.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit af95d934b039d65d3667fc022e2ecaebba107b01)
Yehuda Sadeh [Thu, 7 Feb 2013 00:43:48 +0000 (16:43 -0800)]
rgw: bucket recreation should not clobber bucket info
Fixes: #4039
User's list of buckets is getting modified even if bucket already
exists. This fix removes the newly created directory object, and
makes sure that user info's data points at the correct bucket.
Sage Weil [Fri, 25 Jan 2013 17:29:37 +0000 (09:29 -0800)]
osd: share incoming maps via Connection*, not addrs
Kill a set of parallel methods that are using the old addr/inst-based
msgr APIs, and instead use Connection handles. This is much safer and gets
us closer to killing the old msgr API.
Sage Weil [Fri, 25 Jan 2013 17:27:00 +0000 (09:27 -0800)]
osd: pass new maps to dead osds via existing Connection
Previously we were sending these maps to dead osds via their old addrs
using a new outgoing connection and setting the flags so that the msgr
would clean up. That mechanism is possibly buggy and fragile, and we can
avoid it entirely if we just reuse the existing heartbeat Connection.
Sage Weil [Fri, 25 Jan 2013 17:25:28 +0000 (09:25 -0800)]
osd: requeue osdmaps on heartbeat connections for cluster connection
If we receive an OSDMap on the cluster connection, requeue it for the
cluster messenger, and process it there where we normally do. This avoids
any concerns about locking and ordering rules.
Samuel Just [Fri, 11 Jan 2013 18:44:04 +0000 (10:44 -0800)]
OSD: check for empty command in do_command
Fixes: #3878 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 8cf79f252a1bcea5713065390180a36f31d66dfd)
Danny Al-Gaaf [Wed, 30 Jan 2013 17:52:24 +0000 (18:52 +0100)]
PGMap: fix -Wsign-compare warning
Fix -Wsign-compare compiler warning:
mon/PGMap.cc: In member function 'void PGMap::apply_incremental
(CephContext*, const PGMap::Incremental&)':
mon/PGMap.cc:247:30: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Dan Mick [Wed, 30 Jan 2013 07:05:49 +0000 (23:05 -0800)]
cls_rbd, cls_rgw: use PRI*64 when printing/logging 64-bit values
caused segfaults in 32-bit build
Fixes: #3961 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e253830abac76af03c63239302691f7fac1af381)
Dan Mick [Tue, 29 Jan 2013 23:18:53 +0000 (15:18 -0800)]
init-ceph: make ulimit -n be part of daemon command
ulimit -n from 'max open files' was being set only on the machine
running /etc/init.d/ceph. It needs to be added to the commands to
start the daemons, and run both locally and remotely.
Verified by examining /proc/<pid>/limits on local and remote hosts
Fixes: #3900 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Loïc Dachary <loic@dachary.org> Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit 84a024b647c0ac2ee5a91bacdd4b8c966e44175c)
Try to share the map with a randomly picked OSD; if the picked monitor is
not 'up', then try to find the nearest 'up' OSD in the map by doing a
backward and a forward linear search on the map -- this would be O(n) in
the worst case scenario, as we only do a single iteration starting on the
picked position, incrementing and decrementing two different iterators
until we find an appropriate OSD or we exhaust the map.
Fixes: #3629
Backport: bobtail
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3610e72e4f9117af712f34a2e12c5e9537a5746f)
Danny Al-Gaaf [Sun, 27 Jan 2013 20:57:31 +0000 (21:57 +0100)]
utime: fix narrowing conversion compiler warning in sleep()
Fix compiler warning:
./include/utime.h: In member function 'void utime_t::sleep()':
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_sec' from
'__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is
ill-formed in C++11 [-Wnarrowing]
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_nsec' from
'__u32 {aka unsigned int}' to 'long int' inside { } is
ill-formed in C++11 [-Wnarrowing]
Samuel Just [Fri, 11 Jan 2013 23:00:02 +0000 (15:00 -0800)]
ReplicatedPG: make_snap_collection when moving snap link in snap_trimmer
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 88956e3186798058a1170803f8abfc0f3cf77a07)
Samuel Just [Sat, 12 Jan 2013 00:43:14 +0000 (16:43 -0800)]
ReplicatedPG: correctly handle new snap collections on replica
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9e44fca13bf1ba39dbcad29111b29f46c49d59f7)
mon: Elector: reset the acked leader when the election finishes and we lost
Failure to do so will mean that we will always ack the same leader during
an election started by another monitor. This had been working so far
because we were still acking the existing leader if he was supposed to
still be the leader; or we were acking a new potentially leader; or we
would eventually fall behind on an election and start a new election
ourselves, thus resetting the previously acked leader. While this wasn't
something that mattered much until now, the timechecks code stumbled into
this tiny issue and was failing hard at completing a round because there
wouldn't be a reset before the election started -- timechecks are bound
to election epochs.
Josh Durgin [Wed, 26 Dec 2012 22:24:22 +0000 (14:24 -0800)]
rbd: fix bench-write infinite loop
I/O was continously submitted as long as there were few enough ops in
flight. If the number of 'threads' was high, or caching was turned on,
there would never be that many ops in flight, so the loop would continue
indefinitely. Instead, submit at most io_threads ops per offset.
Fixes: #3413 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit d81ac8418f9e6bbc9adcc69b2e7cb98dd4db6abb)