]> git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
11 years agoosd: throttle snap trimmming with simple delay
Sage Weil [Fri, 18 Apr 2014 20:50:11 +0000 (13:50 -0700)]
osd: throttle snap trimmming with simple delay

This is not particularly smart, but it is *a* knob that lets you make
the snap trimmer slow down.  It's a flow and a simple delay, so it is
adjustable at runtime.  Default is 0 (no change in behavior).

Partial solution for #6278.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4413670d784efc2392359f0f22bca7c9056188f4)

11 years agoauth: add rwlock to AuthClientHandler to prevent races
Josh Durgin [Wed, 2 Apr 2014 00:27:01 +0000 (17:27 -0700)]
auth: add rwlock to AuthClientHandler to prevent races

For cephx, build_authorizer reads a bunch of state (especially the
current session_key) which can be updated by the MonClient. With no
locks held, Pipe::connect() calls SimpleMessenger::get_authorizer()
which ends up calling RadosClient::get_authorizer() and then
AuthClientHandler::bulid_authorizer(). This unsafe usage can lead to
crashes like:

Program terminated with signal 11, Segmentation fault.
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
370 common/buffer.cc: No such file or directory.
in common/buffer.cc
(gdb) bt
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
0x00007fa0d2ddec00 in ~ptr (this=0x7f989c03b830) at ./include/buffer.h:171
ceph::buffer::list::rebuild (this=0x7f989c03b830) at common/buffer.cc:817
0x00007fa0d2ddecb9 in ceph::buffer::list::c_str (this=0x7f989c03b830) at common/buffer.cc:1045
0x00007fa0d2ea4dc2 in Pipe::connect (this=0x7fa0c4307340) at msg/Pipe.cc:907
0x00007fa0d2ea7d73 in Pipe::writer (this=0x7fa0c4307340) at msg/Pipe.cc:1518
0x00007fa0d2eb44dd in Pipe::Writer::entry (this=<value optimized out>) at msg/Pipe.h:59
0x00007fa0e0f5f9d1 in start_thread (arg=0x7f987a5e4700) at pthread_create.c:301
0x00007fa0de560b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

and

Error in `qemu-system-x86_64': invalid fastbin entry (free): 0x00007ff12887ff20
*** ======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x80a46)[0x7ff3dea1fa46]
/usr/lib/librados.so.2(+0x29eb03)[0x7ff3e3d43b03]
/usr/lib/librados.so.2(_ZNK9CryptoKey7encryptEP11CephContextRKN4ceph6buffer4listERS4_RSs+0x71)[0x7ff3e3d42661]
/usr/lib/librados.so.2(_Z21encode_encrypt_enc_blIN4ceph6buffer4listEEvP11CephContextRKT_RK9CryptoKeyRS2_RSs+0xfe)[0x7ff3e3d417de]
/usr/lib/librados.so.2(_Z14encode_encryptIN4ceph6buffer4listEEiP11CephContextRKT_RK9CryptoKeyRS2_RSs+0xa2)[0x7ff3e3d41912]
/usr/lib/librados.so.2(_ZN19CephxSessionHandler12sign_messageEP7Message+0x242)[0x7ff3e3d40de2]
/usr/lib/librados.so.2(_ZN4Pipe6writerEv+0x92b)[0x7ff3e3e61b2b]
/usr/lib/librados.so.2(_ZN4Pipe6Writer5entryEv+0xd)[0x7ff3e3e6c7fd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e)[0x7ff3ded6ff8e]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7ff3dea99a0d]

Fix this by adding an rwlock to AuthClientHandler. A simpler fix would
be to move RadosClient::get_authorizer() into the MonClient() under
the MonClient lock, but this would not catch all uses of other
Authorizer, e.g. for verify_authorizer() and it would serialize
independent connection attempts.

This mainly matters for cephx, but none and unknown can have the
global_id reset as well.

Partially-fixes: #6480
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 2cc76bcd12d803160e98fa73810de2cb916ef1ff)

11 years agopipe: only read AuthSessionHandler under pipe_lock
Josh Durgin [Tue, 1 Apr 2014 18:37:29 +0000 (11:37 -0700)]
pipe: only read AuthSessionHandler under pipe_lock

session_security, the AuthSessionHandler for a Pipe, is deleted and
recreated while the pipe_lock is held. read_message() is called
without pipe_lock held, and examines session_security. To make this
safe, make session_security a shared_ptr and take a reference to it
while the pipe_lock is still held, and use that shared_ptr in
read_message().

This may have caused crashes like:

*** Error in `qemu-system-x86_64': invalid fastbin entry (free): 0x00007f42a4002de0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x80a46)[0x7f452f1f3a46]
/usr/lib/x86_64-linux-gnu/libnss3.so(PK11_FreeSymKey+0xa8)[0x7f452e72ff98]
/usr/lib/librados.so.2(+0x2a18cd)[0x7f453451a8cd]
/usr/lib/librados.so.2(_ZNK9CryptoKey7encryptEP11CephContextRKN4ceph6buffer4listERS4_RSs+0x71)[0x7f4534519421]
/usr/lib/librados.so.2(_Z21encode_encrypt_enc_blIN4ceph6buffer4listEEvP11CephContextRKT_RK9CryptoKeyRS2_RSs+0xfe)[0x7f453451859e]
/usr/lib/librados.so.2(_Z14encode_encryptIN4ceph6buffer4listEEiP11CephContextRKT_RK9CryptoKeyRS2_RSs+0xa2)[0x7f45345186d2]
/usr/lib/librados.so.2(_ZN19CephxSessionHandler23check_message_signatureEP7Message+0x246)[0x7f4534516866]
/usr/lib/librados.so.2(_ZN4Pipe12read_messageEPP7Message+0xdcc)[0x7f453462ecbc]
/usr/lib/librados.so.2(_ZN4Pipe6readerEv+0xa5c)[0x7f453464059c]
/usr/lib/librados.so.2(_ZN4Pipe6Reader5entryEv+0xd)[0x7f4534643ecd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e)[0x7f452f543f8e]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f452f26da0d]

Partially-fixes: #6480
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 1d74170a4c252f35968ccfbec8e432582e92f638)

11 years agoosdc/ObjectCacher: back off less during flush
Sage Weil [Fri, 3 Jan 2014 20:51:15 +0000 (12:51 -0800)]
osdc/ObjectCacher: back off less during flush

In cce990efc8f2a58c8d0fa11c234ddf2242b1b856 we added a limit to avoid
holding the lock for too long.  However, if we back off, we currently
wait for a full second, which is probably a bit much--we really just want
to give other threads a chance.

Backport: emperor
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e2ee52879e9de260abbf5eacbdabbd71973a6a83)

11 years agoos/FileStore: reset journal state on umount
Sage Weil [Tue, 8 Apr 2014 17:52:43 +0000 (10:52 -0700)]
os/FileStore: reset journal state on umount

We observed a sequence like:

 - replay journal
   - sets JournalingObjectStore applied_op_seq
 - umount
 - mount
   - initiate commit with prevous applied_op_seq
 - replay journal
   - commit finishes
   - on replay commit, we fail assert op > committed_seq

Although strictly speaking the assert failure is harmless here, in general
we should not let state leak through from a previous mount into this
mount or else assertions are in general more difficult to reason about.

Fixes: #8019
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4de49e8676748b6ab4716ff24fd0a465548594fc)

11 years agomon: wait for quorum for MMonGetVersion
Sage Weil [Sat, 5 Apr 2014 23:58:55 +0000 (16:58 -0700)]
mon: wait for quorum for MMonGetVersion

We should not respond to checks for map versions when we are in the
probing or electing states or else clients will get incorrect results when
they ask what the latest map version is.

Fixes: #7997
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 67fd4218d306c0d2c8f0a855a2e5bf18fa1d659e)

11 years agorgw: fix swift range response
Yehuda Sadeh [Wed, 19 Feb 2014 16:59:07 +0000 (08:59 -0800)]
rgw: fix swift range response

Fixes: #7099
Backport: dumpling
The range response header was broken in swift.

Reported-by: Julien Calvet <julien.calvet@neurea.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 0427f61544529ab4e0792b6afbb23379fe722de1)

11 years agorgw: don't log system requests in usage log
Yehuda Sadeh [Fri, 22 Nov 2013 23:41:49 +0000 (15:41 -0800)]
rgw: don't log system requests in usage log

Fixes: 6889
System requets should not be logged in the usage log.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 42ef8ba543c7bf13c5aa3b6b4deaaf8a0f9c58b6)

11 years agoPrioritizedQueue: cap costs at max_tokens_per_subqueue
Samuel Just [Thu, 13 Mar 2014 21:04:19 +0000 (14:04 -0700)]
PrioritizedQueue: cap costs at max_tokens_per_subqueue

Otherwise, you can get a recovery op in the queue which has a cost
higher than the max token value.  It won't get serviced until all other
queues also do not have enough tokens and higher priority queues are
empty.

Fixes: #7706
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 2722a0a487e77ea2aa0d18caec0bdac50cb6a264)

11 years agoFix byte-order dependency in calculation of initial challenge
Dan Mick [Thu, 3 Apr 2014 20:59:59 +0000 (13:59 -0700)]
Fix byte-order dependency in calculation of initial challenge

Fixes: #7977
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4dc62669ecd679bc4d0ef2b996b2f0b45b8b4dc7)

11 years agorbd.cc: tolerate lack of NUL-termination on block_name_prefix
Dan Mick [Wed, 26 Mar 2014 00:09:48 +0000 (17:09 -0700)]
rbd.cc: tolerate lack of NUL-termination on block_name_prefix

Fixes: #7577
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fd76fec589be13a4a6362ef388929d3e3d1d21f6)

11 years agomon: MonCommands.h: have 'auth' read-only operations require 'x' cap
Joao Eduardo Luis [Thu, 3 Apr 2014 17:21:08 +0000 (18:21 +0100)]
mon: MonCommands.h: have 'auth' read-only operations require 'x' cap

This reintroduces the same semantics that were in place in dumpling prior
to the refactoring of the cap/command matching code.

We haven't added this requirement to auth read-write operations as that
would have the potential to break a lot of well-configured keyrings once
the users upgraded, without any significant gain -- we assume that if
they have set 'rw' caps on a given entity, they are indeed expecting said
entity to be sort-of-privileged entities with regard to monitor access.

Fixes: #7919
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit db266a3fb2985605738201f59f07fa504c91c770)

Conflicts:

doc/release-notes.rst

11 years agotest: use older names for module setup/teardown
Josh Durgin [Thu, 21 Nov 2013 02:35:34 +0000 (18:35 -0800)]
test: use older names for module setup/teardown

setUp and tearDown require nosetests 0.11, but 0.10.4 is the latest on
centos. Rename to use the older aliases, which still work with newer
versions of nosetests as well.

Fixes: #6368
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit f753d56a9edba6ce441520ac9b52b93bd8f1b5b4)

11 years agoMerge pull request #1589 from ceph/wip-7805-emperor
Samuel Just [Wed, 2 Apr 2014 17:44:07 +0000 (10:44 -0700)]
Merge pull request #1589 from ceph/wip-7805-emperor

PG: only complete replicas should count toward min_size

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoRevert "PG: only complete replicas should count toward min_size"
Sage Weil [Tue, 1 Apr 2014 22:52:52 +0000 (15:52 -0700)]
Revert "PG: only complete replicas should count toward min_size"

This reverts commit b097a237e11b48c47d3fd5484f3449e683e95db0.

This causes us to crash, probably because whoami is not in peer_info but
may be in want.

11 years agoPG: only complete replicas should count toward min_size 1589/head
Sage Weil [Tue, 1 Apr 2014 23:01:28 +0000 (16:01 -0700)]
PG: only complete replicas should count toward min_size

Backport: emperor,dumpling,cuttlefish
Fixes: #7805
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosd: do not send peering messages during init
Sage Weil [Sun, 5 Jan 2014 06:39:35 +0000 (22:39 -0800)]
osd: do not send peering messages during init

Do not send any peering messages while we are still working our way
through init().

Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 35da8f9d80e0c6c33fb6c6e00f0bf38f1eb87d0e)

11 years agoclient: pin Inode during readahead
Sage Weil [Thu, 27 Mar 2014 04:52:00 +0000 (21:52 -0700)]
client: pin Inode during readahead

Make sure the Inode does not go away while a readahead is in progress.  In
particular:

 - read_async
   - start a readahead
   - get actual read from cache, return
 - close/release
   - call ObjectCacher::release_set() and get unclean > 0, assert

Fixes: #7867
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f1c7b4ef0cd064a9cb86757f17118d17913850db)

11 years agoosdc/ObjectCacher: call read completion even when no target buffer
Sage Weil [Fri, 28 Mar 2014 19:34:07 +0000 (12:34 -0700)]
osdc/ObjectCacher: call read completion even when no target buffer

If we do no assemble a target bl, we still want to return a valid return
code with the number of bytes read-ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 032d4ec53e125ad91ad27ce58da6f38dcf1da92e)

11 years agoMerge pull request #1517 from ceph/wip-7805
Sage Weil [Fri, 28 Mar 2014 20:25:47 +0000 (13:25 -0700)]
Merge pull request #1517 from ceph/wip-7805

PG: only complete replicas should count toward min_size

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1538 from ceph/wip-6910-emperor
Sage Weil [Thu, 27 Mar 2014 00:18:36 +0000 (17:18 -0700)]
Merge pull request #1538 from ceph/wip-6910-emperor

PG: don't query unfound on empty pgs

11 years agoPG: don't query unfound on empty pgs 1538/head
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs

When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map.  Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.

Fixes: #6910
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 838b6c8387087543ce50837277f7f6b52ae87d00)

11 years agobuild: add gf-complete/jerasure to gitmodule_mirrors
Loic Dachary [Tue, 18 Mar 2014 21:04:13 +0000 (22:04 +0100)]
build: add gf-complete/jerasure to gitmodule_mirrors

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoAdd file to store mirror location of module's.
Sandon Van Ness [Mon, 17 Mar 2014 19:15:32 +0000 (12:15 -0700)]
Add file to store mirror location of module's.

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
11 years agoPG: only complete replicas should count toward min_size 1517/head
Samuel Just [Fri, 21 Mar 2014 20:24:15 +0000 (13:24 -0700)]
PG: only complete replicas should count toward min_size

Backport: emperor,dumpling,cuttlefish
Fixes: #7805
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1484 from ceph/wip-7212.emperor
Sage Weil [Fri, 21 Mar 2014 21:52:12 +0000 (14:52 -0700)]
Merge pull request #1484 from ceph/wip-7212.emperor

backport 7212 fixes to emperor

11 years agomon/Elector: bootstrap on timeout 1484/head
Sage Weil [Sat, 15 Feb 2014 16:59:51 +0000 (08:59 -0800)]
mon/Elector: bootstrap on timeout

Currently if an election times out we call a new
election.  If we have never joined a quorum, bootstrap
instead. This is heavier weight, but captures the case
where, during bootstrap:

 - a and b have learned each others' addresses
 - everybody calls an election
 - a and b form a quorum
 - c loops trying to call an election, but is ignored
   because a and b don't see its address in the monmap

See logs:
  ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-02-14_13:50:04-ceph-deploy-wip-7212-sage-b-testing-basic-plana/83194

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a4bcb1f8129a4ece97bd3419abf1ff45d260ad8e)

11 years agomon: tell MonmapMonitor first about winning an election
Sage Weil [Fri, 14 Feb 2014 19:25:52 +0000 (11:25 -0800)]
mon: tell MonmapMonitor first about winning an election

It is important in the bootstrap case that the very first paxos round
also codify the contents of the monmap itself in order to avoid any manner
of confusing scenarios where subsequent elections are called and people
try to recover and modify paxos without agreeing on who the quorum
participants are.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad7f5dd481a7f45dfe6b50d27ad45abc40950510)

11 years agoMerge pull request #1483 from ceph/wip-7611.emperor
Sage Weil [Mon, 17 Mar 2014 15:19:30 +0000 (08:19 -0700)]
Merge pull request #1483 from ceph/wip-7611.emperor

mon: Monitor: make sure received command is valid before handling it

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agomon: only learn peer addresses when monmap == 0
Sage Weil [Fri, 14 Feb 2014 19:13:26 +0000 (11:13 -0800)]
mon: only learn peer addresses when monmap == 0

It is only safe to dynamically update the address for a peer mon in our
monmap if we are in the midst of the initial quorum formation (i.e.,
monmap.epoch == 0).  If it is a later epoch, we have formed our initial
quorum and any and all monmap changes need to be agreed upon by the quorum
and committed via paxos.

Fixes: #7212
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7bd2104acfeff0c9aa5e648d82ed372f901f767f)

11 years agoceph.in: do not allow using 'tell' with interactive mode 1483/head
Joao Eduardo Luis [Mon, 17 Mar 2014 14:37:09 +0000 (14:37 +0000)]
ceph.in: do not allow using 'tell' with interactive mode

This avoids a lot of hassle when dealing with to whom tell each command
on interactive mode, and even more so if multiple targets are specified.

As so, 'tell' commands should be used while on interactive mode instead.

Backport: dumpling,emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit e39c213c1d230271d23b74086664c2082caecdb9)

11 years agomon: Monitor: make sure received command is valid before handling it
Joao Eduardo Luis [Tue, 11 Mar 2014 16:38:10 +0000 (09:38 -0700)]
mon: Monitor: make sure received command is valid before handling it

Fixes: 7611
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
11 years agoMonitor: pull command mapping out of _allowed_command()
Greg Farnum [Sat, 7 Dec 2013 00:09:36 +0000 (16:09 -0800)]
Monitor: pull command mapping out of _allowed_command()

We want to be able to validate commands against both the leader and
local command sets, so make that functionality generic.

Signed-off-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit f932903646b4549fe7c2f519a18518c6c81f3ef9)

11 years agoRGWListBucketMultiparts: init max_uploads/default_max with 0
Danny Al-Gaaf [Wed, 12 Mar 2014 21:56:44 +0000 (22:56 +0100)]
RGWListBucketMultiparts: init max_uploads/default_max with 0

CID 717377 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
 2. uninit_member: Non-static class member "max_uploads" is not initialized
    in this constructor nor in any functions that it calls.
 4. uninit_member: Non-static class member "default_max" is not initialized
    in this constructor nor in any functions that it calls.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit b23a141d54ffb39958aba9da7f87544674fa0e50)

11 years agoceph_test_rados: wait for commit, not ack
Sage Weil [Thu, 13 Mar 2014 21:49:30 +0000 (14:49 -0700)]
ceph_test_rados: wait for commit, not ack

First, this is what we wanted in the first place

Second, if we wait for ACK, we may look at a user_version value that is
not stable.

Fixes: #7705
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f2124c5846f1e9cb44e66eb2e957b8c7df3e19f4)

Conflicts:

src/test/osd/RadosModel.h

11 years agoMerge pull request #1406 from dachary/wip-7188-emperor
Sage Weil [Sun, 9 Mar 2014 17:56:41 +0000 (10:56 -0700)]
Merge pull request #1406 from dachary/wip-7188-emperor

common: ping existing admin socket before unlink (emperor)

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agocommon: ping existing admin socket before unlink 1247/head 1406/head
Loic Dachary [Sat, 15 Feb 2014 10:43:13 +0000 (11:43 +0100)]
common: ping existing admin socket before unlink

When a daemon initializes it tries to create an admin socket and unlinks
any pre-existing file, regardless. If such a file is in use, it causes
the existing daemon to loose its admin socket.

The AdminSocketClient::ping is implemented to probe an existing socket,
using the "0" message. The AdminSocket::bind_and_listen function is
modified to call ping() on when it finds existing file. It unlinks the
file only if the ping fails.

http://tracker.ceph.com/issues/7188 fixes: #7188

Backport: emperor, dumpling
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 45600789f1ca399dddc5870254e5db883fb29b38)

11 years agoMerge pull request #1365 from ceph/wip-6820.emperor
Sage Weil [Wed, 5 Mar 2014 22:19:14 +0000 (14:19 -0800)]
Merge pull request #1365 from ceph/wip-6820.emperor

mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump

11 years agomon: OSDMonitor: don't crash if formatter is invalid during osd crush dump 1365/head
Joao Eduardo Luis [Fri, 22 Nov 2013 02:17:16 +0000 (02:17 +0000)]
mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump

Code would assume a formatter would always be defined. If a 'plain'
formatter or even an invalid formatter were to be supplied, the monitor
would crash and burn in poor style.

Fixes: 6820
Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 49d2fb71422fe4edfe5795c001104fb5bc8c98c3)

11 years agolibrbd: remove limit on number of objects in the cache
Josh Durgin [Tue, 11 Feb 2014 18:14:36 +0000 (10:14 -0800)]
librbd: remove limit on number of objects in the cache

The number of objects is not a significant indicated of when data
should be written out for rbd. Use the highest possible value for
number of objects and just rely on the dirty data limits to trigger
flushing. When the number of objects is low, and many start being
flushed before they accumulate many requests, it hurts average request
size and performance for many concurrent sequential writes.

Fixes: #7385
Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 0559d31db29ea83bdb6cec72b830d16b44e3cd35)

11 years agoObjectCacher: use uint64_t for target and max values
Josh Durgin [Tue, 11 Feb 2014 19:53:00 +0000 (11:53 -0800)]
ObjectCacher: use uint64_t for target and max values

All the options are uint64_t, but the ObjectCacher was converting them
to int64_t. There's never any reason for these to be negative, so
change the type.

Adjust a few conditionals so that they only convert known-positive
signed values to uint64_t before comparing with the target and max
values. Leave the actual stats accounting as loff_t for now, since
bugs in accounting will have bad effects if negative values wrap
around.

Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit db034acf546a72739ff6543241543f3bd651f3ae)

11 years agoObjectCacher: remove max_bytes and max_ob arguments to trim()
Josh Durgin [Tue, 11 Feb 2014 18:35:14 +0000 (10:35 -0800)]
ObjectCacher: remove max_bytes and max_ob arguments to trim()

These are never passed, so replace them with the defaults.

Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit bf8cf2d6d21a204a099347f3dcd5b48100b8c445)

11 years agoMerge pull request #1208 from dachary/emperor
Sage Weil [Tue, 11 Feb 2014 16:31:46 +0000 (08:31 -0800)]
Merge pull request #1208 from dachary/emperor

common: admin socket fallback to json-pretty format (emperor)

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agocommon: admin socket fallback to json-pretty format 1208/head
Loic Dachary [Mon, 10 Feb 2014 22:42:38 +0000 (23:42 +0100)]
common: admin socket fallback to json-pretty format

If the format argument to a command sent to the admin socket is not
among the supported formats ( json, json-pretty, xml, xml-pretty ) the
new_formatter function will return null and the AdminSocketHook::call
function must fall back to a sensible default.

The CephContextHook::call and HelpHook::call failed to do that and a
malformed format argument would cause the mon to crash. A check is added
to each of them and fallback to json-pretty if the format is not
recognized.

To further protect AdminSocketHook::call implementations from similar
problems the format argument is checked immediately after accepting the
command in AdminSocket::do_accept and replaced with json-pretty if it is
not known.

A test case is added for both CephContextHook::call and HelpHook::call
to demonstrate the problem exists and is fixed by the patch.

Three other instances of unsafe calls to new_formatter were found and
a fallback to json-pretty was added. All other calls have been audited
and appear to be safe.

http://tracker.ceph.com/issues/7378 fixes #7378

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 165e76d4d03ffcc490fd3c2ba60fb37372990d0a)

11 years agoqa: add script for testing rados client timeout options
Josh Durgin [Thu, 6 Feb 2014 01:26:02 +0000 (17:26 -0800)]
qa: add script for testing rados client timeout options

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 9e62beb80b6c92a97ec36c0db5ea39e417661b35)

11 years agorados: check return values for commands that can now fail
Josh Durgin [Thu, 6 Feb 2014 01:25:24 +0000 (17:25 -0800)]
rados: check return values for commands that can now fail

A few places were not checking the return values of commands, since
they could not fail before timeouts were added.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 79c1874346ff55e2dc74ef860db16ce70242fd00)

11 years agolibrados: check and return on error so timeouts work
Josh Durgin [Thu, 6 Feb 2014 01:24:16 +0000 (17:24 -0800)]
librados: check and return on error so timeouts work

Some functions could not previously return errors, but they had an
int return value, which can now receive ETIMEDOUT.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 8e9459e897b1bc2f97d52ee07701fd22069efcf3)

11 years agomsg/Pipe: add option to restrict delay injection to specific msg type
Josh Durgin [Thu, 6 Feb 2014 01:22:14 +0000 (17:22 -0800)]
msg/Pipe: add option to restrict delay injection to specific msg type

This makes it possible to test timeouts reliably by delaying certain
messages effectively forever, but still being able to e.g. connect and
authenticate to the monitors.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit d389e617c1019e44848330bf9570138ac7b0e5d4)

11 years agoMonClient: add a timeout on commands for librados
Josh Durgin [Tue, 4 Feb 2014 02:30:00 +0000 (18:30 -0800)]
MonClient: add a timeout on commands for librados

Just use the conf option directly, since librados is the only caller.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 671a76d64bc50e4f15f4c2804d99887e22dcdb69)

11 years agoObjecter: implement mon and osd operation timeouts
Josh Durgin [Tue, 4 Feb 2014 01:59:21 +0000 (17:59 -0800)]
Objecter: implement mon and osd operation timeouts

This captures almost all operations from librados other than mon_commands().

Get the values for the timeouts from the Objecter constructor, so only
librados uses them.

Add C_Cancel_*_Op, finish_*_op(), and *_op_cancel() for each type of
operation, to mirror those for Op. Create a callback and schedule it
in the existing timer thread if the timeouts are specified.

Fixes: #6507
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 3e1f7bbb4217d322f4e0ece16e676cd30ee42a20)

Conflicts:
src/osd/ReplicatedPG.cc

11 years agolibrados: add timeout to wait_for_osdmap()
Josh Durgin [Mon, 3 Feb 2014 20:53:15 +0000 (12:53 -0800)]
librados: add timeout to wait_for_osdmap()

This is used by several pool operations independent of the objecter,
including rados_ioctx_create() to look up the pool id in the first
osdmap.

Unfortunately we can't just rely on WaitInterval returning ETIMEDOUT,
since it may also get interrupted by a signal, so we can't avoid
keeping track of time explicitly here.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 1829d2c9fd13f2cbae4e192c9feb553047dad42c)

Conflicts:
src/librados/RadosClient.cc

11 years agoconf: add options for librados timeouts
Josh Durgin [Mon, 3 Feb 2014 20:09:34 +0000 (12:09 -0800)]
conf: add options for librados timeouts

These will be implemented in subsequent patches.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 0dcceff1378d85ca6d81d102d201890b8a71af6b)

11 years agoclient: use 64-bit value in sync read eof logic
Sage Weil [Mon, 3 Feb 2014 16:54:14 +0000 (08:54 -0800)]
client: use 64-bit value in sync read eof logic

The file size can jump to a value that is very much larger than our current
position (for example, it could be a disk image file that gets a sparse
write at a large offset).  Use a 64-bit value so that 'some' doesn't
overflow.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: John Spray <john.spray@inktank.com>
(cherry picked from commit 7ff2b541c24d1c81c3bcfbcb347694c2097993d7)

11 years agoOSDMap: fix deepish_copy_from
Sage Weil [Wed, 29 Jan 2014 02:46:37 +0000 (18:46 -0800)]
OSDMap: fix deepish_copy_from

Start with a shallow copy!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d0f13f54146694a197535795da15b8832ef4b56f)

Conflicts:

src/osd/OSDMap.h

11 years agoosd: OSDMonitor: ignore pgtemps from removed pool
Joao Eduardo Luis [Tue, 28 Jan 2014 15:54:33 +0000 (15:54 +0000)]
osd: OSDMonitor: ignore pgtemps from removed pool

There's a window in-between receiving an MOSDPGTemp message from an OSD
and actually handling it that may lead to the pool the pg temps refer to
no longer existing. This may happen if the MOSDPGTemp message is queued
pending dispatching due to an on-going proposal (maybe even the pool
removal).

This patch fixes such behavior in two steps:

1. Check if the pool exists in the osdmap upon preprocessing
 - if pool does not exist in the osdmap, then the pool must have been
   removed prior to handling the message, but after the osd sent it.
 - safe to ignore the pg update
2. If all pg updates in the message have been ignored, ignore the whole
   message.  Otherwise, let prepare handle the rest.

3. Recheck if pool exists in the osdmap upon prepare
 - We may have ignored this pg back in preprocess, but other pgs in the
   message may have led the message to be passed on to prepare; ignore
   pg update once more.
4. Check if pool is pending removal and ignore pg update if so.

We delegate checking the pending value to prepare_pgtemp() because in this
case we should only ignore the update IFF the pending value is in fact
committed.  Otherwise we should retry the message.  prepare_pgtemp() is
the appropriate place to do so.

Fixes: 7116
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit f513f66f48383a07c70ca18a4dba6c2449ea9860)

11 years agoOSDMonitor: use deepish_copy_from for remove_down_pg_temp
Sage Weil [Tue, 28 Jan 2014 19:00:34 +0000 (11:00 -0800)]
OSDMonitor: use deepish_copy_from for remove_down_pg_temp

This is a backport of 368852f6c0a884b8fdc80a5cd6f9ab72e814d412.

Make a deep copy of the OSDMap to avoid clobbering the in-memory copy with
the call to apply_incremental.

Fixes: #7060
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoOSDMap: deepish_copy_from()
Sage Weil [Fri, 24 Jan 2014 19:03:26 +0000 (11:03 -0800)]
OSDMap: deepish_copy_from()

Make a deep(ish) copy of another OSDMap.  Unfortunatley we can't make the
compiler-generated copy operator/constructors private until c++11.  :(

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bd54b9841b9255406e56cdc7269bddb419453304)

11 years agobuffer: make 0-length splice() a no-op
Sage Weil [Tue, 28 Jan 2014 18:26:12 +0000 (10:26 -0800)]
buffer: make 0-length splice() a no-op

This was causing a problem in the Striper, but fixing it here will avoid
corner cases all over the tree.  Note that we have to bail out before
the end-of-buffer check to avoid hitting that check when the bufferlist is
also empty.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit ff5abfbdae07ae8a56fa83ebaa92000896f793c2)

11 years agoosdc/Striper: test zero-length add_partial_result
Sage Weil [Tue, 28 Jan 2014 18:09:17 +0000 (10:09 -0800)]
osdc/Striper: test zero-length add_partial_result

If we add a partial result that is 0-length, we used to hit an assert in
buffer::list::splice().  Add a unit test to verify the fix.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 28c7388d320a47657c2e12c46907f1bf40672b08)

11 years agopackaging: apply udev hack rule to RHEL
Derek Yarnell [Mon, 27 Jan 2014 19:27:51 +0000 (12:27 -0700)]
packaging: apply udev hack rule to RHEL

In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.

Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)

Fixes http://tracker.ceph.com/issues/7245

Signed-off-by: Ken Dreyer <ken.dreyer@inktank.com>
(cherry picked from commit 64a0b4fa563795bc22753940aa3a4a2946113109)

11 years agomon/MDSMonitor: do not generate mdsmaps from already-laggy mds
Sage Weil [Tue, 21 Jan 2014 19:29:56 +0000 (11:29 -0800)]
mon/MDSMonitor: do not generate mdsmaps from already-laggy mds

There is one path where a mds that is not sending its beacon (e.g.,
because it is not running at all) will lead to proposal of new mdsmaps.
Fix it.

Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 584c2dd6bea3fe1a3c7f306874c054ce0cf0d2b5)

11 years agoFix #7187: Include all summary items in JSON health output
John Spray [Mon, 20 Jan 2014 11:08:27 +0000 (11:08 +0000)]
Fix #7187: Include all summary items in JSON health output

Signed-off-by: John Spray <john.spray@inktank.com>
(cherry picked from commit fdf3b5520d150f14d90bdfc569b70c07b0579b38)

11 years agoqa/workunits/cephtool/test.sh: hashpspool takes int in emperor
Sage Weil [Fri, 10 Jan 2014 19:33:42 +0000 (11:33 -0800)]
qa/workunits/cephtool/test.sh: hashpspool takes int in emperor

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agorgw: convert bucket info if needed
Yehuda Sadeh [Tue, 7 Jan 2014 02:32:42 +0000 (18:32 -0800)]
rgw: convert bucket info if needed

Fixes: #7110
In dumpling, the bucket info was separated into bucket entry point and
bucket instance objects. When setting bucket attrs we only ended up
updating the bucket instance object. However, pre-dumpling buckets still
keep everything at the entry-point object, so acl changes didn't affect
anything (because we never updated the entry point). This change just
converts the bucket info into the new format.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit a5f8cc7ec9ec8bef4fbc656066b4d3a08e5b215b)

11 years agoosd: ignore OSDMap messages while we are initializing
Sage Weil [Sun, 5 Jan 2014 06:40:43 +0000 (22:40 -0800)]
osd: ignore OSDMap messages while we are initializing

The mon may occasionally send OSDMap messages to random OSDs, but is not
very descriminating in that we may not have authenticated yet.  Ignore any
messages if that is the case; we will reqeust whatever we need during the
BOOTING state.

Fixes: #7093
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f68de9f352d53e431b1108774e4a23adb003fe3f)

11 years agomon: only send messages to current OSDs
Sage Weil [Sun, 5 Jan 2014 06:43:26 +0000 (22:43 -0800)]
mon: only send messages to current OSDs

When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.

Fixes: #7093
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 98ed9ac5fed6eddf68f163086df72faabd9edcde)

11 years agoqa: test for error when ceph osd rm is EBUSY
Loic Dachary [Sun, 15 Dec 2013 21:59:51 +0000 (22:59 +0100)]
qa: test for error when ceph osd rm is EBUSY

http://tracker.ceph.com/issues/6824 fixes #6824

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 31507c90f0161c4569a2cc634c0b5f671179440a)

11 years agomon: set ceph osd (down|out|in|rm) error code on failure
Loic Dachary [Sun, 15 Dec 2013 15:27:02 +0000 (16:27 +0100)]
mon: set ceph osd (down|out|in|rm) error code on failure

Instead of always returning true, the error code is set if at least one
operation fails.

EINVAL if the OSD id is invalid (osd.foobar for instance).
EBUSY if trying to remove and OSD that is up.

When used with the ceph command line, it looks like this:

    ceph -c ceph.conf osd rm osd.0
    Error EBUSY: osd.0 is still up; must be down before removal.
    kill PID_OF_osd.0
    ceph -c ceph.conf osd down osd.0
    marked down osd.0.
    ceph -c ceph.conf osd rm osd.0 osd.1
    Error EBUSY: removed osd.0, osd.1 is still up; must be down before removal.

http://tracker.ceph.com/issues/6824 fixes #6824

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 15b8616b13a327701c5d48c6cb7aeab8fcc4cafc)

11 years agomon: OSDMonitor: fix some annoying whitespace
Joao Eduardo Luis [Tue, 29 Oct 2013 20:30:37 +0000 (20:30 +0000)]
mon: OSDMonitor: fix some annoying whitespace

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 42c4137cbfacad5654f02c6608cc0e81b45c06be)

11 years agolibrbd: call user completion after incrementing perfcounters
Josh Durgin [Fri, 27 Dec 2013 01:38:52 +0000 (17:38 -0800)]
librbd: call user completion after incrementing perfcounters

The perfcounters (and the ictx) are only valid while the image is
still open.  If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object is has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.

The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.

Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 4cea7895da7331b84d8c6079851fdc0ff2f4afb1)

11 years agoobjecter: don't take extra throttle budget for resent ops
Josh Durgin [Sat, 7 Dec 2013 00:03:20 +0000 (16:03 -0800)]
objecter: don't take extra throttle budget for resent ops

These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 8d0180b1b7b48662daef199931efc7f2a6a1c431)

11 years agorbd: check write return code during bench-write
Josh Durgin [Fri, 6 Dec 2013 01:44:37 +0000 (17:44 -0800)]
rbd: check write return code during bench-write

This is allows rbd-bench to detect http://tracker.ceph.com/issues/6938
when combined with rapidly changing the mon osd full ratio.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 3caf3effcb113f843b54e06099099909eb335453)

11 years agoobjecter: resend all writes after osdmap loses the full flag
Josh Durgin [Fri, 6 Dec 2013 01:36:33 +0000 (17:36 -0800)]
objecter: resend all writes after osdmap loses the full flag

Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.

Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit e32874fc5aa6f59494766b7bbeb2b6ec3d8f190e)

11 years agoosd: drop writes when full instead of returning an error
Josh Durgin [Fri, 6 Dec 2013 01:34:38 +0000 (17:34 -0800)]
osd: drop writes when full instead of returning an error

There's a race between the client and osd with a newly marked full
osdmap.  If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.

If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later.  -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO.  Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.

To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.

Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 4111729dda7437c23f59e7100b3c4a9ec4101dd0)

11 years agoobjecter: clean pause / unpause logic
Yehuda Sadeh [Thu, 7 Nov 2013 00:55:52 +0000 (16:55 -0800)]
objecter: clean pause / unpause logic

op->paused holds now whether operation should be paused or not, and it's
being updated when scanning requests. No need to do a second scan.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 5fe3dc647bf936df8e1eb2892b53f44f68f19821)

11 years agoobjecter: set op->paused in recalc_op_target(), resend in not paused
Yehuda Sadeh [Thu, 7 Nov 2013 00:15:47 +0000 (16:15 -0800)]
objecter: set op->paused in recalc_op_target(), resend in not paused

When going through scan_requests() in handle_osd_map() we need to make
sure that if an op should not be paused anymore then set it on the op
itself, and return NEED_RESEND. Otherwise we're going to miss reset of
the full flag.
Also in handle_osd_map(), make sure that op shouldn't be paused before
sending it. There's a lot of cleanup around that area that we should
probably be doing, make the code much more tight.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 98ab7d64a191371fa39d840c5f8e91cbaaa1d7b7)

11 years agoobjecter: don't resend paused ops
Josh Durgin [Wed, 6 Nov 2013 02:46:37 +0000 (10:46 +0800)]
objecter: don't resend paused ops

Paused ops are meant to block on the client side until a new map that
unpauses them is recieved. If we send paused writes when the FULL flag
is set, we'll get -ENOSPC from the osds, which is not what Objecter
users expect. This may cause rbd without caching to produce an I/O
error instead of waiting for the cluster to have capacity.

Fixes: #6725
Backport: dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit c5c399d327cfc0d232d9ec7d49ababa914d0b21a)

11 years agov0.72.2 v0.72.2
Gary Lowell [Fri, 20 Dec 2013 19:28:37 +0000 (19:28 +0000)]
v0.72.2

11 years agorgw: fix use-after-free when releasing completion handle
Yehuda Sadeh [Wed, 18 Dec 2013 21:11:01 +0000 (13:11 -0800)]
rgw: fix use-after-free when releasing completion handle

Backport: emperor, dumpling
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c8890ab2d46fe8e12200a0d2f9eab31c461fb871)

11 years agorgw: don't return data within the librados cb
Yehuda Sadeh [Wed, 18 Dec 2013 21:10:21 +0000 (13:10 -0800)]
rgw: don't return data within the librados cb

Fixes: #7030
The callback is running within a single Finisher thread, thus we
shouldn't block there. Append read data to a list and flush it within
the iterate context.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit d6a4f6adfaa75c3140d07d6df7be03586cc16183)

11 years agoPartial revert "mon: osd pool set syntax relaxed, modify unit tests"
Sage Weil [Mon, 2 Dec 2013 06:21:31 +0000 (22:21 -0800)]
Partial revert "mon: osd pool set syntax relaxed, modify unit tests"

This reverts commit 08327fed8213a5d24cd642e12b38a171b98924cb, except
for the hashpspool bit.  We switched back to an integer argument in
commit 337195f04653eed8e8f153a5b074f3bd48408998.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e80ab94bf44e102fcd87d16dc11e38ca4c0eeadb)
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agomon: OSDMonitor: drop cmdval_get() for unused variable
Joao Eduardo Luis [Tue, 19 Nov 2013 23:21:11 +0000 (23:21 +0000)]
mon: OSDMonitor: drop cmdval_get() for unused variable

We don't ever use any value as a float, so just drop obtaining it.  This
makes it easier to partially revert 2fe0d0d9 in an upcoming patch.

Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 7191bb2b2485c7819ca7b9d9434d803d0c94db7a)
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agomon: OSDMonitor: receive CephInt on 'osd pool set' instead on CephString
Joao Eduardo Luis [Fri, 22 Nov 2013 02:10:35 +0000 (02:10 +0000)]
mon: OSDMonitor: receive CephInt on 'osd pool set' instead on CephString

This partially reverts 2fe0d0d9 in order to allow Emperor monitors to
forward mon command messages to Dumpling monitors without breaking a
cluster.

The need for this patch became obvious after issue #6796 was triggered.
Basically, in a mixed cluster of Emperor/Dumpling monitors, if a client
happens to obtain the command descriptions from an Emperor monitor and
then issue an 'osd pool set' this can turn out in one of two ways:

1. client msg gets forwarded to an Emperor leader and everything's a-okay;
2. client msg gets forwarded to a Dumpling leader and the string fails to
be interpreted without the monitor noticing, thus leaving the monitor with
an uninitialized variable leading to trouble.

If 2 is triggered, a multitude of bad things can happen, such as thousands
of pg splits, due to a simple 'osd set pool foo pg_num 128' turning out
to be interpreted as 109120394 or some other random number.

This patch is such that we make sure the client sends an integer instead
of a string. We also make sure to interpret anything the client sends as
possibly being a string, or an integer.

Fixes: 6796
Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 337195f04653eed8e8f153a5b074f3bd48408998)
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoinit, upstart: prevent daemons being started by both
Josh Durgin [Mon, 25 Nov 2013 21:43:43 +0000 (13:43 -0800)]
init, upstart: prevent daemons being started by both

There can be only one init system starting a daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit,
but if an upstart file is present in the daemon's directory, upstart
will try to start them, regardless of host entries in ceph.conf.

If there's an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.

Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit or upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.

Backport: emperor, dumpling
Reported-by: Tim Spriggs <tims@uahirise.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5e34beb61b3f5a1ed4afd8ee2fe976de40f95ace)

11 years agorgw: don't error out on empty owner when setting acls
Yehuda Sadeh [Wed, 27 Nov 2013 21:34:00 +0000 (13:34 -0800)]
rgw: don't error out on empty owner when setting acls

Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies empty owner field when trying to set acls on object
/ bucket. We errored out as it didn't match the current owner name, but
with this change we ignore it.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 14cf4caff58cc2c535101d48c53afd54d8632104)

11 years agorgw: lower some debug message
Yehuda Sadeh [Fri, 22 Nov 2013 15:04:01 +0000 (07:04 -0800)]
rgw: lower some debug message

Fixes: #6084
Backport: dumpling, emperor

Reported-by: Ron Allred <rallred@itrefined.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit b35fc1bc2ec8c8376ec173eb1c3e538e02c1694e)

11 years agoReplicatedPG: test for missing head before find_object_context
Samuel Just [Tue, 12 Nov 2013 23:15:26 +0000 (15:15 -0800)]
ReplicatedPG: test for missing head before find_object_context

find_object_context doesn't return EAGAIN for a missing head.
I chose not to change that behavior since it might hide bugs
in the future.  All other callers check for missing on head
before calling into find_object_context because we potentially
need head or snapdir to map a snapid onto a clone.

Backport: emperor
Fixes: 6758
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit dd9d8b020286d5e3a69455023c3724a7b436d687)

11 years agoosd: fix bench block size
Josh Durgin [Mon, 18 Nov 2013 22:39:12 +0000 (14:39 -0800)]
osd: fix bench block size

The command was declared to take 'size' in dumpling, but was trying to
read 'bsize' instead, so it always used the default of 4MiB. Change
the bench command to read 'size', so it matches what existing clients
are sending.

Fixes: #6795
Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 40a76ef0d09f8ecbea13712410d9d34f25b91935)

11 years agov0.72.1 v0.72.1
Gary Lowell [Fri, 15 Nov 2013 06:25:42 +0000 (06:25 +0000)]
v0.72.1

11 years agoceph-filestore-tool: add tool for fixing lost objects
Samuel Just [Wed, 13 Nov 2013 22:39:26 +0000 (14:39 -0800)]
ceph-filestore-tool: add tool for fixing lost objects

Used to repair: #6761
Backport: emperor
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoosd_types: fix object_info_t backwards compatibility
Samuel Just [Wed, 13 Nov 2013 21:24:10 +0000 (13:24 -0800)]
osd_types: fix object_info_t backwards compatibility

Shipping an object_info_t to a replica with the dirty
flag set would cause the replica to interpret that
object as being lost.  Instead, we always encode
lost into the slot where dumpling expects to find
it and add another field at the end of the encoding.

Backport: emperor
Fixes: #6761
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agov0.72 v0.72
Gary Lowell [Thu, 7 Nov 2013 20:27:35 +0000 (20:27 +0000)]
v0.72

11 years agorgw: deny writes to a secondary zone by non-system users
Yehuda Sadeh [Tue, 5 Nov 2013 22:54:20 +0000 (14:54 -0800)]
rgw: deny writes to a secondary zone by non-system users

Fixes: #6678
We don't want to allow regular users to write to secondary zones,
otherwise we'd end up with data inconsistencies.

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agodoc/release-notes: note crush update timeout on startup change
Sage Weil [Thu, 7 Nov 2013 04:02:09 +0000 (20:02 -0800)]
doc/release-notes: note crush update timeout on startup change

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosdmaptool: fix cli tests
Sage Weil [Thu, 7 Nov 2013 03:59:56 +0000 (19:59 -0800)]
osdmaptool: fix cli tests

From c22c84a88c22688b6044ab37f65a3fe40dfe1983.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoCeph: Fix memory leak in chain_flistxattr()
Li Wang [Thu, 7 Nov 2013 02:44:30 +0000 (10:44 +0800)]
Ceph: Fix memory leak in chain_flistxattr()

Free allocated memory before return.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoReplicatedPG: don't skip missing if sentries is empty on pgls
Samuel Just [Wed, 6 Nov 2013 22:33:03 +0000 (14:33 -0800)]
ReplicatedPG: don't skip missing if sentries is empty on pgls

Formerly, if sentries is empty, we skip missing.  In general,
we need to continue adding items from missing until we get
to next (returned from collection_list_partial) to avoid
missing any objects.

Fixes: #6633
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
11 years agoPG: fix operator<<,log_wierdness log bound warning
Samuel Just [Wed, 6 Nov 2013 05:48:53 +0000 (21:48 -0800)]
PG: fix operator<<,log_wierdness log bound warning

Split may cause holes such that head != tail and yet
log.empty().

Fixes: #6722
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
11 years agoPGLog::rewind_divergent_log: log may not contain newhead
Samuel Just [Wed, 6 Nov 2013 01:47:48 +0000 (17:47 -0800)]
PGLog::rewind_divergent_log: log may not contain newhead

Due to split, there may be a hole at newhead.

Fixes: #6722
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
11 years agoMerge pull request #824 from dmick/next
Sage Weil [Wed, 6 Nov 2013 15:46:02 +0000 (07:46 -0800)]
Merge pull request #824 from dmick/next

osdmaptool: don't put progress on stdout

Reviewed-by: Sage Weil <sage@inktank.com>