git.apps.os.sepia.ceph.com Git - ceph.git/log
12 years ago msgr: don't queue message on closed pipe
Sage Weil [Sun, 23 Dec 2012 17:19:05 +0000 (09:19 -0800)]
msgr: don't queue message on closed pipe

If we have a con that refs a pipe but it is closed, don't use it.  If
the ref is still there, it is only because we are racing with fault()
and it is about to (or just was) be detached.  Either way,

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6339c5d43974f4b495f15d199e01a141e74235f5)

12 years ago msgr: atomically queue first message with connect_rank
Sage Weil [Sun, 23 Dec 2012 05:24:52 +0000 (21:24 -0800)]
msgr: atomically queue first message with connect_rank

Atomically queue the first message on the new pipe, without dropping
and retaking pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7bf0b0854d1f2706a3a2302bcbf92dd5c8c888ef)

12 years ago config_opts.h: default osd_recovery_delay_start to 0
Samuel Just [Thu, 10 Jan 2013 19:06:02 +0000 (11:06 -0800)]
config_opts.h: default osd_recovery_delay_start to 0

This setting was intended to prevent recovery from overwhelming peering traffic
by delaying the recovery_wq until osd_recovery_delay_start seconds after pgs
stop being added to it.  This should be less necessary now that recovery
messages are sent with strictly lower priority than peering messages.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
(cherry picked from commit 44625d4460f61effe2d63d8280752f10f159e7b4)

12 years ago osdmaptool: more fix cli test
Sage Weil [Thu, 17 Jan 2013 05:19:18 +0000 (21:19 -0800)]
osdmaptool: more fix cli test

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b0162fab3d927544885f2b9609b9ab3dc4aaff74)

12 years ago osdmaptool: fix cli test
Sage Weil [Thu, 17 Jan 2013 05:10:26 +0000 (21:10 -0800)]
osdmaptool: fix cli test

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bd8765c918174aea606069124e43c480c809943)

12 years ago osdmaptool: allow user to specify pool for test-map-object
Samuel Just [Wed, 16 Jan 2013 22:21:47 +0000 (14:21 -0800)]
osdmaptool: allow user to specify pool for test-map-object

Fixes: #3820
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Gregory Farnum <greg@inktank.com>
(cherry picked from commit 85eb8e382a26dfc53df36ae1a473185608b282aa)

12 years ago rados.cc: fix rmomapkey usage: val not needed
David Zafman [Wed, 16 Jan 2013 20:41:16 +0000 (12:41 -0800)]
rados.cc: fix rmomapkey usage: val not needed

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <samuel.just@inktank.com>
(cherry picked from commit 625c3cb9b536a0cff7249b8181b7a4f09b1b4f4f)

12 years ago librados.hpp: fix omap_get_vals and omap_get_keys comments
Samuel Just [Wed, 16 Jan 2013 05:27:23 +0000 (21:27 -0800)]
librados.hpp: fix omap_get_vals and omap_get_keys comments

We list keys greater than start_after.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 3f0ad497b3c4a5e9bef61ecbae5558ae72d4ce8b)

12 years ago rados.cc: use omap_get_vals_by_keys in getomapval
Samuel Just [Wed, 16 Jan 2013 05:26:22 +0000 (21:26 -0800)]
rados.cc: use omap_get_vals_by_keys in getomapval

Fixes: #3811
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit cb5e2be418924cf8b2c6a6d265a7a0327f08d00a)

12 years ago rados.cc: fix listomapvals usage: key,val are not needed
Samuel Just [Wed, 16 Jan 2013 05:24:50 +0000 (21:24 -0800)]
rados.cc: fix listomapvals usage: key,val are not needed

Fixes: #3812
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 44c45e520cc2e60c6c803bb245edb9330bff37e4)

12 years ago rgw: copy object should not copy source acls
Yehuda Sadeh [Wed, 16 Jan 2013 23:01:47 +0000 (15:01 -0800)]
rgw: copy object should not copy source acls

Fixes: #3802
Backport: argonaut, bobtail

When using the S3 api and x-amz-metadata-directive is
set to COPY we used to copy complete metadata of source
object. However, this shouldn't include the source ACLs.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 37dbf7d9df93dd0e92019be31eaa1a19dd9569c7)

12 years ago OSD: only trim up to the oldest map still in use by a pg
Samuel Just [Fri, 11 Jan 2013 19:02:15 +0000 (11:02 -0800)]
OSD: only trim up to the oldest map still in use by a pg

map_cache.cached_lb() provides us with a lower bound across
all pgs for in-use osdmaps.  We cannot trim past this since
those maps are still in use.

backport: bobtail
Fixes: #3770
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 66eb93b83648b4561b77ee6aab5b484e6dba4771)
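
Note (illustrative sketch, not part of the commit): the trimming rule above amounts to clamping the requested trim point to the lower bound of in-use maps. The names `safe_trim_bound`, `requested`, and `in_use_lower_bound` are hypothetical; only `map_cache.cached_lb()` appears in the commit message, and it is represented here by the second argument.

```cpp
#include <algorithm>
#include <cstdint>

using epoch_t = std::uint32_t;  // hypothetical alias for this sketch

// Return the epoch we may trim up to: never beyond the lower bound of
// osdmaps that some PG still references.
epoch_t safe_trim_bound(epoch_t requested, epoch_t in_use_lower_bound) {
  return std::min(requested, in_use_lower_bound);
}
```

With a requested trim point of 120 and an in-use lower bound of 100, the sketch stops at 100; a request of 80 is left unchanged.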

12 years ago Revert "osdmap: spread replicas across hosts with default crush map"
Sage Weil [Mon, 14 Jan 2013 16:15:02 +0000 (08:15 -0800)]
Revert "osdmap: spread replicas across hosts with default crush map"

This reverts commit 503917f0049d297218b1247dc0793980c39195b3.

This breaks vstart and teuthology configs.  A better fix is coming.

12 years ago mon: OSDMonitor: don't output to stdout in plain text if json is specified
Joao Eduardo Luis [Thu, 10 Jan 2013 18:54:12 +0000 (18:54 +0000)]
mon: OSDMonitor: don't output to stdout in plain text if json is specified

Fixes: #3748
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 410906e04936c935903526f26fb7db16c412a711)

12 years ago osdmap: spread replicas across hosts with default crush map
Sage Weil [Sat, 12 Jan 2013 01:23:22 +0000 (17:23 -0800)]
osdmap: spread replicas across hosts with default crush map

This is more often the case than not, and we don't have a good way to
magically know what size of cluster the user will be creating.  Better to
err on the side of doing the right thing for more people.

Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 7ea5d84fa3d0ed3db61eea7eb9fa8dbee53244b6)

12 years ago ReplicatedPG: fix snapdir trimming
Samuel Just [Thu, 10 Jan 2013 03:17:23 +0000 (19:17 -0800)]
ReplicatedPG: fix snapdir trimming

The previous logic was both complicated and not correct.  Consequently,
we have been tending to drop snapcollection links in some cases.  This
has resulted in clones incorrectly not being trimmed.  This patch
replaces the logic with something less efficient but hopefully a bit
clearer.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f42c37359d976d1fe90f2d3b877b9b0268adc0b)

12 years ago v0.56.1 v0.56.1
Gary Lowell [Mon, 7 Jan 2013 21:33:30 +0000 (13:33 -0800)]
v0.56.1

12 years ago msg/Pipe: prepare Message data for wire under pipe_lock
Sage Weil [Sun, 6 Jan 2013 16:38:27 +0000 (08:38 -0800)]
msg/Pipe: prepare Message data for wire under pipe_lock

We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.

Related to #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d16ad9263d7b1d3c096f56c56e9631fae8509651)

12 years ago msgr: update Message envelope in encode, not write_message
Sage Weil [Sun, 6 Jan 2013 16:33:01 +0000 (08:33 -0800)]
msgr: update Message envelope in encode, not write_message

Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message().  This removes most modifications from
Pipe::write_message().

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 40706afc66f485b2bd40b2b4b1cd5377244f8758)

12 years ago msg/Pipe: encode message inside pipe_lock
Sage Weil [Sun, 6 Jan 2013 16:25:40 +0000 (08:25 -0800)]
msg/Pipe: encode message inside pipe_lock

This modifies bufferlists in the Message struct, and it is possible
for multiple instances of the Pipe to get references on the Message;
make sure they don't modify those bufferlists concurrently.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4cfc4903c6fb130b6ac9105baf1f66fbda797f14)

12 years ago msg/Pipe: associate sending msgs to con inside lock
Sage Weil [Sat, 5 Jan 2013 18:39:08 +0000 (10:39 -0800)]
msg/Pipe: associate sending msgs to con inside lock

Associate a sending message with the connection inside the pipe_lock.
This way if a racing thread tries to steal these messages it will
be sure to reset the con pointer *after* we do, such that the con
pointer is valid in encode_payload() (and later).

This may be part of #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a058f16113efa8f32eb5503d5443aa139754d479)

12 years ago msg/Pipe: fix msg leak in requeue_sent()
Sage Weil [Sat, 5 Jan 2013 17:29:50 +0000 (09:29 -0800)]
msg/Pipe: fix msg leak in requeue_sent()

The sent list owns a reference to each message.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2a1eb466d3f8e25ec8906b3ca6118a14c4e269d2)

12 years ago osdc/Objecter: fix linger_ops iterator invalidation on pool deletion
Sage Weil [Mon, 7 Jan 2013 20:58:39 +0000 (12:58 -0800)]
osdc/Objecter: fix linger_ops iterator invalidation on pool deletion

The call to check_linger_pool_dne() may unregister the linger request,
invalidating the iterator.  To avoid this, increment the iterator at
the top of the loop.

This mirrors the fix in 4bf9078286d58c2cd4e85cb8b31411220a377092 for
regular non-linger ops.

Fixes: #3734
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 62586884afd56f2148205bdadc5a67037a750a9b)
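
Note (illustrative sketch, not part of the commit): the "increment the iterator at the top of the loop" pattern can be shown with a plain `std::map`. The function name and the `bool` payload (true meaning "this entry should be unregistered") are assumptions made for the example; the real Objecter types are not reproduced here.

```cpp
#include <map>
#include <vector>

// Visit every entry; entries flagged true are erased while iterating.
// Returns the ids that were visited, in order.
std::vector<int> visit_and_maybe_unregister(std::map<int, bool>& linger_ops) {
  std::vector<int> visited;
  auto p = linger_ops.begin();
  while (p != linger_ops.end()) {
    auto cur = p;
    ++p;  // advance first, so erasing *cur cannot invalidate p
    visited.push_back(cur->first);
    if (cur->second)
      linger_ops.erase(cur);  // std::map::erase invalidates only the erased iterator
  }
  return visited;
}
```

Because `p` already points at the next element before anything is erased, every entry is still visited exactly once.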

12 years ago os/FileJournal: include limits.h
Sage Weil [Sun, 6 Jan 2013 04:53:49 +0000 (20:53 -0800)]
os/FileJournal: include limits.h

Needed for IOV_MAX.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ce49968938ca3636f48fe543111aa219f36914d8)

12 years ago osd: special case CALL op to not have RD bit effects
Sage Weil [Sat, 5 Jan 2013 01:43:41 +0000 (17:43 -0800)]
osd: special case CALL op to not have RD bit effects

In commit 20496b8d2b2c3779a771695c6f778abbdb66d92a we treat a CALL as
different from a normal "read", but we did not adjust the behavior
determined by the RD bit in the op.  We tried to fix that in
91e941aef9f55425cc12204146f26d79c444cfae, but changing the op code breaks
compatibility, so that was reverted.

Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit.  (And fix one lingering user to use that
helper appropriately.)

Fixes: #3731
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 988a52173522e9a410ba975a4e8b7c25c7801123)
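
Note (illustrative sketch, not part of the commit): one way to picture "special-case CALL in the helper" is a single predicate that checks the RD bit but excludes the CALL op code. All constants and the name `sketch_op_mode_read` are hypothetical and do not reflect Ceph's actual op encoding.

```cpp
#include <cstdint>

// Hypothetical encoding for this sketch: high bits carry mode flags, and
// CALL happens to carry the RD bit in its op code.
constexpr std::uint32_t MODE_RD = 0x1000;
constexpr std::uint32_t MODE_WR = 0x2000;
constexpr std::uint32_t OP_READ = MODE_RD | 1;
constexpr std::uint32_t OP_WRITE = MODE_WR | 1;
constexpr std::uint32_t OP_CALL = MODE_RD | 2;  // code must stay fixed for wire compatibility

// The one place that decides whether an op has plain-read effects.
bool sketch_op_mode_read(std::uint32_t op) {
  return (op & MODE_RD) != 0 && op != OP_CALL;
}
```

Keeping the op code unchanged and moving the exception into the helper is what preserves compatibility with old clients and servers.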

12 years ago Revert "OSD: remove RD flag from CALL ops"
Sage Weil [Sat, 5 Jan 2013 04:46:48 +0000 (20:46 -0800)]
Revert "OSD: remove RD flag from CALL ops"

This reverts commit 91e941aef9f55425cc12204146f26d79c444cfae.

We cannot change this op code without breaking compatibility
with old code (client and server).  We'll have to special case
this op code instead.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit d3abd0fe0bb402ff403259d4b1a718a56331fc39)

12 years ago ReplicatedPG: remove old-head optimization from push_to_replica
Samuel Just [Fri, 4 Jan 2013 20:43:52 +0000 (12:43 -0800)]
ReplicatedPG: remove old-head optimization from push_to_replica

This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone.  In general, using head in clone_subsets is tricky since
we might be writing to head during the push.  calc_clone_subsets does not
consider head (probably for this reason).  Handling the clone from head case
properly would require blocking writes on head in the interim which is probably
a bad trade off anyway.

Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.

Fixes: #3698
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e89b6ade63cdad315ab754789de24008cfe42b37)

12 years ago os/FileStore: fix non-btrfs op_seq commit order
Sage Weil [Fri, 4 Jan 2013 01:15:07 +0000 (17:15 -0800)]
os/FileStore: fix non-btrfs op_seq commit order

The op_seq file is the starting point for journal replay.  For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap.  We normally ignore current/ contents anyway.

On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).

This fixes a serious bug that could cause data loss and corruption after
a power loss event.  For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.

Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 28d59d374b28629a230d36b93e60a8474c902aa5)
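
Note (illustrative sketch, not part of the commit): the non-btrfs ordering described above is "full sync first, then write op_seq, then fsync(2) it, and only then trim the journal". The helper below covers the middle two steps with plain POSIX calls; its name and the text format of the file are assumptions, and the caller is assumed to have completed the full sync beforehand.

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Write the sequence number to `path` and fsync it before returning.
// Returns 0 on success, -1 on any failure.
int write_op_seq_durably(const char* path, std::uint64_t seq) {
  char buf[32];
  int len = std::snprintf(buf, sizeof(buf), "%llu\n",
                          static_cast<unsigned long long>(seq));
  if (len < 0 || len >= static_cast<int>(sizeof(buf)))
    return -1;
  int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0)
    return -1;
  int r = 0;
  if (::write(fd, buf, static_cast<size_t>(len)) != len)
    r = -1;
  if (r == 0 && ::fsync(fd) != 0)  // must be durable before any journal trim
    r = -1;
  if (::close(fd) != 0)
    r = -1;
  return r;
}
```

Only after this returns 0 would it be safe, in this sketch, to discard journal entries at or below `seq`.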

12 years ago OSD: for old osds, dispatch peering messages immediately
Samuel Just [Thu, 3 Jan 2013 17:59:45 +0000 (09:59 -0800)]
OSD: for old osds, dispatch peering messages immediately

Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message.  However, old osds assume
that the activation message (log or info) will be _dispatched
before the first sub_op_modify of the interval.  Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4ae4dce5c5bb547c1ff54d07c8b70d287490cae9)

12 years ago osd: move common active vs booting code into consume_map
Sage Weil [Thu, 3 Jan 2013 06:38:53 +0000 (22:38 -0800)]
osd: move common active vs booting code into consume_map

Push osdmaps to PGs in separate method from activate_map() (whose name
is becoming less and less accurate).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a32d6c5dca081dcd8266f4ab51581ed6b2755685)

12 years ago osd: let pgs process map advances before booting
Sage Weil [Thu, 3 Jan 2013 06:20:06 +0000 (22:20 -0800)]
osd: let pgs process map advances before booting

The OSD deliberately consumes and processes most OSDMaps from while it
was down before it marks itself up, as this can be slow.  The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD.  The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.

Fix this by calling activate_map() even when booting and, when booting,
draining the peering_wq on each call.  This is harmless since we are
not yet processing actual ops; we only need to be async when active.

Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0bfad8ef2040a0dd4a0dc1d3abf3ab5b2019d179)

12 years ago log: broadcast cond signals
Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]
log: broadcast cond signals

We were using a single cond, and only signalling one waiter.  That means
that if the flusher and several logging threads are waiting, and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.

Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.

Instead, break the single cond into two: one for loggers, and one for the
flusher.  Always signal the (one) flusher, and always broadcast to all
loggers.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 813787af3dbb99e42f481af670c4bb0e254e4432)
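
Note (illustrative sketch, not part of the commit): the two-condition scheme can be pictured with standard C++ primitives. The class `BoundedLog` and its members are hypothetical and far simpler than Ceph's logger; the point is only that loggers and the flusher wait on separate conditions, the single flusher is signalled, and loggers are woken by broadcast.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class BoundedLog {
 public:
  explicit BoundedLog(std::size_t max) : max_(max) {}

  // Called by logging threads; blocks while the queue is full.
  void submit(std::string entry) {
    std::unique_lock<std::mutex> l(lock_);
    while (queue_.size() >= max_)
      cond_loggers_.wait(l);
    queue_.push_back(std::move(entry));
    cond_flusher_.notify_one();  // there is exactly one flusher
  }

  // Called by the flusher; blocks until at least one entry is queued.
  void wait_for_work() {
    std::unique_lock<std::mutex> l(lock_);
    while (queue_.empty())
      cond_flusher_.wait(l);
  }

  // Called by the flusher; takes everything queued so far.
  std::vector<std::string> flush() {
    std::unique_lock<std::mutex> l(lock_);
    std::vector<std::string> out(queue_.begin(), queue_.end());
    queue_.clear();
    cond_loggers_.notify_all();  // wake every blocked logger, not just one
    return out;
  }

 private:
  std::size_t max_;
  std::mutex lock_;
  std::condition_variable cond_loggers_;  // loggers wait here for space
  std::condition_variable cond_flusher_;  // the flusher waits here for work
  std::deque<std::string> queue_;
};
```

With a single shared condition, a `notify_one()` from a logger could wake another logger instead of the flusher; splitting the conditions removes that possibility.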

12 years ago log: fix locking typo/stupid for dump_recent()
Sage Weil [Wed, 2 Jan 2013 21:58:44 +0000 (13:58 -0800)]
log: fix locking typo/stupid for dump_recent()

We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 43cba617aa0247d714632bddf31b9271ef3a1b50)

12 years ago test_filejournal: optionally specify journal filename as an argument
Sage Weil [Sat, 29 Dec 2012 00:48:22 +0000 (16:48 -0800)]
test_filejournal: optionally specify journal filename as an argument

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 483c6f76adf960017614a8641c4dcdbd7902ce33)

12 years ago test_filejournal: test journaling bl with >IOV_MAX segments
Sage Weil [Sat, 29 Dec 2012 00:48:05 +0000 (16:48 -0800)]
test_filejournal: test journaling bl with >IOV_MAX segments

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c461e7fc1e34fdddd8ff8833693d067451df906b)

12 years ago os/FileJournal: limit size of aio submission
Sage Weil [Sat, 29 Dec 2012 00:47:28 +0000 (16:47 -0800)]
os/FileJournal: limit size of aio submission

Limit size of each aio submission to IOV_MAX-1 (to be safe).  Take care to
only mark the last aio with the seq to signal completion.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit dda7b651895ab392db08e98bf621768fd77540f0)
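
Note (illustrative sketch, not part of the commit): splitting a long segment list into bounded submissions, with only the final one carrying the completion seq, might look like the following. `AioBatch`, `split_into_batches`, and the convention that `seq == 0` means "no completion signal" are assumptions for the example; a caller would pass `IOV_MAX - 1` as `max_per_batch`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AioBatch {
  std::size_t first_segment;  // index of the first segment in this batch
  std::size_t num_segments;   // never more than max_per_batch
  std::uint64_t seq;          // non-zero only on the final batch
};

std::vector<AioBatch> split_into_batches(std::size_t total_segments,
                                         std::size_t max_per_batch,
                                         std::uint64_t seq) {
  std::vector<AioBatch> batches;
  if (max_per_batch == 0)
    return batches;
  std::size_t pos = 0;
  while (pos < total_segments) {
    std::size_t n = std::min(max_per_batch, total_segments - pos);
    bool last = (pos + n == total_segments);
    AioBatch b{pos, n, last ? seq : std::uint64_t(0)};
    batches.push_back(b);
    pos += n;
  }
  return batches;
}
```

For 2500 segments and a limit of 1023, the sketch yields batches of 1023, 1023, and 454 segments, and only the third carries the seq.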

12 years ago os/FileJournal: logger is optional
Sage Weil [Fri, 28 Dec 2012 23:44:51 +0000 (15:44 -0800)]
os/FileJournal: logger is optional

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 076b418c7f03c5c62f811fdc566e4e2b776389b7)

12 years ago v0.56 v0.56
Gary Lowell [Tue, 1 Jan 2013 01:10:11 +0000 (17:10 -0800)]
v0.56

12 years ago Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next
Sage Weil [Sun, 30 Dec 2012 23:29:37 +0000 (15:29 -0800)]
Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years ago doc: fix rbd permissions for unprotect
Josh Durgin [Sun, 30 Dec 2012 07:57:01 +0000 (23:57 -0800)]
doc: fix rbd permissions for unprotect

Unprotect examines all pools, so use blanket x before 0.54. After
that, use class-read restricted by object_prefix to rbd_children.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago librbd: fix race between unprotect and clone
Josh Durgin [Sun, 30 Dec 2012 04:35:15 +0000 (20:35 -0800)]
librbd: fix race between unprotect and clone

Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago rbd: open (source) image as read-only
Josh Durgin [Sun, 30 Dec 2012 04:26:57 +0000 (20:26 -0800)]
rbd: open (source) image as read-only

This allows users without write access to copy, export and list
information about an image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago librbd: open parent as read-only during clone
Josh Durgin [Sat, 29 Dec 2012 06:13:37 +0000 (22:13 -0800)]
librbd: open parent as read-only during clone

We never write to the parent, and don't need to watch it during this process.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago librbd: add {rbd_}open_read_only()
Josh Durgin [Sat, 29 Dec 2012 00:54:51 +0000 (16:54 -0800)]
librbd: add {rbd_}open_read_only()

Since 58890cfad5f7bee933baa599a68e6c65993379d4, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.

For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago OSD: remove RD flag from CALL ops
Josh Durgin [Sat, 29 Dec 2012 03:47:09 +0000 (19:47 -0800)]
OSD: remove RD flag from CALL ops

20496b8d2b2c3779a771695c6f778abbdb66d92a forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago cls_rbd: get_children does not need write permission
Josh Durgin [Sat, 29 Dec 2012 03:44:36 +0000 (19:44 -0800)]
cls_rbd: get_children does not need write permission

This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago init-ceph: ok, 8K files
Sage Weil [Sat, 29 Dec 2012 01:12:06 +0000 (17:12 -0800)]
init-ceph: ok, 8K files

16K might be a bit many.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago msg/Pipe: remove broken cephx signing requirement check
Sage Weil [Fri, 28 Dec 2012 00:01:49 +0000 (16:01 -0800)]
msg/Pipe: remove broken cephx signing requirement check

Remove the special-case check, which does not inform the peer what
protocol features are missing.  It also enforces this requirement even
when we negotiate auth none.

Reported as part of bug #3657.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago msg/Pipe: include remote socket addr in debug output
Sage Weil [Sat, 29 Dec 2012 00:00:47 +0000 (16:00 -0800)]
msg/Pipe: include remote socket addr in debug output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago osd: allow RecoveryDone self-transition in RepNotRecovering
Sage Weil [Fri, 28 Dec 2012 19:34:47 +0000 (11:34 -0800)]
osd: allow RecoveryDone self-transition in RepNotRecovering

In a mixed cluster where some OSDs support the recovery reservations and
some don't, the replica may be new code in RepNotRecovering and will
complete a backfill.  In that case, we want to just stay in
RepNotRecovering.

It may also be possible to make it infer what the primary is doing even
though it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.

Fixes: #3689
Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago osd: less noise about inefficient tmap updates
Sage Weil [Fri, 28 Dec 2012 20:34:15 +0000 (12:34 -0800)]
osd: less noise about inefficient tmap updates

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago init-ceph: default to 16K max_open_files
Sage Weil [Fri, 28 Dec 2012 20:11:55 +0000 (12:11 -0800)]
init-ceph: default to 16K max_open_files

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago rgw: disable ops and usage logging by default
Sage Weil [Thu, 27 Dec 2012 21:27:46 +0000 (13:27 -0800)]
rgw: disable ops and usage logging by default

Most users don't need this, and having it on will just fill their clusters
with objects that will need to be cleaned up later.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years ago java: remove deprecated libcephfs
Noah Watkins [Thu, 27 Dec 2012 20:06:02 +0000 (12:06 -0800)]
java: remove deprecated libcephfs

Removes ceph_set_default_*

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years ago init-ceph: fix status version check across machines
Sage Weil [Fri, 28 Dec 2012 00:06:24 +0000 (16:06 -0800)]
init-ceph: fix status version check across machines

The local state isn't propagated into the backtick shell, resulting in
'unknown' for all remote daemons.  Avoid backticks altogether.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago osd: fix recovery assert for pg repair case
Sage Weil [Wed, 26 Dec 2012 23:27:07 +0000 (15:27 -0800)]
osd: fix recovery assert for pg repair case

In the case of PG repair, this assert is not valid.  Disable it for now.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago dropping xfs test 186 due to bug: 3685
tamil [Thu, 27 Dec 2012 19:27:31 +0000 (11:27 -0800)]
dropping xfs test 186 due to bug: 3685

Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
12 years ago osd: drop 'osd recovery max active' back to previous default (5)
Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]
osd: drop 'osd recovery max active' back to previous default (5)

Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high.  In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).

Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago journal: reduce journal max queue size
Sage Weil [Thu, 27 Dec 2012 19:11:08 +0000 (11:11 -0800)]
journal: reduce journal max queue size

Keep the journal queue size smaller than the filestore queue size.

Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago osd: fix dup failure cancellations
Sage Weil [Sun, 23 Dec 2012 23:17:12 +0000 (15:17 -0800)]
osd: fix dup failure cancellations

If we had a pending failure report, and send a cancellation, take it
out of our pending list so that we don't keep resending cancellations.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago osd: make MOSDFailure output more sensible
Sage Weil [Sun, 23 Dec 2012 23:16:06 +0000 (15:16 -0800)]
osd: make MOSDFailure output more sensible

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago mon: make osd failure report log msgs sensible
Sage Weil [Sun, 23 Dec 2012 23:11:39 +0000 (15:11 -0800)]
mon: make osd failure report log msgs sensible

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago Merge branch 'wip-scrub' into next
Sage Weil [Sun, 23 Dec 2012 22:42:51 +0000 (14:42 -0800)]
Merge branch 'wip-scrub' into next

Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
src/osd/PG.cc

12 years ago monclient: fix get_monmap_privately retry interval
Sage Weil [Sun, 23 Dec 2012 21:29:08 +0000 (13:29 -0800)]
monclient: fix get_monmap_privately retry interval

Use mon_client_hunt_interval (default 3) instead of hardcoding 1 second.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago Makefile: fix 'base' rule
Sage Weil [Sun, 23 Dec 2012 04:56:45 +0000 (20:56 -0800)]
Makefile: fix 'base' rule

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago init-ceph,mkcephfs: default inode64 for mounting xfs
Sage Weil [Sun, 23 Dec 2012 19:18:45 +0000 (11:18 -0800)]
init-ceph,mkcephfs: default inode64 for mounting xfs

According to hch this is now the default for new kernels.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago init-ceph: default osd_data path
Sage Weil [Sat, 22 Dec 2012 19:10:03 +0000 (11:10 -0800)]
init-ceph: default osd_data path

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago OSD: always do a deep scrub when repairing
Samuel Just [Sat, 22 Dec 2012 01:21:59 +0000 (17:21 -0800)]
OSD: always do a deep scrub when repairing

Otherwise, errors turned up in a deep-scrub will be
swept under the rug without being repaired.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years ago PG: don't use a self-transition for WaitRemoteRecoveryReserved
Samuel Just [Sat, 22 Dec 2012 00:51:40 +0000 (16:51 -0800)]
PG: don't use a self-transition for WaitRemoteRecoveryReserved

Previously, using the state on active worked, but now we might
go back through WaitRemoteRecoveryReserved without resetting
Active.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years ago PG: Handle repair once in scrub_finish
Samuel Just [Fri, 21 Dec 2012 23:39:50 +0000 (15:39 -0800)]
PG: Handle repair once in scrub_finish

We don't want to change missing sets during a chunky
scrub since it would cause !is_clean() and derail
the rest of the scrub.  Instead, move the missing,
inconsistent, and authoritative sets into scrubber
and add to them during scrub_compare_maps().  Then,
handle repairing objects all at once in scrub_finish().

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years ago import_export.sh: sparse import export
Dan Mick [Fri, 21 Dec 2012 03:53:07 +0000 (19:53 -0800)]
import_export.sh: sparse import export

Add tests for:
   - sparse import makes expected sparse images
   - sparse export makes expected sparse files
   - sparse import from stdin also creates sparse images
   - import from partially-sparse file leads to partially-sparse image
   - import from stdin with zeros leads to sparse
   - export from zeros-image to file leads to sparse file

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago rbd: harder-working sparse import from stdin
Dan Mick [Sat, 8 Dec 2012 06:57:06 +0000 (22:57 -0800)]
rbd: harder-working sparse import from stdin

Try to accumulate image-sized blocks when importing from stdin, even if
each read is shorter than requested; if we get a full block, and it's
all zeroes, we can seek and make a sparse output file

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago rbd: check for all-zero buf in export, seek output if so
Dan Mick [Thu, 20 Dec 2012 22:00:12 +0000 (14:00 -0800)]
rbd: check for all-zero buf in export, seek output if so

Use buf_is_zero in common/util.cc

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
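
Note (illustrative sketch, not part of the commit): the zero-detection idea behind sparse export and import can be shown as a per-block check that decides between seeking and writing. `sketch_buf_is_zero` and `plan_sparse_blocks` are hypothetical stand-ins written for this example; they are not the `buf_is_zero` implementation in common/util.cc.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// True if every byte in buf[0..len) is zero.
bool sketch_buf_is_zero(const char* buf, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    if (buf[i] != 0)
      return false;
  return true;
}

// For each block_size chunk of data, true means "seek past it instead of
// writing", which is what leaves a hole in the output.
std::vector<bool> plan_sparse_blocks(const std::vector<char>& data,
                                     std::size_t block_size) {
  std::vector<bool> skip;
  if (block_size == 0)
    return skip;
  for (std::size_t off = 0; off < data.size(); off += block_size) {
    std::size_t n = std::min(block_size, data.size() - off);
    skip.push_back(sketch_buf_is_zero(data.data() + off, n));
  }
  return skip;
}
```

A block is skipped only when it is entirely zero, so a partially zero block is still written in full.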
12 years ago librbd: move buf_is_zero() to new common/util.cc and include/util.h
Dan Mick [Thu, 20 Dec 2012 21:58:55 +0000 (13:58 -0800)]
librbd: move buf_is_zero() to new common/util.cc and include/util.h

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years ago osd: fix pg stat msgs vs timeout
Sage Weil [Sat, 22 Dec 2012 00:47:50 +0000 (16:47 -0800)]
osd: fix pg stat msgs vs timeout

We can get a pattern like so:

- new mon session
- after say 120 seconds, we decide to send a stats msg
- outstanding_pg_stats is finally true, we immediately time out (30 second
  grace), and reconnect to a new mon
-> repeat

The problem is that we don't reset the last_sent timestamp when we send.
Or that we do this check after sending instead of before.  Fix both.

This should resolve the issue #3661 where osds that don't have pgs
updating are not sending stats messages to the mon to check in, and are
eventually getting marked down as a result.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
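
Note (illustrative sketch, not part of the commit): the two fixes named above (check for a timeout before sending, and reset the last-sent timestamp when sending) can be modelled with a small state machine. `StatsReporter`, `Action`, and the use of plain seconds as `double` are assumptions for the example; only the 30 second grace and the 120 second scenario come from the commit message.

```cpp
enum class Action { None, Sent, Reconnect };

struct StatsReporter {
  double grace;             // seconds to wait for an ack
  double last_sent = 0.0;   // time of the most recent send
  bool outstanding = false; // true while a send is unacknowledged

  explicit StatsReporter(double g) : grace(g) {}

  void ack() { outstanding = false; }

  Action tick(double now, bool have_stats) {
    // Check for a timeout *before* sending anything new.
    if (outstanding && now - last_sent > grace) {
      outstanding = false;
      return Action::Reconnect;
    }
    if (have_stats) {
      last_sent = now;  // reset the timestamp at send time
      outstanding = true;
      return Action::Sent;
    }
    return Action::None;
  }
};
```

In the scenario from the message, a first send at 120 seconds now starts the grace period at 120 rather than at 0, so it does not time out immediately.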
12 years ago PG::scrub_compare_maps increment scrubber.fixed for missing repairs
Samuel Just [Fri, 21 Dec 2012 23:20:22 +0000 (15:20 -0800)]
PG::scrub_compare_maps increment scrubber.fixed for missing repairs

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years ago PG::_compare_scrubmaps: increment scrubber.errors on missing object
Samuel Just [Fri, 21 Dec 2012 23:16:19 +0000 (15:16 -0800)]
PG::_compare_scrubmaps: increment scrubber.errors on missing object

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years ago mkcephfs: error out if 'devs' defined but 'osd fs type' not defined
Sage Weil [Fri, 21 Dec 2012 22:23:14 +0000 (14:23 -0800)]
mkcephfs: error out if 'devs' defined but 'osd fs type' not defined

We can infer btrfs if they use 'btrfs devs', but if they use 'devs' there
is no default fs.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-scrub' into next
Sage Weil [Fri, 21 Dec 2012 21:56:16 +0000 (13:56 -0800)]
Merge remote-tracking branch 'gh/wip-scrub' into next

12 years agoMerge remote-tracking branch 'gh/wip-3643' into next
Sage Weil [Fri, 21 Dec 2012 21:45:39 +0000 (13:45 -0800)]
Merge remote-tracking branch 'gh/wip-3643' into next

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agomonc: only warn about missing keyring if we fail to authenticate
Sage Weil [Fri, 21 Dec 2012 21:44:19 +0000 (13:44 -0800)]
monc: only warn about missing keyring if we fail to authenticate

This avoids the situation where a librados or other user with the default
of 'cephx,none' and no keyring authenticates against a cluster whose
required auth is 'none' and an annoying warning is generated every time.
Now we only print a helpful message if we actually failed.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: clear CLEAN on exit from Clean state
Sage Weil [Fri, 21 Dec 2012 19:44:35 +0000 (11:44 -0800)]
osd: clear CLEAN on exit from Clean state

This means we can drop the scrub repair state_clear() call.  We probably
can drop others, but let's leave that for another day.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoauth: use none auth if keyring not found
Yehuda Sadeh [Fri, 21 Dec 2012 20:14:40 +0000 (12:14 -0800)]
auth: use none auth if keyring not found

If both cephx and none are accepted auth methods and the
cephx keyring cannot be found, fall back to using
none instead of failing.
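
A minimal sketch of the fallback rule, assuming a hypothetical
choose_auth() helper that is not part of the Ceph auth code; the method
names are passed as plain strings purely for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: pick an auth method from the accepted set, given
// whether a keyring was found.  Returns "" if nothing is usable.
std::string choose_auth(const std::set<std::string>& accepted,
                        bool have_keyring)
{
  assert(!accepted.empty());
  if (accepted.count("cephx") && have_keyring)
    return "cephx";
  if (accepted.count("none"))
    return "none";   // fall back instead of failing
  return "";         // caller reports the authentication error
}
```

A client that accepts only cephx and has no keyring still fails, which is
the case where a warning about the missing keyring is useful.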

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoPG::sched_scrub: only set PG_STATE_DEEP_SCRUB once reserved
Samuel Just [Fri, 21 Dec 2012 19:36:04 +0000 (11:36 -0800)]
PG::sched_scrub: only set PG_STATE_DEEP_SCRUB once reserved

Otherwise we would have +DEEP before we have +SCRUB.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG::sched_scrub: return true if scrub newly kicked off
Samuel Just [Fri, 21 Dec 2012 19:33:45 +0000 (11:33 -0800)]
PG::sched_scrub: return true if scrub newly kicked off

The previous return value wasn't really what OSD::sched_scrub
wanted to know.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: allow transition from Clean -> WaitLocalRecoveryReserved for repair
Sage Weil [Fri, 21 Dec 2012 19:37:48 +0000 (11:37 -0800)]
osd: allow transition from Clean -> WaitLocalRecoveryReserved for repair

If we do a scrub repair, we need to go from clean to recovery again to
copy objects around.

This fixes a simple repair of a missing object, either on the primary or
replica.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoPG: in sched_scrub() set PG_STATE_DEEP_SCRUB not scrubber.deep
Samuel Just [Fri, 21 Dec 2012 19:17:23 +0000 (11:17 -0800)]
PG: in sched_scrub() set PG_STATE_DEEP_SCRUB not scrubber.deep

scrubber.deep gets reset in scrub() to match
state_test(PG_STATE_DEEP_SCRUB).

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: clear scrub state if queued scrub doesn't start
Sage Weil [Fri, 21 Dec 2012 06:01:34 +0000 (22:01 -0800)]
osd: clear scrub state if queued scrub doesn't start

We set SCRUBBING when we queue a pg for scrub.  If we dequeue and
call scrub() but abort for some reason (!active, degraded, etc.), clear
that state bit.

Bug is easily reproduced with 'ceph osd scrub N' during cluster startup
when PGs are peering; some PGs can get left in the scrubbing state.
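
Roughly, the fix has this shape.  The sketch below is a toy model: PGModel
and the STATE_* bits are hypothetical stand-ins for the real PG class and
PG_STATE_* flags, and the scrub itself is treated as finishing
immediately.

```cpp
#include <cassert>

// Hypothetical state bits; the real PG_STATE_* values are not reproduced.
enum : unsigned {
  STATE_ACTIVE    = 1u << 0,
  STATE_DEGRADED  = 1u << 1,
  STATE_SCRUBBING = 1u << 2,
};

struct PGModel {
  unsigned state = 0;

  // The scrubbing bit is set when the pg is queued for scrub.
  void queue_scrub() { state |= STATE_SCRUBBING; }

  // Called when the pg is dequeued.  Returns true if the scrub ran.
  bool scrub() {
    assert(state & STATE_SCRUBBING);
    if (!(state & STATE_ACTIVE) || (state & STATE_DEGRADED)) {
      state &= ~STATE_SCRUBBING;   // aborted: clear the bit we set earlier
      return false;
    }
    // ... scrub work would happen here ...
    state &= ~STATE_SCRUBBING;
    return true;
  }
};
```

Without the clear on the abort path, a pg queued while still peering would
keep the scrubbing bit set even though no scrub is running.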

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: only dec_scrubs_active if we were active
Sage Weil [Fri, 21 Dec 2012 05:45:09 +0000 (21:45 -0800)]
osd: only dec_scrubs_active if we were active

This fixes a bug that puts scrubs_active negative.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: reintroduce inc_scrubs_active helper
Sage Weil [Fri, 21 Dec 2012 05:44:34 +0000 (21:44 -0800)]
osd: reintroduce inc_scrubs_active helper

This mostly generates nice debug output.  It also slightly simplifies
code and makes things symmetric.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'upstream/wip_notify' into next
Samuel Just [Fri, 21 Dec 2012 00:23:23 +0000 (16:23 -0800)]
Merge remote-tracking branch 'upstream/wip_notify' into next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agocephtool: mention ceph osd ls, fix ceph osd tell N bench
Dan Mick [Thu, 20 Dec 2012 23:31:21 +0000 (15:31 -0800)]
cephtool: mention ceph osd ls, fix ceph osd tell N bench

Add ceph osd ls to help; make help for ceph osd tell N bench look
more like injectargs, which says <osd-id or *> to make it clear you
can benchmark all osds simultaneously.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agorgw: remove noisy log message
Yehuda Sadeh [Thu, 20 Dec 2012 23:32:59 +0000 (15:32 -0800)]
rgw: remove noisy log message

No need for that log message.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix daemonize initialization
Yehuda Sadeh [Thu, 20 Dec 2012 23:21:48 +0000 (15:21 -0800)]
rgw: fix daemonize initialization

Just call the common daemonize function. Otherwise we end up
not initializing stdout / stderr correctly.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agolog: fix flush/signal race
Sage Weil [Thu, 20 Dec 2012 21:48:06 +0000 (13:48 -0800)]
log: fix flush/signal race

We need to signal the cond in the same interval where we hold the lock
*and* modify the queue.  Otherwise, we can have a race like:

 queue has 1 item, max is 1.
 A: enter submit_entry, signal cond, wait on condition
 B: enter submit_entry, signal cond, wait on condition
 C: flush wakes up, flushes 1 previous item
 A: retakes lock, enqueues something, exits
 B: retakes lock, condition fails, waits
  -> C is never woken up as there are 2 items waiting
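
The rule can be illustrated with a small bounded queue.  This is a sketch
using the C++ standard library, not the Ceph Log class: BoundedLogQueue
and its two condition variables are assumptions made to keep the example
short.  The point it shows is that submit_entry() signals the flusher
after enqueueing, without dropping the lock in between.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

class BoundedLogQueue {
  std::mutex lock;
  std::condition_variable cond_loggers;  // submitters wait here for space
  std::condition_variable cond_flusher;  // flusher waits here for entries
  std::queue<std::string> q;
  const size_t max;

public:
  explicit BoundedLogQueue(size_t m) : max(m) { assert(m > 0); }

  void submit_entry(std::string e) {
    std::unique_lock<std::mutex> l(lock);
    cond_loggers.wait(l, [this] { return q.size() < max; });
    q.push(std::move(e));
    cond_flusher.notify_one();  // signal after modifying, lock still held
  }

  // Drain everything queued; blocks until at least one entry exists.
  std::vector<std::string> flush() {
    std::unique_lock<std::mutex> l(lock);
    cond_flusher.wait(l, [this] { return !q.empty(); });
    std::vector<std::string> out;
    while (!q.empty()) {
      out.push_back(std::move(q.front()));
      q.pop();
    }
    cond_loggers.notify_all();  // space is available again
    return out;
  }
};
```

Signalling before waiting for space, as in the race above, lets the
flusher consume the wakeup before the new entry is in the queue.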

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoReplicatedPG::remove_notify : don't leak the notify object
Samuel Just [Thu, 20 Dec 2012 21:29:09 +0000 (13:29 -0800)]
ReplicatedPG::remove_notify : don't leak the notify object

Following remove_notify, there are no other references to
notif, so delete it.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD,ReplicatedPG: do not track notifies on the session
Samuel Just [Thu, 20 Dec 2012 21:23:27 +0000 (13:23 -0800)]
OSD,ReplicatedPG: do not track notifies on the session

handle_notify_timeout and remove_notify currently do not clean up this
state, leaving dangling Notification*.  Further, we only use this mapping
in unwatch in order to determine which notifies to update. We can
accomplish the same thing by iterating through the obc->notifs mapping
since all notifications relevant for a given watch would have been for
the same obc as the watch.
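
A rough sketch of the per-object iteration, under heavy assumptions: the
types below are hypothetical models, not the ReplicatedPG structures, and
"updating" a notify is modelled as removing the departing watcher from its
pending set.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical models of a notification and an object context.
struct Notification {
  uint64_t id;
  std::set<uint64_t> pending_watchers;  // watchers that have not acked
};

struct ObjectContextModel {
  std::map<uint64_t, Notification> notifs;  // keyed by notification id
};

// On unwatch, walk the notifications on this object only.  Returns the
// number of notifications that completed because nobody is left to ack.
int unwatch(ObjectContextModel& obc, uint64_t watcher)
{
  int completed = 0;
  for (auto it = obc.notifs.begin(); it != obc.notifs.end(); ) {
    assert(it->second.id == it->first);
    it->second.pending_watchers.erase(watcher);
    if (it->second.pending_watchers.empty()) {
      it = obc.notifs.erase(it);   // erase returns the next iterator
      ++completed;
    } else {
      ++it;
    }
  }
  return completed;
}
```

Because the notifications live with the object, there is no second
per-session index that can be left pointing at freed entries.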

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-cephtool' into next
Sage Weil [Thu, 20 Dec 2012 19:04:29 +0000 (11:04 -0800)]
Merge remote-tracking branch 'gh/wip-cephtool' into next

12 years agoMerge branch 'wip-build-fixes' into next
Sage Weil [Thu, 20 Dec 2012 18:49:34 +0000 (10:49 -0800)]
Merge branch 'wip-build-fixes' into next

12 years agorgw: configurable exit timeout
Yehuda Sadeh [Tue, 18 Dec 2012 21:53:09 +0000 (13:53 -0800)]
rgw: configurable exit timeout

Fixes: #3638
rgw exit timeout secs : number of seconds to wait for process
to exit cleanly before forcing exit. If set to 0, it'll wait
indefinitely.
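
The semantics of the option can be written down as a small predicate.
should_force_exit() is a hypothetical helper for illustration only; how
rgw actually arms the timeout is not shown here.

```cpp
#include <cassert>

// Hypothetical helper: true once a clean exit has taken too long.
// exit_timeout_secs models the 'rgw exit timeout secs' setting.
bool should_force_exit(int exit_timeout_secs, int waited_secs)
{
  assert(exit_timeout_secs >= 0 && waited_secs >= 0);
  if (exit_timeout_secs == 0)
    return false;   // 0 means wait indefinitely
  return waited_secs >= exit_timeout_secs;
}
```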

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>