git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agorbd: support plain/json/xml output formatting
Stratos Psomadakis [Thu, 27 Dec 2012 00:14:39 +0000 (16:14 -0800)]
rbd: support plain/json/xml output formatting

This patch renames the --format option to --image-format, for
specifying the RBD image format, and uses --format to specify the
output formatting (to be consistent with the other ceph tools). To
avoid breaking backwards compatibility with existing scripts, rbd will
still accept --format [1|2] for the image format, but will print a
warning message, noting its use is deprecated.

The rbd subcommands that support the new --format option are: ls, info,
snap list, children, showmapped, lock list.

Signed-off-by: Stratos Psomadakis <psomas@grnet.gr>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
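
The option handling described in the rbd commit above can be sketched
roughly as follows. This is a minimal illustration in Python, not the rbd
implementation: the function name parse_rbd_args, the "plain" default, and
the exact warning text are assumptions made for the example.

```python
import argparse
import sys


def parse_rbd_args(argv):
    """Parse a tiny, hypothetical subset of rbd-style options.

    Returns (image_format, output_format).
    """
    parser = argparse.ArgumentParser(prog="rbd-sketch")
    parser.add_argument("--image-format", choices=["1", "2"], default=None)
    parser.add_argument("--format", dest="fmt", default=None)
    args = parser.parse_args(argv)

    image_format = args.image_format
    output_format = "plain"  # assumed default for this sketch

    if args.fmt in ("1", "2"):
        # Legacy spelling: still accepted as the image format, with a warning.
        print("warning: --format 1|2 for the image format is deprecated; "
              "use --image-format", file=sys.stderr)
        if image_format is None:
            image_format = args.fmt
    elif args.fmt in ("plain", "json", "xml"):
        output_format = args.fmt
    elif args.fmt is not None:
        parser.error("unknown format: %s" % args.fmt)

    return image_format, output_format
```

With this sketch, old scripts that pass --format 2 keep working, while new
callers can combine --image-format 2 with --format json.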
12 years agomon: OSDMonitor: only share osdmap with up OSDs
Joao Eduardo Luis [Sat, 12 Jan 2013 01:06:36 +0000 (01:06 +0000)]
mon: OSDMonitor: only share osdmap with up OSDs

Try to share the map with a randomly picked OSD; if the picked OSD is
not 'up', then try to find the nearest 'up' OSD in the map by doing a
backward and a forward linear search on the map -- this would be O(n) in
the worst case scenario, as we only do a single iteration starting on the
picked position, incrementing and decrementing two different iterators
until we find an appropriate OSD or we exhaust the map.

Fixes: #3629
Backport: bobtail

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
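
The bidirectional search described in the OSDMonitor commit above might
look like the following sketch. It models the map as a plain Python list of
booleans (True meaning 'up'), which is an assumption for illustration; the
function name and the forward-first tie-break are also hypothetical.

```python
import random


def find_nearest_up(osds, start):
    """Return the index of the 'up' entry nearest to start, or None.

    osds is a list of booleans (True = up). Two cursors walk outward from
    start, one forward and one backward, so each entry is visited at most
    once and the whole search is O(n).
    """
    n = len(osds)
    if n == 0:
        return None
    if osds[start]:
        return start
    back, fwd = start - 1, start + 1
    while back >= 0 or fwd < n:
        if fwd < n:
            if osds[fwd]:
                return fwd
            fwd += 1
        if back >= 0:
            if osds[back]:
                return back
            back -= 1
    return None  # map exhausted without finding an 'up' OSD


def pick_osd(osds, rng=random):
    """Pick a random position, then fall back to the nearest 'up' OSD."""
    if not osds:
        return None
    return find_nearest_up(osds, rng.randrange(len(osds)))
```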
12 years agoqa/run_xfstests.sh: use cloned xfstests repository
Alex Elder [Fri, 11 Jan 2013 18:49:36 +0000 (12:49 -0600)]
qa/run_xfstests.sh: use cloned xfstests repository

Use our own copy of the xfstests repository rather than hitting
the upstream one repeatedly.

Signed-off-by: Alex Elder <elder@inktank.com>
12 years agorados: add truncate support
Samuel Just [Fri, 4 Jan 2013 05:13:44 +0000 (21:13 -0800)]
rados: add truncate support

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoReplicatedPG: fix snapdir trimming
Samuel Just [Thu, 10 Jan 2013 03:17:23 +0000 (19:17 -0800)]
ReplicatedPG: fix snapdir trimming

The previous logic was both complicated and not correct.  Consequently,
we have been tending to drop snapcollection links in some cases.  This
has resulted in clones incorrectly not being trimmed.  This patch
replaces the logic with something less efficient but hopefully a bit
clearer.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoRevert "rgw: fix handler leak in handle_request"
Yehuda Sadeh [Thu, 10 Jan 2013 18:14:11 +0000 (10:14 -0800)]
Revert "rgw: fix handler leak in handle_request"

This reverts commit eba314a811cd98a79f483dc7a9128fe76c722c78.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: Fix crash when FastCGI frontend doesn't set SCRIPT_URI
Sylvain Munaut [Mon, 7 Jan 2013 12:13:49 +0000 (13:13 +0100)]
rgw: Fix crash when FastCGI frontend doesn't set SCRIPT_URI

Fixes: #3735
Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agorgw: fix handler leak in handle_request
caleb miles [Tue, 8 Jan 2013 20:56:00 +0000 (15:56 -0500)]
rgw: fix handler leak in handle_request

Fixes: #3682
Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: Allow get_lock_info to fail
Dan Mick [Tue, 8 Jan 2013 19:21:22 +0000 (11:21 -0800)]
librbd: Allow get_lock_info to fail

If the lock class isn't present, EOPNOTSUPP is returned for lock calls
on newer OSDs, but sadly EIO on older; we need to treat both as
acceptable failures for RBD images.  rados lock list will still fail.

Fixes #3744.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
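
A rough sketch of the error handling described in the librbd commit above:
both EOPNOTSUPP and EIO are treated as "no lock information available"
rather than as hard failures. The callable fetch_lock_info and the
(ret, info) return convention are hypothetical stand-ins, not the librbd
API.

```python
import errno

# Errors treated as acceptable when the lock class is missing on the OSD:
# EOPNOTSUPP from newer OSDs, EIO from older ones.
ACCEPTABLE_LOCK_ERRORS = {errno.EOPNOTSUPP, errno.EIO}


def get_lock_info_tolerant(fetch_lock_info):
    """Call fetch_lock_info() and swallow the acceptable failures.

    fetch_lock_info is a hypothetical callable returning (ret, info),
    where ret is 0 on success or a negative errno value on failure.
    """
    ret, info = fetch_lock_info()
    if ret < 0 and -ret in ACCEPTABLE_LOCK_ERRORS:
        return 0, None  # carry on without lock information
    return ret, info
```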
12 years agoPG: set DEGRADED in Active AdvMap handler based on pool size
Samuel Just [Mon, 7 Jan 2013 23:02:34 +0000 (15:02 -0800)]
PG: set DEGRADED in Active AdvMap handler based on pool size

Otherwise, if the acting set does not change, the pg might
not show up as degraded if the pool size now exceeds the
acting set size.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
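
The check described in the PG commit above can be illustrated as follows.
The set-based state and the function names are assumptions for the example,
as is clearing the flag once the acting set is large enough; the point is
only that the flag is recomputed on every map advance, even when the acting
set itself is unchanged.

```python
def is_degraded(acting, pool_size):
    """A PG is degraded when the acting set is smaller than the pool size."""
    return len(acting) < pool_size


def on_advance_map(state, acting, pool_size):
    """Recompute the DEGRADED flag on each map advance (hypothetical handler).

    state is a set of flag names; a new set is returned.
    """
    state = set(state)
    if is_degraded(acting, pool_size):
        state.add("DEGRADED")
    else:
        state.discard("DEGRADED")
    return state
```

For example, a PG with acting set [0, 1] is clean at pool size 2 but
becomes degraded as soon as a new map raises the pool size to 3.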
12 years agoMerge branch 'wip-3678-b' into next
Sage Weil [Mon, 7 Jan 2013 21:04:13 +0000 (13:04 -0800)]
Merge branch 'wip-3678-b' into next

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agomsg/Pipe: prepare Message data for wire under pipe_lock
Sage Weil [Sun, 6 Jan 2013 16:38:27 +0000 (08:38 -0800)]
msg/Pipe: prepare Message data for wire under pipe_lock

We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.

Related to #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: update Message envelope in encode, not write_message
Sage Weil [Sun, 6 Jan 2013 16:33:01 +0000 (08:33 -0800)]
msgr: update Message envelope in encode, not write_message

Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message().  This removes most modifications from
Pipe::write_message().

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosdc/Objecter: fix linger_ops iterator invalidation on pool deletion
Sage Weil [Mon, 7 Jan 2013 20:58:39 +0000 (12:58 -0800)]
osdc/Objecter: fix linger_ops iterator invalidation on pool deletion

The call to check_linger_pool_dne() may unregister the linger request,
invalidating the iterator.  To avoid this, increment the iterator at
the top of the loop.

This mirrors the fix in 4bf9078286d58c2cd4e85cb8b31411220a377092 for
regular non-linger ops.

Fixes: #3734
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomsg/Pipe: encode message inside pipe_lock
Sage Weil [Sun, 6 Jan 2013 16:25:40 +0000 (08:25 -0800)]
msg/Pipe: encode message inside pipe_lock

This modifies bufferlists in the Message struct, and it is possible
for multiple instances of the Pipe to get references on the Message;
make sure they don't modify those bufferlists concurrently.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: associate sending msgs to con inside lock
Sage Weil [Sat, 5 Jan 2013 18:39:08 +0000 (10:39 -0800)]
msg/Pipe: associate sending msgs to con inside lock

Associate a sending message with the connection inside the pipe_lock.
This way if a racing thread tries to steal these messages it will
be sure to reset the con pointer *after* we do, so that the con
pointer is valid in encode_payload() (and later).

This may be part of #3678.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: fix msg leak in requeue_sent()
Sage Weil [Sat, 5 Jan 2013 17:29:50 +0000 (09:29 -0800)]
msg/Pipe: fix msg leak in requeue_sent()

The sent list owns a reference to each message.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoos/FileJournal: include limits.h
Sage Weil [Sun, 6 Jan 2013 04:53:49 +0000 (20:53 -0800)]
os/FileJournal: include limits.h

Needed for IOV_MAX.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: special case CALL op to not have RD bit effects
Sage Weil [Sat, 5 Jan 2013 01:43:41 +0000 (17:43 -0800)]
osd: special case CALL op to not have RD bit effects

In commit 20496b8d2b2c3779a771695c6f778abbdb66d92a we treat a CALL as
different from a normal "read", but we did not adjust the behavior
determined by the RD bit in the op.  We tried to fix that in
91e941aef9f55425cc12204146f26d79c444cfae, but changing the op code breaks
compatibility, so that was reverted.

Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit.  (And fix one lingering user to use that
helper appropriately.)

Fixes: #3731
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
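
The special case described in the CALL commit above can be sketched like
this. The numeric constants are placeholders chosen for the example and are
not the real Ceph op encodings; only the shape of the helper matters: the
op code keeps its RD bit on the wire, and the one helper that inspects that
bit makes an exception for CALL.

```python
# Placeholder encodings for illustration only (not the real Ceph values).
OP_MODE_RD = 0x1000
OP_TYPE_DATA = 0x0200
OP_TYPE_EXEC = 0x0300
OP_READ = OP_MODE_RD | OP_TYPE_DATA | 1
OP_CALL = OP_MODE_RD | OP_TYPE_EXEC | 1  # RD bit stays set for compatibility


def op_is_read(op):
    """Return True if the op should be treated as a regular read.

    CALL carries the RD bit in its op code, but is special-cased here so
    that it does not pick up regular read semantics.
    """
    if op == OP_CALL:
        return False
    return bool(op & OP_MODE_RD)
```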
12 years agoRevert "OSD: remove RD flag from CALL ops"
Sage Weil [Sat, 5 Jan 2013 04:46:48 +0000 (20:46 -0800)]
Revert "OSD: remove RD flag from CALL ops"

This reverts commit 91e941aef9f55425cc12204146f26d79c444cfae.

We cannot change this op code without breaking compatibility
with old code (client and server).  We'll have to special case
this op code instead.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoReplicatedPG: remove old-head optimization from push_to_replica
Samuel Just [Fri, 4 Jan 2013 20:43:52 +0000 (12:43 -0800)]
ReplicatedPG: remove old-head optimization from push_to_replica

This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone.  In general, using head in clone_subsets is tricky since
we might be writing to head during the push.  calc_clone_subsets does not
consider head (probably for this reason).  Handling the clone from head case
properly would require blocking writes on head in the interim which is probably
a bad trade off anyway.

Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.

Fixes: #3698
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoos/FileStore: fix non-btrfs op_seq commit order
Sage Weil [Fri, 4 Jan 2013 01:15:07 +0000 (17:15 -0800)]
os/FileStore: fix non-btrfs op_seq commit order

The op_seq file is the starting point for journal replay.  For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap.  We normally ignore current/ contents anyway.

On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).

This fixes a serious bug that could cause data loss and corruption after
a power loss event.  For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.

Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
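
The non-btrfs ordering described in the FileStore commit above (full sync
first, then write op_seq, then fsync it before anything is trimmed from the
journal) can be sketched as below. The function name, the file name
"op_seq", and the injectable full_sync parameter are assumptions for the
example; os.sync and os.fsync are the standard Python wrappers and are
available on Unix-like systems.

```python
import os


def commit_op_seq(dirpath, seq, full_sync=os.sync):
    """Persist the journal replay starting point in a safe order.

    1. Do a full sync so every write up to seq is on stable storage.
    2. Only then write the op_seq file.
    3. fsync the file before returning, so the caller may trim the journal.
    """
    full_sync()
    path = os.path.join(dirpath, "op_seq")  # hypothetical file name
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"%d\n" % seq)
        os.fsync(fd)
    finally:
        os.close(fd)
    return path
```

If op_seq were written before the sync, a power loss could leave a
sequence number on disk that is newer than the data it claims to cover,
and journal replay would skip entries that were never applied.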
12 years agoOSD: for old osds, dispatch peering messages immediately
Samuel Just [Thu, 3 Jan 2013 17:59:45 +0000 (09:59 -0800)]
OSD: for old osds, dispatch peering messages immediately

Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message.  However, old osds assume
that the activation message (log or info) will be _dispatched
before the first sub_op_modify of the interval.  Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-3714-b' into next
Sage Weil [Thu, 3 Jan 2013 20:53:07 +0000 (12:53 -0800)]
Merge remote-tracking branch 'gh/wip-3714-b' into next

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: move common active vs booting code into consume_map
Sage Weil [Thu, 3 Jan 2013 06:38:53 +0000 (22:38 -0800)]
osd: move common active vs booting code into consume_map

Push osdmaps to PGs in separate method from activate_map() (whose name
is becoming less and less accurate).

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: let pgs process map advances before booting
Sage Weil [Thu, 3 Jan 2013 06:20:06 +0000 (22:20 -0800)]
osd: let pgs process map advances before booting

The OSD deliberately consumes and processes most OSDMaps from the time it
was down before it marks itself up, as this can be slow.  The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD.  The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.

Fix this by calling activate_map() even when booting, and when booting
draining the peering_wq on each call.  This is harmless since we are
not yet processing actual ops; we only need to be async when active.

Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: drop oldest_last_clean from activate_map
Sage Weil [Thu, 3 Jan 2013 06:04:34 +0000 (22:04 -0800)]
osd: drop oldest_last_clean from activate_map

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: drop unused variables from activate_map
Sage Weil [Thu, 3 Jan 2013 06:04:08 +0000 (22:04 -0800)]
osd: drop unused variables from activate_map

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoOSDMap: fix modifed -> modified typo
Sage Weil [Thu, 3 Jan 2013 05:09:07 +0000 (21:09 -0800)]
OSDMap: fix modifed -> modified typo

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolog: fix locking typo/stupid for dump_recent()
Sage Weil [Wed, 2 Jan 2013 21:58:44 +0000 (13:58 -0800)]
log: fix locking typo/stupid for dump_recent()

We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agofuse: Fix cleanup code path on init failure
Sam Lang [Wed, 2 Jan 2013 22:07:13 +0000 (16:07 -0600)]
fuse: Fix cleanup code path on init failure

With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored.  Also, if an
error occurs during cfuse.init(), we need to tear down the client
mount.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoMerge branch 'wip-journal-aio' into next
Sage Weil [Wed, 2 Jan 2013 21:42:22 +0000 (13:42 -0800)]
Merge branch 'wip-journal-aio' into next

Reviewed-by: Samuel Just <sam.just@inktank.com>
Backport: bobtail

12 years agotest_filejournal: optionally specify journal filename as an argument
Sage Weil [Sat, 29 Dec 2012 00:48:22 +0000 (16:48 -0800)]
test_filejournal: optionally specify journal filename as an argument

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agotest_filejournal: test journaling bl with >IOV_MAX segments
Sage Weil [Sat, 29 Dec 2012 00:48:05 +0000 (16:48 -0800)]
test_filejournal: test journaling bl with >IOV_MAX segments

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoos/FileJournal: limit size of aio submission
Sage Weil [Sat, 29 Dec 2012 00:47:28 +0000 (16:47 -0800)]
os/FileJournal: limit size of aio submission

Limit size of each aio submission to IOV_MAX-1 (to be safe).  Take care to
only mark the last aio with the seq to signal completion.

Signed-off-by: Sage Weil <sage@inktank.com>
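
The chunking described in the FileJournal commit above can be sketched as
follows. The function name, the default iov_max of 1024, and the use of
None for "no seq" are assumptions for the example; the real IOV_MAX comes
from limits.h and is platform dependent.

```python
def split_aio_submissions(segments, seq, iov_max=1024):
    """Split buffer segments into submissions of at most iov_max - 1 each.

    Returns a list of (chunk, seq_or_None) pairs. Only the last chunk
    carries seq, so completion is signalled once, after the final aio.
    """
    limit = iov_max - 1  # stay one below the limit, to be safe
    submissions = []
    for start in range(0, len(segments), limit):
        chunk = segments[start:start + limit]
        is_last = start + limit >= len(segments)
        submissions.append((chunk, seq if is_last else None))
    return submissions
```

For example, 2500 segments with iov_max=1024 become three submissions of
1023, 1023, and 454 segments, and only the third is tagged with the seq.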
12 years agoMerge branch 'master' of https://github.com/ceph/ceph
Gary Lowell [Tue, 1 Jan 2013 05:35:03 +0000 (21:35 -0800)]
Merge branch 'master' of https://github.com/ceph/ceph

12 years agoMerge branch 'next'
Gary Lowell [Tue, 1 Jan 2013 05:31:17 +0000 (21:31 -0800)]
Merge branch 'next'

12 years agoMerge branch 'next'
Sage Weil [Tue, 1 Jan 2013 02:37:12 +0000 (18:37 -0800)]
Merge branch 'next'

12 years agoMerge remote-tracking branch 'gh/wip-3675'
Sage Weil [Tue, 1 Jan 2013 02:36:39 +0000 (18:36 -0800)]
Merge remote-tracking branch 'gh/wip-3675'

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agov0.56 v0.56
Gary Lowell [Tue, 1 Jan 2013 01:10:11 +0000 (17:10 -0800)]
v0.56

12 years agoclient: fix _create created ino condition
Sage Weil [Mon, 31 Dec 2012 23:28:25 +0000 (15:28 -0800)]
client: fix _create created ino condition

We get 8 bytes back for the created ino.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibcephfs: choose more unique nonce
Sage Weil [Mon, 31 Dec 2012 23:22:23 +0000 (15:22 -0800)]
libcephfs: choose more unique nonce

We were using a per-process counter combined with the pid.  A short
running process can easily loop through and reuse the same pid later.
Instead, go for 48 bits of randomness and the pid.  This way if we get
a dup pid we'll only get a dup nonce once out of 2^48 tries.

Avoids #3630 when running a libcephfs test in a loop (so that the pid
is eventually reused).  This is a better fix than the broken
8b599083705c2495810c00f9f5fd5bb8ace7f32e.  The real solution on the MDS
side involves cleaning up the msgr/MDS interaction with session
shutdown.

Signed-off-by: Sage Weil <sage@inktank.com>
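
A sketch of the nonce scheme described in the libcephfs commit above: 48
bits of randomness combined with the pid. The bit layout (random bits in
the upper 48 bits, low 16 bits of the pid below them) and the function name
are assumptions for illustration; the commit message does not spell out how
the two values are combined.

```python
import os
import secrets


def make_nonce(pid=None):
    """Build a 64-bit nonce from 48 random bits and the pid.

    Layout assumed for this sketch: [48 random bits][low 16 bits of pid].
    """
    if pid is None:
        pid = os.getpid()
    rand48 = secrets.randbits(48)
    return (rand48 << 16) | (pid & 0xFFFF)
```

With this construction, two processes that happen to reuse the same pid
collide only if they also draw the same 48 random bits.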
12 years agoclient: fix _create
Sage Weil [Mon, 31 Dec 2012 23:23:29 +0000 (15:23 -0800)]
client: fix _create

make_request() clears out req->reply and frees req; we can't inspect
it here.

Instead, just assume that extra_bl is the create flag/ino if it is
present.  Old code does not include an extra_bl on CREATE, and new code
will have the same first bytes for compatibility.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-3625'
Sage Weil [Mon, 31 Dec 2012 18:16:31 +0000 (10:16 -0800)]
Merge remote-tracking branch 'gh/wip-3625'

12 years agoMerge remote-tracking branch 'gh/wip-rbd-unprotect' into next
Sage Weil [Sun, 30 Dec 2012 23:29:37 +0000 (15:29 -0800)]
Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agodoc: add-or-rm-mons.rst: Add 'Changing Monitor's IPs' section
Joao Eduardo Luis [Thu, 20 Dec 2012 18:25:14 +0000 (18:25 +0000)]
doc: add-or-rm-mons.rst: Add 'Changing Monitor's IPs' section

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: add-or-rm-mons.rst: Clarify what the monitor name/id is.
Joao Eduardo Luis [Wed, 19 Dec 2012 16:48:37 +0000 (16:48 +0000)]
doc: add-or-rm-mons.rst: Clarify what the monitor name/id is.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agodoc: fix rbd permissions for unprotect
Josh Durgin [Sun, 30 Dec 2012 07:57:01 +0000 (23:57 -0800)]
doc: fix rbd permissions for unprotect

Unprotect examines all pools, so use blanket x before 0.54. After
that, use class-read restricted by object_prefix to rbd_children.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: fix race between unprotect and clone
Josh Durgin [Sun, 30 Dec 2012 04:35:15 +0000 (20:35 -0800)]
librbd: fix race between unprotect and clone

Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: open (source) image as read-only
Josh Durgin [Sun, 30 Dec 2012 04:26:57 +0000 (20:26 -0800)]
rbd: open (source) image as read-only

This allows users without write access to copy, export and list
information about an image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: open parent as read-only during clone
Josh Durgin [Sat, 29 Dec 2012 06:13:37 +0000 (22:13 -0800)]
librbd: open parent as read-only during clone

We never write to the parent, and don't need to watch it during this process.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: add {rbd_}open_read_only()
Josh Durgin [Sat, 29 Dec 2012 00:54:51 +0000 (16:54 -0800)]
librbd: add {rbd_}open_read_only()

Since 58890cfad5f7bee933baa599a68e6c65993379d4, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.

For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoOSD: remove RD flag from CALL ops
Josh Durgin [Sat, 29 Dec 2012 03:47:09 +0000 (19:47 -0800)]
OSD: remove RD flag from CALL ops

20496b8d2b2c3779a771695c6f778abbdb66d92a forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agocls_rbd: get_children does not need write permission
Josh Durgin [Sat, 29 Dec 2012 03:44:36 +0000 (19:44 -0800)]
cls_rbd: get_children does not need write permission

This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoRevert "mds: replace closed sessions on connect"
Sage Weil [Sat, 29 Dec 2012 16:38:52 +0000 (08:38 -0800)]
Revert "mds: replace closed sessions on connect"

This reverts commit 8b599083705c2495810c00f9f5fd5bb8ace7f32e.

This fix is not correct.  See #3696.

12 years agomsg/Pipe: use state_closed atomic_t for _lookup_pipe
Sage Weil [Sat, 29 Dec 2012 01:20:43 +0000 (17:20 -0800)]
msg/Pipe: use state_closed atomic_t for _lookup_pipe

We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock.  Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: inject delays at inconvenient times
Sage Weil [Sun, 23 Dec 2012 21:43:15 +0000 (13:43 -0800)]
msgr: inject delays at inconvenient times

Exercise some rare races by injecting delays before taking locks
via the 'ms inject internal delays' option.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: fix race on Pipe removal from hash
Sage Weil [Sun, 23 Dec 2012 17:22:18 +0000 (09:22 -0800)]
msgr: fix race on Pipe removal from hash

When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry.  The Pipe in this case will
have STATE_CLOSED.  Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.

This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.

See bug #3675.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: don't queue message on closed pipe
Sage Weil [Sun, 23 Dec 2012 17:19:05 +0000 (09:19 -0800)]
msgr: don't queue message on closed pipe

If we have a con that refs a pipe but it is closed, don't use it.  If
the ref is still there, it is only because we are racing with fault()
and it is about to be (or just was) detached.  Either way,

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: atomically queue first message with connect_rank
Sage Weil [Sun, 23 Dec 2012 05:24:52 +0000 (21:24 -0800)]
msgr: atomically queue first message with connect_rank

Atomically queue the first message on the new pipe, without dropping
and retaking pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sat, 29 Dec 2012 01:19:46 +0000 (17:19 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agotest: mon: workloadgen: debug when message fsid != monmap fsid
Joao Eduardo Luis [Wed, 19 Dec 2012 01:37:47 +0000 (01:37 +0000)]
test: mon: workloadgen: debug when message fsid != monmap fsid

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agotest: mon: workloadgen: assert if monmap's fsid is zero after authenticate
Joao Eduardo Luis [Tue, 18 Dec 2012 15:34:12 +0000 (15:34 +0000)]
test: mon: workloadgen: assert if monmap's fsid is zero after authenticate

Fixes: #3629
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agodoc: update Hadoop documentation
Noah Watkins [Mon, 24 Dec 2012 00:01:42 +0000 (16:01 -0800)]
doc: update Hadoop documentation

Updates configuration option names, and adds object.size,
localize.reads, and root.dir control options.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years agoinit-ceph: ok, 8K files
Sage Weil [Sat, 29 Dec 2012 01:12:06 +0000 (17:12 -0800)]
init-ceph: ok, 8K files

16K might be a bit many.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: remove broken cephx signing requirement check
Sage Weil [Fri, 28 Dec 2012 00:01:49 +0000 (16:01 -0800)]
msg/Pipe: remove broken cephx signing requirement check

Remove the special-case check, which does not inform the peer what
protocol features are missing.  It also enforces this requirement even
when we negotiate auth none.

Reported as part of bug #3657.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: include remote socket addr in debug output
Sage Weil [Sat, 29 Dec 2012 00:00:47 +0000 (16:00 -0800)]
msg/Pipe: include remote socket addr in debug output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoos/FileJournal: logger is optional
Sage Weil [Fri, 28 Dec 2012 23:44:51 +0000 (15:44 -0800)]
os/FileJournal: logger is optional

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoclient: fix fh leak in non-create case
Sage Weil [Fri, 28 Dec 2012 23:14:25 +0000 (15:14 -0800)]
client: fix fh leak in non-create case

We may take the O_CREAT path and get an fh from _create, but created can
still be false.  In that case, skip the final _open call.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: Return created inode in mds reply to create
Sam Lang [Wed, 19 Dec 2012 20:17:29 +0000 (10:17 -1000)]
mds: Return created inode in mds reply to create

If multiple clients race to create a file, multiple clients will send a
create request and get back a valid dentry+inode, but only one client
will actually win the race to create the file.  All other clients should
treat the reply as an open of an existing file and check permissions.
This patch adds the created inode number to the mds create reply if that
request actually created the inode/file (and the feature is supported),
so the client can properly check permissions if the inode number isn't
returned.  Fixes #3625.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoclient: Make ll_create use _create
Sam Lang [Mon, 17 Dec 2012 19:54:23 +0000 (09:54 -1000)]
client: Make ll_create use _create

This is a fix for bug #3625, where multiple clients race to create a
file, and the loser returns EEXIST instead of a valid file handle.
The patch modifies ll_create in the Client class to use _create(),
which sends the request to the MDS (where an atomic create/open is
performed).

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agolog: broadcast cond signals
Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]
log: broadcast cond signals

We were using a single cond, and only signalling one waiter.  That means
that if the flusher and several logging threads are waiting, and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.

Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.

Instead, break the single cond into two: one for loggers, and one for the
flusher.  Always signal the (one) flusher, and always broadcast to all
loggers.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
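
The two-condition scheme described in the log commit above can be sketched
with Python's threading module. The class BoundedLog and its method names
are hypothetical; the sketch only shows the signalling pattern: loggers
wake the single flusher with notify(), and the flusher wakes every blocked
logger with notify_all().

```python
import threading
from collections import deque


class BoundedLog:
    """Bounded log queue with separate conds for loggers and the flusher."""

    def __init__(self, max_entries):
        self._lock = threading.Lock()
        # Two condition variables sharing one lock.
        self._cond_loggers = threading.Condition(self._lock)
        self._cond_flusher = threading.Condition(self._lock)
        self._queue = deque()
        self._max = max_entries
        self._stop = False
        self.flushed = []

    def submit(self, entry):
        with self._lock:
            while len(self._queue) >= self._max:
                self._cond_loggers.wait()  # queue is full: wait for a flush
            self._queue.append(entry)
            self._cond_flusher.notify()  # there is exactly one flusher

    def stop(self):
        with self._lock:
            self._stop = True
            self._cond_flusher.notify()

    def run_flusher(self):
        while True:
            with self._lock:
                while not self._queue and not self._stop:
                    self._cond_flusher.wait()
                if not self._queue and self._stop:
                    return
                batch = list(self._queue)
                self._queue.clear()
                self._cond_loggers.notify_all()  # wake every blocked logger
            self.flushed.extend(batch)  # the "flush" happens outside the lock
```

Because a logger can never wake another logger in place of the flusher,
and the flusher never wakes just one of several blocked loggers, neither
of the lost-wakeup scenarios described above can occur in this sketch.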
12 years agoosd: allow RecoveryDone self-transition in RepNotRecovering
Sage Weil [Fri, 28 Dec 2012 19:34:47 +0000 (11:34 -0800)]
osd: allow RecoveryDone self-transition in RepNotRecovering

In a mixed cluster where some OSDs support the recovery reservations and
some don't, the replica may be new code in RepNotRecovering and will
complete a backfill.  In that case, we want to just stay in
RepNotRecovering.

It may also be possible to make it infer what the primary is doing even
though it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.

Fixes: #3689
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-gl-docs'
Gary Lowell [Fri, 28 Dec 2012 22:15:37 +0000 (14:15 -0800)]
Merge remote-tracking branch 'origin/wip-gl-docs'

Update release process documentation.

12 years agodocs: fix typo in release-process doc
Gary Lowell [Fri, 28 Dec 2012 22:05:56 +0000 (14:05 -0800)]
docs: fix typo in release-process doc

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoosd: less noise about inefficient tmap updates
Sage Weil [Fri, 28 Dec 2012 20:34:15 +0000 (12:34 -0800)]
osd: less noise about inefficient tmap updates

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoinit-ceph: default to 16K max_open_files
Sage Weil [Fri, 28 Dec 2012 20:11:55 +0000 (12:11 -0800)]
init-ceph: default to 16K max_open_files

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-fuse: Avoid doing handle cleanup in dtor
Sam Lang [Fri, 28 Dec 2012 19:58:39 +0000 (13:58 -0600)]
ceph-fuse: Avoid doing handle cleanup in dtor

The CephFuse::Handle class needs the client
pointer to be valid for finalizing, so don't finalize
in the destructor (which doesn't get called till the
fuse handle leaves scope), instead use a finalize method that
gets called explicitly before the client pointer is freed.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoceph-fuse: Pass client handle as userdata
Sam Lang [Fri, 28 Dec 2012 19:10:04 +0000 (13:10 -0600)]
ceph-fuse: Pass client handle as userdata

The fuse lowlevel API isn't getting the client
handle when it gets initialized, resulting
in a null pointer for all the subsequent calls.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agodoc: warn about using caching without QEMU knowing
Josh Durgin [Wed, 26 Dec 2012 22:25:51 +0000 (14:25 -0800)]
doc: warn about using caching without QEMU knowing

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorgw: disable ops and usage logging by default
Sage Weil [Thu, 27 Dec 2012 21:27:46 +0000 (13:27 -0800)]
rgw: disable ops and usage logging by default

Most users don't need this, and having it on will just fill their clusters
with objects that will need to be cleaned up later.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agofeatures is uint64_t
Sage Weil [Fri, 28 Dec 2012 00:38:45 +0000 (16:38 -0800)]
features is uint64_t

This won't bite us for a while yet (we're on bit 26), but it will soon!

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
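
The hazard behind the uint64_t commit above can be shown with a little bit
arithmetic. Python integers are unbounded, so the masks below stand in for
32-bit and 64-bit variables; the helper names are made up for the example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def store(features, width_mask):
    """Simulate storing a feature bitmask in a fixed-width variable."""
    return features & width_mask


def has_feature(stored, bit):
    """Test whether a feature bit survived in the stored value."""
    return bool(stored & (1 << bit))


# Bit 26 fits in 32 bits; a future bit 33 does not.
features = (1 << 26) | (1 << 33)
in_u32 = store(features, MASK32)
in_u64 = store(features, MASK64)
```

Here has_feature(in_u32, 26) is still true, but has_feature(in_u32, 33) is
false: the high bit is silently dropped by the narrower type, while the
64-bit value keeps both.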
12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Fri, 28 Dec 2012 01:15:29 +0000 (17:15 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agoceph-fuse: Split main into init/main/finalize
Sam Lang [Thu, 6 Dec 2012 05:21:12 +0000 (23:21 -0600)]
ceph-fuse: Split main into init/main/finalize

With the invalidate callback enabled for fuse, the Client::unmount
call requires the fuse channel and session objects remain for performing
the invalidate callbacks.  This patch splits the ceph_fuse_ll_main
call into init, main, and finalize functions, so finalization of the
channel and session objects can be done after the unmount completes.

The patch includes cleanup for the code in fuse_ll.cc to make it more
in the style of C++ and make use of the pimpl idiom to hide the fuse
structures within the CephFuse::Handle pimpl class.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agojava: remove deprecated libcephfs
Noah Watkins [Thu, 27 Dec 2012 20:06:02 +0000 (12:06 -0800)]
java: remove deprecated libcephfs

Removes ceph_set_default_*

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years agoinit-ceph: fix status version check across machines
Sage Weil [Fri, 28 Dec 2012 00:06:24 +0000 (16:06 -0800)]
init-ceph: fix status version check across machines

The local state isn't propagated into the backtick shell, resulting in
'unknown' for all remote daemons.  Avoid backticks altogether.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodocs: update release process documentation.
Gary Lowell [Thu, 27 Dec 2012 23:39:46 +0000 (15:39 -0800)]
docs: update release process documentation.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-mds'
Sage Weil [Thu, 27 Dec 2012 21:40:01 +0000 (13:40 -0800)]
Merge remote-tracking branch 'gh/wip-mds'

12 years agoosd: fix recovery assert for pg repair case
Sage Weil [Wed, 26 Dec 2012 23:27:07 +0000 (15:27 -0800)]
osd: fix recovery assert for pg repair case

In the case of PG repair, this assert is not valid.  Disable it for now.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-osd-flags'
Sage Weil [Thu, 27 Dec 2012 21:09:24 +0000 (13:09 -0800)]
Merge branch 'wip-osd-flags'

12 years agoMerge remote-tracking branch 'gh/wip-mds-pool'
Sage Weil [Thu, 27 Dec 2012 21:07:57 +0000 (13:07 -0800)]
Merge remote-tracking branch 'gh/wip-mds-pool'

Reviewed-by: Sam Lang <sam.lang@inktank.com>
12 years agoosd: only calculate OpRequest rmw flags once
Sage Weil [Fri, 7 Dec 2012 21:28:55 +0000 (13:28 -0800)]
osd: only calculate OpRequest rmw flags once

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomessages/MOSDOpReply: remove misleading may_read/may_write
Sage Weil [Fri, 7 Dec 2012 21:18:50 +0000 (13:18 -0800)]
messages/MOSDOpReply: remove misleading may_read/may_write

These are OpRequest properties, calculated/enforced at the OSD.  They don't
belong in the MOSDOp or MOSDOpReply messages.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: move rmw_flags to OpRequest, out of MOSDOp
Sage Weil [Fri, 7 Dec 2012 21:14:26 +0000 (13:14 -0800)]
osd: move rmw_flags to OpRequest, out of MOSDOp

It was very sloppy to put server-side processing state inside the
message.  Move it to the OpRequestRef instead.

Note that the client was filling in bogus data that was then lost during
encoding/decoding; clean that up.

Signed-off-by: Sage Weil <sage@inktank.com>
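
A rough sketch of the separation described above: the wire message carries only what the client sent, while a server-side request object owns the derived read/write flags and computes them a single time. All type and member names here are hypothetical simplifications, not the real MOSDOp or OpRequest definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical flag values and op codes; illustrative only.
enum : uint32_t { FLAG_READ = 1u << 0, FLAG_WRITE = 1u << 1 };
enum class OpCode { Read, Write };

// Wire message: contains only client-supplied data, no server state.
struct MessageSketch {
  std::vector<OpCode> ops;
};

// Server-side wrapper: holds processing state derived from the message.
struct OpRequestSketch {
  std::shared_ptr<const MessageSketch> msg;
  uint32_t rmw_flags = 0;
  bool flags_ready = false;
  int computations = 0;  // counts how often the flags were derived

  explicit OpRequestSketch(std::shared_ptr<const MessageSketch> m)
      : msg(std::move(m)) {}

  // Derive the flags from the ops; repeated calls are no-ops.
  void init_flags_once() {
    if (flags_ready) return;
    for (OpCode op : msg->ops)
      rmw_flags |= (op == OpCode::Read) ? FLAG_READ : FLAG_WRITE;
    flags_ready = true;
    ++computations;
  }

  bool may_read() const { return (rmw_flags & FLAG_READ) != 0; }
  bool may_write() const { return (rmw_flags & FLAG_WRITE) != 0; }
};
```

Because the flags never travel in the message, nothing a client fills in can influence them, and nothing is lost during encoding or decoding.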
12 years agodropping xfs test 186 due to bug: 3685
tamil [Thu, 27 Dec 2012 19:27:31 +0000 (11:27 -0800)]
dropping xfs test 186 due to bug: 3685

Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
12 years agodocs: remove extra release-process2 file.
Gary Lowell [Thu, 27 Dec 2012 19:12:27 +0000 (11:12 -0800)]
docs: remove extra release-process2 file.

This file mostly duplicated the existing release documentation.  Differences
have been merged into the primary file.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoosd: drop 'osd recovery max active' back to previous default (5)
Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]
osd: drop 'osd recovery max active' back to previous default (5)

Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high.  In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).

Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agojournal: reduce journal max queue size
Sage Weil [Thu, 27 Dec 2012 19:11:08 +0000 (11:11 -0800)]
journal: reduce journal max queue size

Keep the journal queue size smaller than the filestore queue size.

Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: use set to store MDSMap data pools
Sage Weil [Thu, 27 Dec 2012 19:09:00 +0000 (11:09 -0800)]
mds: use set to store MDSMap data pools

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: wait for client's mdsmap when specifying data pool
Sage Weil [Wed, 26 Dec 2012 18:45:08 +0000 (10:45 -0800)]
mds: wait for client's mdsmap when specifying data pool

The client may have a newer map than we do; make sure we wait for it lest
we inadvertently reply because we think the pool doesn't exist.

Signed-off-by: Sage Weil <sage@inktank.com>
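
The ordering problem described above can be sketched as an epoch comparison. This is an assumption-laden illustration, not the MDS code: MapSketch, PoolCheck, and check_pool are hypothetical, and the sketch assumes the request carries the epoch of the client's map.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical outcome of validating a client-specified data pool.
enum class PoolCheck { Ok, NoSuchPool, WaitForNewerMap };

// Hypothetical, simplified view of the map: an epoch and a set of pools.
struct MapSketch {
  uint64_t epoch;
  std::set<int64_t> data_pools;
};

// Decide how to respond when a client names a data pool.
inline PoolCheck check_pool(const MapSketch& ours, uint64_t client_epoch,
                            int64_t pool) {
  // The client has seen a newer map; our view may simply be stale,
  // so defer the decision until we have caught up.
  if (client_epoch > ours.epoch) return PoolCheck::WaitForNewerMap;
  // Our map is at least as new as the client's, so its answer is final.
  return ours.data_pools.count(pool) ? PoolCheck::Ok : PoolCheck::NoSuchPool;
}
```

Only when the local epoch is at least the client's does a missing pool produce an error; otherwise the request waits.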