git.apps.os.sepia.ceph.com Git - ceph.git/log
12 years agolibrbd: handle parent change while async I/Os are in flight
Dan Mick [Sat, 1 Dec 2012 02:11:09 +0000 (18:11 -0800)]
librbd: handle parent change while async I/Os are in flight

During a test_librbd_fsx run including flatten, ImageCtx->parent
was being dereferenced while null.  Between the time the parent
overlap is calculated and the time the guard+write completes
with ENOENT and submits the copyup+write, the parent image
could have changed (by resize) or been made irrelevant (by
child flatten) such that the parent overlap is now incorrect.

Handle "no parent" by just sending the copyup+write; the copyup
part will be a no-op.  Move to WRITE_FLAT state in this case
because there's no more child to deal with.

Handle "overlap changed" by recalculating overlap before
reading parent data; if none is left, don't read, but rather
just clear m_object_image_extents, in which case the copyup
will again be a no-op because it will be of zero length.
However we still have a parent, so stay in WRITE_COPYUP state
and come back through as usual.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Fixes: #3524
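
The two cases handled above can be sketched roughly as follows. This is only an illustration, not the librbd code: the `Parent` and `Write` types and the function `on_guard_enoent` are hypothetical, and only the state names WRITE_FLAT and WRITE_COPYUP come from the message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the librbd structures named in the message.
enum class State { WRITE_COPYUP, WRITE_FLAT };

struct Parent {
  uint64_t overlap;      // bytes of the child image still backed by the parent
};

struct Write {
  uint64_t offset;       // image offset of the object extent
  uint64_t length;       // length of the object extent
  uint64_t copyup_len;   // bytes to read from the parent (0 = no-op copyup)
  State state;
};

// Called when the guarded write came back with ENOENT. The parent may have
// been resized or removed (flatten) since the overlap was first computed, so
// it is re-checked here instead of trusting the earlier value.
void on_guard_enoent(Write& w, const Parent* parent) {
  if (parent == nullptr) {
    // "No parent": the copyup part becomes a no-op and the image is flat.
    w.copyup_len = 0;
    w.state = State::WRITE_FLAT;
    return;
  }
  // "Overlap changed": recalculate how much of this extent the parent covers.
  uint64_t end = std::min(w.offset + w.length, parent->overlap);
  w.copyup_len = end > w.offset ? end - w.offset : 0;
  // There is still a parent, so stay in the copyup state.
  w.state = State::WRITE_COPYUP;
}
```

With a null parent the write moves to WRITE_FLAT; with a shrunken overlap the copyup length is clamped, possibly to zero, while the state stays WRITE_COPYUP.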
12 years agoStriper: use local variable inside if() that tested it
Dan Mick [Sat, 1 Dec 2012 01:21:24 +0000 (17:21 -0800)]
Striper: use local variable inside if() that tested it

Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agoqa: add script for running xfstests in a vm
Josh Durgin [Wed, 5 Dec 2012 23:54:11 +0000 (15:54 -0800)]
qa: add script for running xfstests in a vm

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoOSD: ignore queries on now deleted pools
Samuel Just [Wed, 5 Dec 2012 19:11:10 +0000 (11:11 -0800)]
OSD: ignore queries on now deleted pools

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-mds' into next
Greg Farnum [Wed, 5 Dec 2012 00:48:09 +0000 (16:48 -0800)]
Merge remote-tracking branch 'origin/wip-mds' into next

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge branch 'wip-filestore' into next
Sage Weil [Tue, 4 Dec 2012 23:05:18 +0000 (15:05 -0800)]
Merge branch 'wip-filestore' into next

Reviewed-by: Sam Just <sam.just@inktank.com>
12 years agoMerge branch 'wip-msgr-delay-queue' into next
Sage Weil [Tue, 4 Dec 2012 22:52:22 +0000 (14:52 -0800)]
Merge branch 'wip-msgr-delay-queue' into next

12 years agomds: journal remote inode's projected parent
Yan, Zheng [Tue, 4 Dec 2012 08:09:48 +0000 (16:09 +0800)]
mds: journal remote inode's projected parent

Server::_rename_prepare() adds remote inode's parent instead of
projected parent to the journal. So during journal replay, the
journal entry for the rename operation will wrongly revert the
remote inode's projected rename. This issue can be reproduced by:

 touch file1
 ln file1 file2
 rm file1
 mv file2 file3

After journal replay, file1 reappears and directory's fragstat
gets corrupted.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't create bloom filter for incomplete dir
Yan, Zheng [Tue, 4 Dec 2012 08:09:47 +0000 (16:09 +0800)]
mds: don't create bloom filter for incomplete dir

Creating a bloom filter for an incomplete dir that was added by log
replay will confuse subsequent dir lookups and can create a null
dentry for an existing file. The erroneous null dentry confuses the
fragstat accounting and causes an undeletable empty directory.

The fix is to check whether the dir is complete before creating the
bloom filter. For the MDCache::trim_non_auth{,_subtree} cases, just do
not call CDir::add_to_bloom, because a bloom filter is useless for a
replica.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agoPG: remove last_epoch_started asserts in proc_primary_info
Samuel Just [Tue, 4 Dec 2012 19:36:58 +0000 (11:36 -0800)]
PG: remove last_epoch_started asserts in proc_primary_info

These asserts are valid for a uniform cluster, but they won't hold
for a replica running a version without the info.last_epoch_started
patch.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: journal remote inode's projected parent
Yan, Zheng [Tue, 4 Dec 2012 08:09:48 +0000 (16:09 +0800)]
mds: journal remote inode's projected parent

Server::_rename_prepare() adds remote inode's parent instead of
projected parent to the journal. So during journal replay, the
journal entry for the rename operation will wrongly revert the
remote inode's projected rename. This issue can be reproduced by:

 touch file1
 ln file1 file2
 rm file1
 mv file2 file3

After journal replay, file1 reappears and directory's fragstat
gets corrupted.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't create bloom filter for incomplete dir
Yan, Zheng [Tue, 4 Dec 2012 08:09:47 +0000 (16:09 +0800)]
mds: don't create bloom filter for incomplete dir

Creating a bloom filter for an incomplete dir that was added by log
replay will confuse subsequent dir lookups and can create a null
dentry for an existing file. The erroneous null dentry confuses the
fragstat accounting and causes an undeletable empty directory.

The fix is to check whether the dir is complete before creating the
bloom filter. For the MDCache::trim_non_auth{,_subtree} cases, just do
not call CDir::add_to_bloom, because a bloom filter is useless for a
replica.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agoMerge remote-tracking branch 'gh/wip-mds' into next
Sage Weil [Tue, 4 Dec 2012 13:27:59 +0000 (05:27 -0800)]
Merge remote-tracking branch 'gh/wip-mds' into next

12 years agologrotate: do not spam stdout
Sage Weil [Tue, 4 Dec 2012 13:25:52 +0000 (05:25 -0800)]
logrotate: do not spam stdout

Avoid anything on stdout that will generate cron emails for people.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'master' of http://github.com/ceph/ceph
Gary Lowell [Tue, 4 Dec 2012 05:31:14 +0000 (21:31 -0800)]
Merge branch 'master' of http://github.com/ceph/ceph

12 years agoMerge branch 'next'
Gary Lowell [Tue, 4 Dec 2012 05:29:52 +0000 (21:29 -0800)]
Merge branch 'next'

12 years agodoc: Added a striping section for Architecture.
John Wilkins [Tue, 4 Dec 2012 04:48:02 +0000 (20:48 -0800)]
doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agov0.55 v0.55
Gary Lowell [Tue, 4 Dec 2012 03:08:35 +0000 (19:08 -0800)]
v0.55

12 years agoceph.spec.in: Add SLES and remove Fedora from debug package list.
Gary Lowell [Tue, 4 Dec 2012 03:06:42 +0000 (19:06 -0800)]
ceph.spec.in:  Add SLES and remove Fedora from debug package list.

12 years agoMerge branch 'next'
Sage Weil [Mon, 3 Dec 2012 23:33:29 +0000 (15:33 -0800)]
Merge branch 'next'

12 years agotest_rados_api_misc: fix dup rmmkey test
Sage Weil [Mon, 3 Dec 2012 23:29:56 +0000 (15:29 -0800)]
test_rados_api_misc: fix dup rmmkey test

We now expect ENOENT again, as of
9961640f76a950c674c0e7cc2453931088c63fd7.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: Fixed many hyperlinks, a few typos, and some minor clarifications.
John Wilkins [Mon, 3 Dec 2012 20:22:37 +0000 (12:22 -0800)]
doc: Fixed many hyperlinks, a few typos, and some minor clarifications.

fixes: #3564

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Clarified example for root user.
John Wilkins [Mon, 3 Dec 2012 18:48:10 +0000 (10:48 -0800)]
doc: Clarified example for root user.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoconfig: we still want osd_thread_recovery_timeout
Sage Weil [Mon, 3 Dec 2012 11:56:15 +0000 (03:56 -0800)]
config: we still want osd_thread_recovery_timeout

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoconfig: Remove unused options
Sam Lang [Sat, 1 Dec 2012 22:59:36 +0000 (16:59 -0600)]
config: Remove unused options

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoclient: Fix ceph_mount() when subdir is specified
Sam Lang [Thu, 29 Nov 2012 20:32:32 +0000 (14:32 -0600)]
client: Fix ceph_mount() when subdir is specified

If a subdirectory is specified to ceph_mount, the
root inode does not have an ino of CEPH_INO_ROOT, so
cwd never finds the root and eventually hits
an assertion in in->get_first_parent().  This fix uses
the inode stored in the root member instead, ensuring
that we stop wherever the mount is rooted.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
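
A minimal sketch of the idea, assuming a much-reduced inode with only a name and a parent link. `Inode` and `path_from_root` are hypothetical names for illustration and are not the client's real types.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified inode: just a name and a parent link.
struct Inode {
  std::string name;
  const Inode* parent;   // nullptr for the filesystem root
};

// Build the path of `in` relative to `mount_root` by walking parent links.
// The walk stops at the inode the mount is rooted at, rather than looking
// for a fixed root inode number that a subdirectory mount never reaches.
std::string path_from_root(const Inode* in, const Inode* mount_root) {
  std::string path;
  while (in != mount_root) {
    assert(in->parent != nullptr);   // would mean `in` is outside the mount
    path = "/" + in->name + path;
    in = in->parent;
  }
  return path.empty() ? "/" : path;
}
```

Comparing against the stored root inode makes the loop terminate for both a full mount and a subdirectory mount.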
12 years agoosd: EINVAL on unknown TMAP op code
Sage Weil [Wed, 28 Nov 2012 04:04:34 +0000 (20:04 -0800)]
osd: EINVAL on unknown TMAP op code

The old/slow implementation did this, but the optimized version did
not.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: use TMAP_RMSLOPPY op when removing dentries
Sage Weil [Wed, 28 Nov 2012 04:43:38 +0000 (20:43 -0800)]
mds: use TMAP_RMSLOPPY op when removing dentries

After replay, we don't know if the dentry removal has already been
committed.  Use a sloppy removal so that we succeed even if we are
repeating the operation.

Conveniently, the previous implementation (pre v0.55) silently ignored
tmap op codes it did not understand, which means this new RMSLOPPY will
be interpreted the same as an actual RMSLOPPY.  That means a v0.55
mds can run against an older osd (say, argonaut) without problems.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: add TMAP_RMSLOPPY op
Sage Weil [Wed, 28 Nov 2012 04:00:19 +0000 (20:00 -0800)]
osd: add TMAP_RMSLOPPY op

Remove a key, but succeed if key already does not exist.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: ENOENT on TMAP_RM on non-existent key
Sage Weil [Wed, 28 Nov 2012 04:02:54 +0000 (20:02 -0800)]
osd: ENOENT on TMAP_RM on non-existent key

This reverts 29fae494d0b1459c8bb934d42446e0ada7355402 and fixes the
alternate implementation added by 8e91d00b52808aa1a4e3a838deda34a439.
librbd relies on the ENOENT return value.

Reported-by: Dan Mick <dan.mick@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
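
The return-code behaviour described in the TMAP commits above (EINVAL for an unknown op, ENOENT for TMAP_RM on a missing key, success for TMAP_RMSLOPPY either way) can be sketched against a plain in-memory map. The op code values and the function `apply_tmap_op` are assumptions made for this illustration; the real TMAP encoding is not shown in the log.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Hypothetical op codes, chosen only for this sketch.
constexpr int OP_SET = 1;
constexpr int OP_RM = 2;
constexpr int OP_RMSLOPPY = 3;

// Apply one op to an in-memory key/value map and return 0 or -errno.
int apply_tmap_op(std::map<std::string, std::string>& m, int op,
                  const std::string& key, const std::string& val = "") {
  switch (op) {
    case OP_SET:
      m[key] = val;
      return 0;
    case OP_RM:
      // Strict removal: a missing key is an error the caller can see.
      return m.erase(key) ? 0 : -ENOENT;
    case OP_RMSLOPPY:
      // Sloppy removal: succeed whether or not the key existed, so a
      // repeated removal after replay is harmless.
      m.erase(key);
      return 0;
    default:
      // Unknown op codes are rejected rather than silently ignored.
      return -EINVAL;
  }
}
```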
12 years agoos/JournalingObjectStore: applied_seq -> max_applied_seq
Sage Weil [Sun, 2 Dec 2012 15:31:49 +0000 (07:31 -0800)]
os/JournalingObjectStore: applied_seq -> max_applied_seq

Rename applied_seq to max_applied_seq, since it is a bound; there may be
seq's < max_applied_seq that are not applied.  This aligns the naming with
max_applying_seq.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoos/FileStore: only wait for applying ops to complete before commit
Sage Weil [Sun, 2 Dec 2012 15:29:46 +0000 (07:29 -0800)]
os/FileStore: only wait for applying ops to complete before commit

We can have a large number of operations in the op_wq waiting to be applied
to the fs.  Currently, when we want to commit, we wait for them *all* to
apply.  This can take a very long time (the default queue length is 500
operations!).

Instead, mark an Op as started ("applying") when the thread pool actually
starts to apply it.  At that point, only wait for applying ops to complete.
We let any threads with an op seq < max_applying_seq begin as well so that
we have a proper ordering/barrier.  When those flush, applied_seq will ==
max_applying_seq, and that becomes the committing_seq value.

Note that 'applied_seq' is still maintained, but serves no real purpose
except to populate our asserts with sanity checks.  max_applying_seq serves
the purpose applied_seq used to.

This removes one unnecessary source of latency associated with fs
commits.

Signed-off-by: Sage Weil <sage@inktank.com>
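
A rough sketch of the bookkeeping described above, assuming ops carry increasing sequence numbers. `ApplyTracker` and its method names are hypothetical; only max_applying_seq and committing_seq are taken from the message, and this is not the FileStore implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical tracker for ops moving from the work queue to "applying".
class ApplyTracker {
  std::set<uint64_t> queued_;        // in the op queue, not yet started
  std::set<uint64_t> applying_;      // a worker thread has started the op
  uint64_t max_applying_seq_ = 0;    // highest seq that has started applying

public:
  void queue(uint64_t seq) { queued_.insert(seq); }

  void start(uint64_t seq) {
    queued_.erase(seq);
    applying_.insert(seq);
    max_applying_seq_ = std::max(max_applying_seq_, seq);
  }

  void finish(uint64_t seq) { applying_.erase(seq); }

  // A commit waits only for ops that have started, plus any queued op that
  // is ordered before max_applying_seq (so the commit point is a barrier).
  // Queued ops after max_applying_seq do not hold the commit up.
  bool ready_to_commit() const {
    bool older_queued =
        !queued_.empty() && *queued_.begin() < max_applying_seq_;
    return applying_.empty() && !older_queued;
  }

  uint64_t committing_seq() const { return max_applying_seq_; }
};
```

In this sketch, ops 4 and 5 sitting in the queue never delay a commit at seq 3, which is the latency the change is meant to remove.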
12 years agoosd: fix RepModify when past last_peering_reset
Sage Weil [Sun, 2 Dec 2012 03:15:18 +0000 (19:15 -0800)]
osd: fix RepModify when past last_peering_reset

If we apply or commit a RepModify from a previous peering interval, we need
to free it.

This fixes 'slow request' messages when in fact client requests are not
delayed, and plugs the related memory leak.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-coverity'
Yehuda Sadeh [Sun, 2 Dec 2012 03:32:07 +0000 (19:32 -0800)]
Merge remote-tracking branch 'origin/wip-coverity'

12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sun, 2 Dec 2012 02:23:02 +0000 (18:23 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agoOutputDataSocket: fix uninit var
Sage Weil [Sat, 1 Dec 2012 05:43:06 +0000 (21:43 -0800)]
OutputDataSocket: fix uninit var

CID 745933 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "data_size" is not initialized in this constructor nor in any functions that it calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorgw: fix uninit var
Sage Weil [Sat, 1 Dec 2012 05:41:54 +0000 (21:41 -0800)]
rgw: fix uninit var

CID 745935 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "expiration" is not initialized in this constructor nor in any functions that it calls.

At (2): Non-static class member "min_len" is not initialized in this constructor nor in any functions that it calls.
At (4): Non-static class member "max_len" is not initialized in this constructor nor in any functions that it calls.
At (6): Non-static class member "ret" is not initialized in this constructor nor in any functions that it calls.
At (8): Non-static class member "len" is not initialized in this constructor nor in any functions that it calls.
At (10): Non-static class member "ofs" is not initialized in this constructor nor in any functions that it calls.
At (12): Non-static class member "supplied_md5_b64" is not initialized in this constructor nor in any functions that it calls.
At (14): Non-static class member "supplied_etag" is not initialized in this constructor nor in any functions that it calls.
CID 745934 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
At (16): Non-static class member "data_pending" is not initialized in this constructor nor in any functions that it calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agotest/osdc/FakeWriteback: fix uninit var
Sage Weil [Sat, 1 Dec 2012 05:39:05 +0000 (21:39 -0800)]
test/osdc/FakeWriteback: fix uninit var

CID 745936 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "m_off" is not initialized in this constructor nor in any functions that it calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix missing unlock; simplify
Sage Weil [Sat, 1 Dec 2012 05:35:20 +0000 (21:35 -0800)]
osd: fix missing unlock; simplify

Instead of a special-case exit, just skip the con replacement.  Continue
on to mark the old con down.

CID 745920 (#1 of 1): Missing unlock (LOCK)
At (8): Returning without unlocking "this->heartbeat_lock._m".

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: fix freeze inode deadlock
Yan, Zheng [Mon, 19 Nov 2012 02:43:46 +0000 (10:43 +0800)]
mds: fix freeze inode deadlock

CInode::freeze_inode() is used in the case of cross authority rename.
Server::handle_slave_rename_prep() calls it to wait for all other
operations on the source inode to complete. This happens after all locks
for the rename operation are acquired. But to acquire locks, we need
to auth pin the locks' parent objects first. So there is an ABBA deadlock
if someone auth pins the source inode after locks for rename are
acquired and before Server::handle_slave_rename_prep() is called.
The fix is to freeze and auth pin the source inode at the same time.

This patch introduces CInode::freeze_auth_pin(), which waits for all
other MDRequests to release their auth pins and then changes the inode
to the FROZENAUTHPIN state. This state prevents other MDRequests from
getting new auth pins.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: use rdlock_try() when checking NULL dentry
Yan, Zheng [Mon, 19 Nov 2012 02:43:48 +0000 (10:43 +0800)]
mds: use rdlock_try() when checking NULL dentry

Use rdlock_try() instead of can_read() when path_traverse encounters
a NULL dentry. This partly avoids waiting forever for the
dentry to become readable when the dentry is a replica.

Strictly speaking, using rdlock_try() is still not enough, because the
auth MDS may drop the REQRDLOCK message in some cases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: allow open_remote_ino() to open xlocked dentry
Yan, Zheng [Mon, 19 Nov 2012 02:43:47 +0000 (10:43 +0800)]
mds: allow open_remote_ino() to open xlocked dentry

discover_ino() has a parameter want_xlocked. The parameter indicates
whether the remote discover handler can proceed when an xlocked dentry
is encountered. open_remote_ino() uses discover_ino() to find a non-auth
inode, but always sets 'want_xlocked' to false. This may cause a
deadlock in some corner cases. For example:

We rename an inode's primary dentry to one of its remote dentries and
send a slave request to one witness MDS, but before the slave request
reaches the witness MDS, the inode is trimmed from the witness MDS'
cache. Then when the slave request arrives, open_remote_ino() will
be called while traversing the destpath. open_remote_ino() calls
discover_ino() with 'want_xlocked=false' to find the inode.
discover_ino() sends an MDiscover message to the inode's authority MDS.
The handler of the MDiscover message finds the inode's primary dentry
is xlocked and it sleeps.

The fix is to add a parameter 'want_xlocked' to open_remote_ino() and
make open_remote_ino() pass the parameter to discover_ino().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix assertion in handle_cache_expire
Yan, Zheng [Mon, 19 Nov 2012 02:43:45 +0000 (10:43 +0800)]
mds: fix assertion in handle_cache_expire

During export, it's possible to get cache expire messages in
DISCOVERING, FREEZING and PREPPING state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix open_remote_inode race
Yan, Zheng [Mon, 19 Nov 2012 02:43:44 +0000 (10:43 +0800)]
mds: fix open_remote_inode race

discover_ino() may return -ENOENT if it races with other FS activities,
so use C_MDC_RetryOpenRemoteIno instead of C_MDC_OpenRemoteIno as the
onfinish callback.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: consider revoking caps in imported caps as issued
Yan, Zheng [Mon, 19 Nov 2012 02:43:43 +0000 (10:43 +0800)]
mds: consider revoking caps in imported caps as issued

The clients may have already sent the caps release message to the
exporting MDS, in which case the importing MDS waits for the release
message forever. Considering revoking caps as issued avoids this issue.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: drop locks if requiring auth pinning new objects.
Yan, Zheng [Mon, 19 Nov 2012 02:43:42 +0000 (10:43 +0800)]
mds: drop locks if requiring auth pinning new objects.

Locker::acquire_locks() skips auth pinning a replica object if we only
request a rdlock and the lock is read-lockable. To get all locks,
we may call Locker::acquire_locks() several times, and locks in replica
objects may become not read-lockable between calls. So it is possible
that we need to auth pin new objects after already taking some locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't forward client request from MDS
Yan, Zheng [Mon, 19 Nov 2012 02:43:40 +0000 (10:43 +0800)]
mds: don't forward client request from MDS

Forwarding a client request that came from an MDS will trigger an
assertion in MDS::forward_message_mds(). An MDS only sends client
requests for stray migration/reintegration, so it's safe to drop them.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: call eval() after caps are exported
Yan, Zheng [Mon, 19 Nov 2012 02:43:39 +0000 (10:43 +0800)]
mds: call eval() after caps are exported

For an inode that just changed authority, the new auth MDS may want to
change a lock in the inode from the 'sync' to the 'lock' state before
caps are exported. The lock in the replica can stay in the 'sync->lock'
state because client caps prevent it from transitioning to the 'lock'
state. So we should call eval() after clearing client caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: clear lock flushed if replica is waiting for AC_LOCKFLUSHED
Yan, Zheng [Mon, 19 Nov 2012 02:43:38 +0000 (10:43 +0800)]
mds: clear lock flushed if replica is waiting for AC_LOCKFLUSHED

This way eval_gather() will not skip calling scatter_writebehind();
otherwise the replica lock may stay in the flushing state forever.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: Don't acquire replica object's versionlock
Yan, Zheng [Mon, 19 Nov 2012 02:43:37 +0000 (10:43 +0800)]
mds: Don't acquire replica object's versionlock

Both CInode and CDentry's versionlocks are of type LocalLock.
Acquiring a LocalLock in a replica object is useless and problematic.
For example, if two requests try acquiring a replica object's
versionlock, the first request succeeds and the second request
is added to the wait queue. Later, when the first request finishes,
MDCache::request_drop_foreign_locks() finds the lock's parent is
non-auth and skips waking requests in the wait queue. So the
second request hangs.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: allow try_eval to eval unstable locks in freezing object
Yan, Zheng [Mon, 19 Nov 2012 02:43:36 +0000 (10:43 +0800)]
mds: allow try_eval to eval unstable locks in freezing object

Unstable locks hold auth_pins on the object, which prevents the freezing
object from becoming frozen and then unfreezing. So try_eval() should
not wait for a freezing object.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomsg/Pipe: flush delayed messages when stealing/failing pipes
Sage Weil [Sat, 1 Dec 2012 04:23:52 +0000 (20:23 -0800)]
msg/Pipe: flush delayed messages when stealing/failing pipes

If we are failing a pipe, flush the incoming messages before we try to
reconnect.  Similarly, flush queued messages on an existing pipe before we
replace it.  This ensures that when we get a socket failure and reconnect
the delayed messages are handled in the normal fashion.

Specifically, it fixes a situation like:

 - read msg, update in_seq etc.
 - delay msg
 - pipe faults
 - peer reconnects, we replace existing pipe, discard delayed msgs
 - peer resends msgs
 - we discard, because they are < in_seq

Signed-off-by: Sage Weil <sage@inktank.com>
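
The failure sequence listed above can be reproduced with a toy model. `Receiver` and its members are hypothetical and stand in for the receiving side of a Pipe; only in_seq is a name from the message.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical receiver side of a connection; not the real Pipe class.
struct Receiver {
  uint64_t in_seq = 0;               // highest sequence number read so far
  std::deque<uint64_t> delayed;      // read, but delivery artificially delayed
  std::vector<uint64_t> delivered;   // handed to the dispatcher

  // Read a message off the wire. Anything at or below in_seq is treated as
  // a resend of something already read, so it is dropped.
  void read_msg(uint64_t seq) {
    if (seq <= in_seq) return;
    in_seq = seq;
    delayed.push_back(seq);
  }

  // Deliver everything currently sitting in the delay queue.
  void flush_delayed() {
    for (uint64_t s : delayed) delivered.push_back(s);
    delayed.clear();
  }

  // Pipe fault or replacement. Without the flush, delayed messages are
  // discarded even though in_seq already counts them as received.
  void on_fault(bool flush_first) {
    if (flush_first) flush_delayed();
    delayed.clear();
  }
};
```

With `flush_first` false, the peer's resends are dropped as duplicates and the messages are never delivered; with the flush they are delivered exactly once.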
12 years agorbd: report striping as a feature in rbd info
Dan Mick [Sat, 1 Dec 2012 06:46:36 +0000 (22:46 -0800)]
rbd: report striping as a feature in rbd info

Fixes: #3549
Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agoceph-osd: put g_ceph_context before exit
Samuel Just [Sat, 1 Dec 2012 01:57:35 +0000 (17:57 -0800)]
ceph-osd: put g_ceph_context before exit

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoReplicatedPG: only increment active_scrub on primary for final push
Samuel Just [Fri, 30 Nov 2012 22:04:53 +0000 (14:04 -0800)]
ReplicatedPG: only increment active_scrub on primary for final push

We only queue the _applied_recovered_object callback on the primary for the
final push.  It is this callback which decrements active_pushes.  It's ok to
not increment active_pushes for the intermediate pushes since these only affect
a temp file.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-osd-msgr'
Sage Weil [Fri, 30 Nov 2012 20:12:23 +0000 (12:12 -0800)]
Merge remote-tracking branch 'gh/wip-osd-msgr'

12 years agoOSDService: make messengers private
Samuel Just [Fri, 30 Nov 2012 19:20:41 +0000 (11:20 -0800)]
OSDService: make messengers private

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd/: make OSDService messenger helpers return ConnectionRef
Samuel Just [Fri, 30 Nov 2012 19:08:55 +0000 (11:08 -0800)]
osd/: make OSDService messenger helpers return ConnectionRef

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agomon: PaxosService: cancel proposal timer after election
Joao Eduardo Luis [Fri, 30 Nov 2012 17:16:35 +0000 (17:16 +0000)]
mon: PaxosService: cancel proposal timer after election

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-mds-ls2'
Sage Weil [Fri, 30 Nov 2012 16:26:25 +0000 (08:26 -0800)]
Merge remote-tracking branch 'gh/wip-mds-ls2'

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agodoc: update kernel recs
Sage Weil [Fri, 30 Nov 2012 01:28:36 +0000 (17:28 -0800)]
doc: update kernel recs

Mention which stable kernels we recommend.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agologrotate: fix rotation
David Zafman [Fri, 30 Nov 2012 02:07:20 +0000 (18:07 -0800)]
logrotate: fix rotation

Fixes: #3554
Always reload with Upstart because in some configs the init.d script doesn't work

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agomds: assert segments not empty in get_current_segment()
Sage Weil [Thu, 29 Nov 2012 05:21:15 +0000 (21:21 -0800)]
mds: assert segments not empty in get_current_segment()

Only one caller can tolerate no segments; make a new
peek_current_segment() for them.

Motivated by paranoia tracking down a crash during client unmount, but
it wasn't this.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: be explicit about MDRequest killed state
Sage Weil [Thu, 29 Nov 2012 05:19:37 +0000 (21:19 -0800)]
mds: be explicit about MDRequest killed state

Set the killed flag and use that instead of inferring things from
the session xlist.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: drop redundant mdr->committing = true
Sage Weil [Thu, 29 Nov 2012 05:16:31 +0000 (21:16 -0800)]
mds: drop redundant mdr->committing = true

journal_and_reply() does this.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: fix request_kill()
Sage Weil [Thu, 29 Nov 2012 05:19:01 +0000 (21:19 -0800)]
mds: fix request_kill()

Only request_cleanup() if the request isn't already committing.  If it
is, wait for it to commit before we clean up.

It might fix all of #3531, #3210, #1947, and #1548.  Maybe.

Signed-off-by: Sage Weil <sage@inktank.com>
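
One way to picture the rule (clean up immediately unless the request is committing, otherwise defer until the commit finishes) is the sketch below. The `Request` struct and `on_commit_finished` are hypothetical; the `killed` and `committing` flags and the function names request_kill() and request_cleanup() follow the messages above, but the bodies are assumptions.

```cpp
#include <cassert>

// Hypothetical, reduced request state.
struct Request {
  bool committing = false;   // journal commit in progress
  bool killed = false;       // explicit flag set by request_kill()
  bool cleaned_up = false;
};

void request_cleanup(Request& r) { r.cleaned_up = true; }

// Kill a request: record the fact explicitly, and only clean up right away
// if no commit is in flight.
void request_kill(Request& r) {
  r.killed = true;
  if (!r.committing) request_cleanup(r);
}

// Called when the commit completes: a request killed in the meantime is
// cleaned up now.
void on_commit_finished(Request& r) {
  r.committing = false;
  if (r.killed) request_cleanup(r);
}
```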
12 years agoRevert "osd: fix leak of heartbeat con on reset"
Sage Weil [Fri, 30 Nov 2012 01:03:30 +0000 (17:03 -0800)]
Revert "osd: fix leak of heartbeat con on reset"

This reverts commit b31a99abda75b9170a5805b02944a0c0c78245b7.

12 years agoclient: only dump cache on umount if we time out
Sage Weil [Fri, 30 Nov 2012 00:45:52 +0000 (16:45 -0800)]
client: only dump cache on umount if we time out

We don't want to dump the cache every time an item is trimmed and the
mount_cond gets signaled; this can make umount crazy-slow when logging is
turned up.

Instead, only dump if we wait 5 seconds without making any progress on
shrinking the cache.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: release dispatch throttle on delayed queue discard
Sage Weil [Thu, 29 Nov 2012 18:49:47 +0000 (10:49 -0800)]
msg/Pipe: release dispatch throttle on delayed queue discard

This avoids leaking into the throttle and deadlocking.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: start delay thread *after* we know peer type
Sage Weil [Tue, 27 Nov 2012 23:58:09 +0000 (15:58 -0800)]
msg/Pipe: start delay thread *after* we know peer type

At end of connect(), or end of accept().

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: drop queue helpers
Sage Weil [Tue, 27 Nov 2012 23:27:18 +0000 (15:27 -0800)]
msg/Pipe: drop queue helpers

There is a single caller; these only obfuscate.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: refactor msgr delays
Sage Weil [Tue, 27 Nov 2012 23:36:11 +0000 (15:36 -0800)]
msg/Pipe: refactor msgr delays

- move all delay state into a single class
- create thread once and only once per Pipe
- adjust debug levels
- discard messages at the appropriate times

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: add a delay_until queue that is used to delay deliveries.
Greg Farnum [Tue, 27 Nov 2012 18:05:47 +0000 (10:05 -0800)]
msgr: add a delay_until queue that is used to delay deliveries.

Its life-cycle matches that of delay_queue, and the delayed_delivery
function respects it. For now queue_received is just setting it to
delay everything by 1 second.

Signed-off-by: Greg Farnum <greg@inktank.com>
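
A small sketch of a queue that pairs each message with the time it may be delivered, assuming a fixed one-second delay as in the message. `DelayQueue`, `pop_ready` and `discard` are hypothetical names; the real delay_queue and delay_until members live inside the Pipe and are not reproduced here.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical delay queue: each entry holds the earliest delivery time
// and the message itself.
class DelayQueue {
  std::deque<std::pair<Clock::time_point, std::string>> q_;

public:
  // Queue a received message for delivery `delay` after `now`.
  void queue_received(std::string msg, Clock::time_point now,
                      std::chrono::milliseconds delay = std::chrono::seconds(1)) {
    q_.emplace_back(now + delay, std::move(msg));
  }

  // Return, in arrival order, every message whose delay has expired.
  std::vector<std::string> pop_ready(Clock::time_point now) {
    std::vector<std::string> out;
    while (!q_.empty() && q_.front().first <= now) {
      out.push_back(std::move(q_.front().second));
      q_.pop_front();
    }
    return out;
  }

  // Drop everything still waiting, e.g. when the connection is stopped.
  void discard() { q_.clear(); }
};
```

Passing the current time in as a parameter keeps the sketch deterministic; a real delivery thread would read the clock itself and sleep until the front entry is due.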
12 years agomsgr: clear out the delay queue when stop()ing
Greg Farnum [Tue, 27 Nov 2012 17:44:19 +0000 (09:44 -0800)]
msgr: clear out the delay queue when stop()ing

After some brief thought, I believe deleting any messages in the
delay queue is correct -- we are trying to simulate line delays
in delivery and so anything still in the queue has supposedly
not arrived yet. So delete them when we stop the Pipe for
any reason.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomsgr: move the delay queue initialization into start_reader
Greg Farnum [Tue, 27 Nov 2012 19:02:07 +0000 (11:02 -0800)]
msgr: move the delay queue initialization into start_reader

The Pipe doesn't know the peer type in the constructor. It
doesn't always know in start_reader either, so this needs more work,
but at least it knows more frequently than it did.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomessenger: add the shell of a system to delay incoming Message delivery
Greg Farnum [Wed, 21 Nov 2012 18:54:06 +0000 (10:54 -0800)]
messenger: add the shell of a system to delay incoming Message delivery

When ms_inject_delay_type matches that of the incoming Connection,
the Pipe sets up a delay queue that it shuttles all Messages through.
This lets us check cleanup and some notification code but doesn't
actually generate any delays.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agorgw: treat lack of swift token as anonymous user access
Yehuda Sadeh [Fri, 30 Nov 2012 00:04:41 +0000 (16:04 -0800)]
rgw: treat lack of swift token as anonymous user access

Fixes: #3534
If a swift token hasn't been provided, set user as anonymous.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoMerge branch 'next'
Sage Weil [Thu, 29 Nov 2012 23:48:54 +0000 (15:48 -0800)]
Merge branch 'next'

Conflicts:
src/rgw/rgw_admin.cc

12 years agoMerge remote-tracking branch 'gh/wip_next_bugs' into next
Sage Weil [Thu, 29 Nov 2012 23:47:26 +0000 (15:47 -0800)]
Merge remote-tracking branch 'gh/wip_next_bugs' into next

12 years agoMerge remote-tracking branch 'gh/wip-mon-osd-create-fix' into next
Sage Weil [Thu, 29 Nov 2012 23:34:32 +0000 (15:34 -0800)]
Merge remote-tracking branch 'gh/wip-mon-osd-create-fix' into next

12 years agoradosgw-admin: close storage before exit
Yehuda Sadeh [Thu, 29 Nov 2012 23:30:17 +0000 (15:30 -0800)]
radosgw-admin: close storage before exit

Fixes: #3560
This will remove watches off notification objects.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoosd: move next_osdmap under separate lock
Sage Weil [Thu, 29 Nov 2012 22:16:16 +0000 (14:16 -0800)]
osd: move next_osdmap under separate lock

It doesn't actually interfere with publish_lock, and the current osdmap
ref.

Document what is going on.

Always precede publish_map() with one or more pre_publish_map() calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix leak of heartbeat con on reset
Sage Weil [Thu, 29 Nov 2012 22:05:06 +0000 (14:05 -0800)]
osd: fix leak of heartbeat con on reset

If we replace our old con, drop the reference.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: use safe con helpers for scrub
Sage Weil [Thu, 29 Nov 2012 20:36:01 +0000 (12:36 -0800)]
osd: use safe con helpers for scrub

Note that if we don't get a con our behavior largely does not matter, since
we know we are about to get a Reset event anyway.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: use safe con helpers from do_{infos,queries,notifies}
Sage Weil [Thu, 29 Nov 2012 19:34:28 +0000 (11:34 -0800)]
osd: use safe con helpers from do_{infos,queries,notifies}

Ensure we don't reopen connections to downloads.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: make _share_map_outgoing() use a Connection
Sage Weil [Thu, 29 Nov 2012 19:13:38 +0000 (11:13 -0800)]
osd: make _share_map_outgoing() use a Connection

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoclient: Fix for #3490 and config option to test
Sam Lang [Thu, 29 Nov 2012 18:19:51 +0000 (12:19 -0600)]
client: Fix for #3490 and config option to test

If the mds revokes our cache cap, and we follow
the _read_sync() path, on a zero-byte file the
osd returns ENOENT.  We need to replace ENOENT
with a return of 0 in this case.
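
The translation described above can be sketched as follows. This is a minimal illustration, not the actual Client code: the helper name and its parameters are hypothetical, and checking the known file size before masking ENOENT is an assumption of this sketch.

```cpp
#include <cerrno>
#include <cstdint>

// Hypothetical helper, not Ceph's Client code: map the result of a
// synchronous object read to the value the file read should return.
// osd_result is a byte count on success or a negative errno on failure.
// file_size is the size the client knows from its metadata (assumed input).
constexpr int64_t fixup_sync_read_result(int64_t osd_result,
                                         uint64_t file_size) {
  // A zero-byte file has no backing object, so the OSD reports ENOENT.
  // For the caller that is simply "read 0 bytes", not an error.
  return (osd_result == -ENOENT && file_size == 0) ? 0 : osd_result;
}
```

Other errors, and ENOENT on a file that is not empty, are passed through unchanged in this sketch.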

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agotest/libcephfs: Test reading an empty file
Sam Lang [Thu, 29 Nov 2012 18:14:19 +0000 (12:14 -0600)]
test/libcephfs: Test reading an empty file

This tests a bug (#3490) in the Client::_read_sync
codepath, and should be run with conf->client_read_sync_always
set to true.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoPG: scrubber.end should be exactly a boundary
Samuel Just [Wed, 28 Nov 2012 23:10:43 +0000 (15:10 -0800)]
PG: scrubber.end should be exactly a boundary

Let scrubber.end be (foo, HEAD, 10), where foo is the oid, HEAD is the
snap, and 10 is the hash, and similarly let scrubber.begin be (bar, 5, 1).

After choosing to scan [(bar, 5, 1), (foo, HEAD, 10)), we block writes
on that interval.

1) A write might then come in for foo (which isn't blocked) which
creates a new snap (foo, 400, 10) which happens to fall in the interval.
This will result in a crash in _scrub() when it attempts to compare
clones since it will get (foo, 400, 10) but not the head object
(foo, HEAD, 10).

2) Alternately, the write from 1) has already happened.  When we scan
the log, we find 34'10 and 34'11 are the clone operation creating
(foo, 400, 10) and the modify on (foo, HEAD, 10) respectively.  Both
primary and replica will wait for last_update_applied to be 34'10
before scanning, but last_update_applied will in fact skip to 34'11
since 34'10 and 34'11 happened in the same transaction.  This can
result in IO hanging on the scrubber interval.

Instead, we ensure that scrubber.end is exactly a hash boundary
(min hobject_t a with the specified hash).  No such object can
exist since we don't create objects with empty oids, so no writes
can occur on that object.
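
To illustrate why a hash boundary avoids case 1), here is a small sketch. ObjKey is a simplified stand-in for hobject_t, and the sort order (hash, then oid, then snap), the kHead constant and the helper names are all assumptions made for the example, not Ceph's definitions.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Simplified stand-in for hobject_t; the real type has more fields.
struct ObjKey {
  uint32_t hash;
  std::string oid;
  uint64_t snap;
};

// Stand-in for the HEAD snap id; assumed to sort after every clone.
const uint64_t kHead = UINT64_MAX;

// Assumed sort order: hash first, then object name, then snap.
bool operator<(const ObjKey& a, const ObjKey& b) {
  return std::tie(a.hash, a.oid, a.snap) < std::tie(b.hash, b.oid, b.snap);
}

// Smallest possible key with the given hash. No real object has an empty
// oid, so nothing can ever be written at exactly this key.
ObjKey hash_boundary(uint32_t hash) {
  return ObjKey{hash, "", 0};
}

// True if obj falls in the half-open scrub interval [begin, end).
bool in_scrub_range(const ObjKey& obj, const ObjKey& begin,
                    const ObjKey& end) {
  return !(obj < begin) && obj < end;
}
```

With end = (foo, HEAD, 10), a new clone (foo, 400, 10) sorts inside the interval while its head does not. With end = hash_boundary(10), the head and every clone of foo fall on the same side of the boundary.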

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoReplicatedPG: remove from snap_collections even without objects to trim
Samuel Just [Thu, 15 Nov 2012 21:35:47 +0000 (13:35 -0800)]
ReplicatedPG: remove from snap_collections even without objects to trim

Also, make sure to write_info after updating snap_collections.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: get_or_create_pg return null if pool is gone
Samuel Just [Thu, 29 Nov 2012 19:28:25 +0000 (11:28 -0800)]
OSD: get_or_create_pg return null if pool is gone

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: history.last_epoch_started should start at 0
Samuel Just [Wed, 28 Nov 2012 00:00:03 +0000 (16:00 -0800)]
OSD: history.last_epoch_started should start at 0

history.last_epoch_started marks a lower bound on the last epoch at
which the pg went active.  As with info.last_epoch_started, it should be
0 prior to the first activation.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: maintain osd local last_epoch_started for find_best_info
Samuel Just [Wed, 21 Nov 2012 21:59:22 +0000 (13:59 -0800)]
PG: maintain osd local last_epoch_started for find_best_info

In order to proceed with peering, we need an osd with a log including
the last commit sent to a client.  This translates to the oldest
last_update from the infos of the most recent acting set to go active.
history.last_epoch_started gives us a lower bound on the last time the
entire acting set persisted authoritative logs/infos.  However, it
doesn't indicate anything about the info/log on the osd which sent it.
Thus, we will maintain an osd local info.last_epoch_started to determine
which osds were actually active (and thus have the required log
entries).  The max info.last_epoch_started in the prior set gives us an
upper bound on the last interval during which writes occurred.  The min
last_update among the infos with that last_epoch_started must therefore
be an upper bound on the oldest operation which clients consider
committed.  Any osd with an info.last_update past that version must be
sufficient.
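
The selection rule above can be sketched roughly as follows. PeerInfo is a simplified stand-in for the pg info (last_update is flattened to a single integer here), and the function names are hypothetical; this is not the find_best_info implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for a pg info. In Ceph, last_update is an
// (epoch, version) pair rather than one integer.
struct PeerInfo {
  int osd;
  uint32_t last_epoch_started;  // osd-local: last epoch this osd went active
  uint64_t last_update;         // newest log entry this osd has
};

// Smallest last_update among the peers that share the highest
// last_epoch_started. Returns 0 for an empty input.
uint64_t min_last_update_acceptable(const std::vector<PeerInfo>& infos) {
  uint32_t max_les = 0;
  for (const PeerInfo& i : infos)
    max_les = std::max(max_les, i.last_epoch_started);

  bool found = false;
  uint64_t min_lu = 0;
  for (const PeerInfo& i : infos) {
    if (i.last_epoch_started != max_les) continue;
    if (!found || i.last_update < min_lu) {
      min_lu = i.last_update;
      found = true;
    }
  }
  return min_lu;
}

// A peer can serve as best_info if its log reaches at least that far.
bool is_candidate(const PeerInfo& p, uint64_t min_acceptable) {
  return p.last_update >= min_acceptable;
}
```

Only peers that were active in the most recent interval contribute to the bound, so an osd with a stale last_epoch_started cannot lower it.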

The observed bug was that an empty pg info with a
last_epoch_started at the most recent interval pushed
min_last_update_acceptable to eversion_t().  There were two down osds,
but peering proceeded since the backfill peer did survive.  However,
its info was later disregarded because it was incomplete.  An empty osd was
then chosen as the best_info since its last_update was equal to
min_last_update_acceptable.  This caused the contents of the pg to be
lost.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agohobject_t: make max private
Samuel Just [Thu, 29 Nov 2012 21:51:41 +0000 (13:51 -0800)]
hobject_t: make max private

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoScript to install and configure radosgw.
tamil [Thu, 29 Nov 2012 21:46:43 +0000 (13:46 -0800)]
Script to install and configure radosgw.

Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
12 years agoMerge branch 'wip-mon-store-errorcheck' into next
Greg Farnum [Thu, 29 Nov 2012 21:23:02 +0000 (13:23 -0800)]
Merge branch 'wip-mon-store-errorcheck' into next

Reviewed-by: Joao Luis <joao.luis@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-rgw-leak' into next
Yehuda Sadeh [Thu, 29 Nov 2012 21:07:09 +0000 (13:07 -0800)]
Merge remote-tracking branch 'origin/wip-rgw-leak' into next

Conflicts:
src/rgw/rgw_main.cc

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agomon: Monitor: don't allow '+' or '-' prefixed values on parse_pos_long()
Joao Eduardo Luis [Thu, 29 Nov 2012 21:03:05 +0000 (21:03 +0000)]
mon: Monitor: don't allow '+' or '-' prefixed values on parse_pos_long()
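
A minimal sketch of such a check, assuming the function returns the parsed value or a negative errno. The name parse_pos_long_sketch and its exact behavior are illustrative and are not the Monitor's code.

```cpp
#include <cerrno>
#include <cstdlib>

// Hypothetical re-creation, not the Monitor's code. Returns the parsed
// non-negative value, or -EINVAL if the input is not purely decimal
// digits that fit in a long.
long parse_pos_long_sketch(const char* s) {
  if (s == nullptr || *s < '0' || *s > '9') {
    // strtol() itself accepts leading whitespace and a sign, so inputs
    // such as "", "+5", "-5" and " 5" are rejected before calling it.
    return -EINVAL;
  }
  errno = 0;
  char* end = nullptr;
  long v = std::strtol(s, &end, 10);
  if (errno == ERANGE || *end != '\0') {
    return -EINVAL;  // overflow or trailing garbage
  }
  return v;
}
```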

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agomon: OSDMonitor: return -EINVAL on not-a-uuid during 'osd create'
Joao Eduardo Luis [Thu, 29 Nov 2012 16:42:16 +0000 (16:42 +0000)]
mon: OSDMonitor: return -EINVAL on not-a-uuid during 'osd create'

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoosd: fix Connection leaks
Sage Weil [Thu, 29 Nov 2012 19:11:17 +0000 (11:11 -0800)]
osd: fix Connection leaks

Messenger::get_connection() returns a reference.  Put it.
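
The get/put pairing can be sketched like this. Conn, get_connection() and ConnRef below are simplified stand-ins written for this example, not the Messenger or Connection classes; the RAII guard is one possible way to make sure the put always happens.

```cpp
#include <cassert>

// Minimal stand-in for a reference-counted connection; not Ceph's class.
struct Conn {
  int nref = 1;  // the creator holds the first reference
  Conn* get() { ++nref; return this; }
  void put() { assert(nref > 0); --nref; }  // a real one would free at 0
};

// Stand-in for a lookup that hands the caller its own reference.
Conn* get_connection(Conn& c) { return c.get(); }

// RAII guard so that the reference is dropped on every exit path.
class ConnRef {
 public:
  explicit ConnRef(Conn* c) : c_(c) {}
  ~ConnRef() { if (c_) c_->put(); }
  ConnRef(const ConnRef&) = delete;
  ConnRef& operator=(const ConnRef&) = delete;
  Conn* operator->() const { return c_; }
 private:
  Conn* c_;
};

// Uses the connection briefly; the guard puts the reference on return.
int use_connection(Conn& c) {
  ConnRef ref(get_connection(c));
  return ref->nref;  // count observed while our reference is held
}
```

A caller that takes the raw pointer from get_connection() and never calls put() leaves the count one too high, which is the kind of leak the commit describes.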

Signed-off-by: Sage Weil <sage@inktank.com>