git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agoos/FileJournal: logger is optional
Sage Weil [Fri, 28 Dec 2012 23:44:51 +0000 (15:44 -0800)]
os/FileJournal: logger is optional

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolog: broadcast cond signals
Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]
log: broadcast cond signals

We were using a single cond, and only signalling one waiter.  That means
that if the flusher and several logging threads are waiting, and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.

Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.

Instead, break the single cond into two: one for loggers, and one for the
flusher.  Always signal the (one) flusher, and always broadcast to all
loggers.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
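
The two-condition scheme described in the log commit above can be sketched roughly as follows. This is a minimal illustration, not the Ceph log code: the class name BoundedLog, its members, and the batch-taking interface are all assumptions made for the example. The point is only that the flusher has its own condition variable and that loggers are always woken with a broadcast.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bounded log queue: many logger threads, exactly one flusher.
class BoundedLog {
 public:
  explicit BoundedLog(std::size_t max_entries) : max_(max_entries) {}

  // Called by logger threads; blocks while the queue is at its limit.
  void submit(std::string entry) {
    std::unique_lock<std::mutex> l(lock_);
    while (queue_.size() >= max_) {
      cond_flusher_.notify_one();  // the waiter on this cond is always the flusher
      cond_loggers_.wait(l);
    }
    queue_.push_back(std::move(entry));
    cond_flusher_.notify_one();
  }

  // Called by the flusher; returns an empty batch only after stop()
  // has been called and the queue has been drained.
  std::vector<std::string> take_batch() {
    std::unique_lock<std::mutex> l(lock_);
    while (queue_.empty() && !stopping_)
      cond_flusher_.wait(l);
    std::vector<std::string> batch(queue_.begin(), queue_.end());
    queue_.clear();
    cond_loggers_.notify_all();  // broadcast: every blocked logger re-checks
    return batch;
  }

  void stop() {
    std::lock_guard<std::mutex> l(lock_);
    stopping_ = true;
    cond_flusher_.notify_one();
  }

 private:
  std::mutex lock_;
  std::condition_variable cond_loggers_;  // waited on by loggers only
  std::condition_variable cond_flusher_;  // waited on by the flusher only
  std::deque<std::string> queue_;
  std::size_t max_;
  bool stopping_ = false;
};
```

With a single shared condition variable and notify_one(), a wakeup meant for the flusher could land on another blocked logger, which is the deadlock the commit message describes.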
12 years agoMerge remote-tracking branch 'origin/wip-gl-docs'
Gary Lowell [Fri, 28 Dec 2012 22:15:37 +0000 (14:15 -0800)]
Merge remote-tracking branch 'origin/wip-gl-docs'

Update release process documentation.

12 years agodocs: fix typo in release-process doc
Gary Lowell [Fri, 28 Dec 2012 22:05:56 +0000 (14:05 -0800)]
docs:  fix typo in release-process doc

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoceph-fuse: Avoid doing handle cleanup in dtor
Sam Lang [Fri, 28 Dec 2012 19:58:39 +0000 (13:58 -0600)]
ceph-fuse: Avoid doing handle cleanup in dtor

The CephFuse::Handle class needs the client
pointer to be valid for finalizing, so don't finalize
in the destructor (which doesn't get called till the
fuse handle leaves scope), instead use a finalize method that
gets called explicitly before the client pointer is freed.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoceph-fuse: Pass client handle as userdata
Sam Lang [Fri, 28 Dec 2012 19:10:04 +0000 (13:10 -0600)]
ceph-fuse:  Pass client handle as userdata

The fuse lowlevel API isn't getting the client
handle when it gets initialized, resulting
in a null pointer for all the subsequent calls.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agodoc: warn about using caching without QEMU knowing
Josh Durgin [Wed, 26 Dec 2012 22:25:51 +0000 (14:25 -0800)]
doc: warn about using caching without QEMU knowing

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agofeatures is uint64_t
Sage Weil [Fri, 28 Dec 2012 00:38:45 +0000 (16:38 -0800)]
features is uint64_t

This won't bite us for a while yet (we're on bit 26), but it will soon!

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
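
As a small illustration of why the width of the features field matters, the sketch below compares a 64-bit mask with the same mask narrowed to 32 bits. The helper names are hypothetical and bit 40 is an arbitrary example; only bit 26 comes from the commit message.

```cpp
#include <cstdint>

// Build a single-bit feature mask; the shift is done in 64 bits.
constexpr std::uint64_t feature_bit(unsigned n) {
  return std::uint64_t{1} << n;
}

// Test a feature against the full 64-bit mask.
inline bool has_feature(std::uint64_t features, unsigned bit) {
  return (features & feature_bit(bit)) != 0;
}

// Same test after the mask has passed through a 32-bit variable:
// every feature at bit 32 or above is silently lost.
inline bool has_feature_narrowed(std::uint64_t features, unsigned bit) {
  const std::uint32_t narrowed = static_cast<std::uint32_t>(features);
  return bit < 32 && ((narrowed >> bit) & 1u) != 0;
}
```

Features at or below bit 31 behave the same either way, which is why the problem stays hidden until the bit numbers grow past 31.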
12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Fri, 28 Dec 2012 01:15:29 +0000 (17:15 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agoceph-fuse: Split main into init/main/finalize
Sam Lang [Thu, 6 Dec 2012 05:21:12 +0000 (23:21 -0600)]
ceph-fuse:  Split main into init/main/finalize

With the invalidate callback enabled for fuse, the Client::unmount
call requires the fuse channel and session objects remain for performing
the invalidate callbacks.  This patch splits the ceph_fuse_ll_main
call into init, main, and finalize functions, so finalization of the
channel and session objects can be done after the unmount completes.

The patch includes cleanup for the code in fuse_ll.cc to make it more
in the style of C++ and make use of the pimpl idiom to hide the fuse
structures within the CephFuse::Handle pimpl class.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
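
The shape of the init/main/finalize split with a pimpl class might look like the sketch below. It is not the code from fuse_ll.cc: the class here only borrows the idea of a handle with a hidden Impl, makes no FUSE calls, and records its steps in a trace vector purely so the ordering is visible.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical handle; the FUSE channel and session would live in Impl.
class Handle {
 public:
  Handle();
  ~Handle();         // defined below, where Impl is a complete type
  int init();        // set up channel/session state
  int loop();        // run the main loop
  void finalize();   // explicit teardown, called after unmount completes
  const std::vector<std::string>& trace() const;

 private:
  struct Impl;       // private details are not visible to callers
  std::unique_ptr<Impl> impl_;
};

struct Handle::Impl {
  bool session_open = false;
  std::vector<std::string> trace;
};

Handle::Handle() : impl_(new Impl) {}
Handle::~Handle() = default;

int Handle::init() {
  impl_->session_open = true;
  impl_->trace.push_back("init");
  return 0;
}

int Handle::loop() {
  if (!impl_->session_open)
    return -1;       // cannot run without an open session
  impl_->trace.push_back("loop");
  return 0;
}

void Handle::finalize() {
  if (impl_->session_open) {
    impl_->session_open = false;
    impl_->trace.push_back("finalize");
  }
}

const std::vector<std::string>& Handle::trace() const {
  return impl_->trace;
}
```

Because teardown is a separate call, the caller decides when it runs, for example only after an unmount that still needs the session for callbacks.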
12 years agojava: remove deprecated libcephfs
Noah Watkins [Thu, 27 Dec 2012 20:06:02 +0000 (12:06 -0800)]
java: remove deprecated libcephfs

Removes ceph_set_default_*

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years agoinit-ceph: fix status version check across machines
Sage Weil [Fri, 28 Dec 2012 00:06:24 +0000 (16:06 -0800)]
init-ceph: fix status version check across machines

The local state isn't propagated into the backtick shell, resulting in
'unknown' for all remote daemons.  Avoid backticks altogether.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodocs: update release process documentation.
Gary Lowell [Thu, 27 Dec 2012 23:39:46 +0000 (15:39 -0800)]
docs:  update release process documentation.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-mds'
Sage Weil [Thu, 27 Dec 2012 21:40:01 +0000 (13:40 -0800)]
Merge remote-tracking branch 'gh/wip-mds'

12 years agoosd: fix recovery assert for pg repair case
Sage Weil [Wed, 26 Dec 2012 23:27:07 +0000 (15:27 -0800)]
osd: fix recovery assert for pg repair case

In the case of PG repair, this assert is not valid.  Disable it for now.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-osd-flags'
Sage Weil [Thu, 27 Dec 2012 21:09:24 +0000 (13:09 -0800)]
Merge branch 'wip-osd-flags'

12 years agoMerge remote-tracking branch 'gh/wip-mds-pool'
Sage Weil [Thu, 27 Dec 2012 21:07:57 +0000 (13:07 -0800)]
Merge remote-tracking branch 'gh/wip-mds-pool'

Reviewed-by: Sam Lang <sam.lang@inktank.com>
12 years agoosd: only calculate OpRequest rmw flags once
Sage Weil [Fri, 7 Dec 2012 21:28:55 +0000 (13:28 -0800)]
osd: only calculate OpRequest rmw flags once

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomessages/MOSDOpReply: remove misleading may_read/may_write
Sage Weil [Fri, 7 Dec 2012 21:18:50 +0000 (13:18 -0800)]
messages/MOSDOpReply: remove misleading may_read/may_write

These are OpRequest properties, calculated/enforced at the OSD.  They don't
belong in the MOSDOp or MOSDOpReply messages.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: move rmw_flags to OpRequest, out of MOSDOp
Sage Weil [Fri, 7 Dec 2012 21:14:26 +0000 (13:14 -0800)]
osd: move rmw_flags to OpRequest, out of MOSDOp

It was very sloppy to put a server-side processing state inside the
message.  Move it to the OpRequestRef instead.

Note that the client was filling in bogus data that was then lost during
encoding/decoding; clean that up.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodropping xfs test 186 due to bug: 3685
tamil [Thu, 27 Dec 2012 19:27:31 +0000 (11:27 -0800)]
dropping xfs test 186 due to bug: 3685

Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
12 years agodocs: remove extra release-process2 file.
Gary Lowell [Thu, 27 Dec 2012 19:12:27 +0000 (11:12 -0800)]
docs:  remove extra release-process2 file.

This file mostly duplicated the existing release documentation.  Differences
have been merged into the primary file.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoosd: drop 'osd recovery max active' back to previous default (5)
Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]
osd: drop 'osd recovery max active' back to previous default (5)

Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high.  In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).

Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agojournal: reduce journal max queue size
Sage Weil [Thu, 27 Dec 2012 19:11:08 +0000 (11:11 -0800)]
journal: reduce journal max queue size

Keep the journal queue size smaller than the filestore queue size.

Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: use set to store MDSMap data pools
Sage Weil [Thu, 27 Dec 2012 19:09:00 +0000 (11:09 -0800)]
mds: use set to store MDSMap data pools

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: wait for client's mdsmap when specifying data pool
Sage Weil [Wed, 26 Dec 2012 18:45:08 +0000 (10:45 -0800)]
mds: wait for client's mdsmap when specifying data pool

The client may have a newer map than we do; make sure we wait for it lest
we inadvertently reply because we think the pool doesn't exist.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: document mds config options
Sage Weil [Thu, 27 Dec 2012 17:33:27 +0000 (09:33 -0800)]
doc: document mds config options

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: journaler config options
Sage Weil [Wed, 26 Dec 2012 18:58:58 +0000 (10:58 -0800)]
doc: journaler config options

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodocs: Merge changes from release-process2 document.
Gary Lowell [Wed, 26 Dec 2012 20:54:27 +0000 (12:54 -0800)]
docs:  Merge changes from release-process2 document.

12 years agomds: add waiting_for_mdsmap queue
Sage Weil [Wed, 26 Dec 2012 18:40:53 +0000 (10:40 -0800)]
mds: add waiting_for_mdsmap queue

Defer events until we get a specific MDSMap epoch.

Signed-off-by: Sage Weil <sage@inktank.com>
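
Deferring events until a given map epoch arrives can be sketched as a map from epoch to queued callbacks. The names MapWaiters, wait_for, and advance are assumptions for illustration and are not taken from the MDS code.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using Epoch = std::uint64_t;  // hypothetical epoch type for this sketch

class MapWaiters {
 public:
  // Run fn immediately if we already have epoch `want`, otherwise queue it.
  void wait_for(Epoch want, std::function<void()> fn) {
    if (want <= current_) {
      fn();
      return;
    }
    waiting_[want].push_back(std::move(fn));
  }

  // Called when a newer map arrives; runs every waiter whose epoch is now met.
  void advance(Epoch e) {
    if (e <= current_)
      return;
    current_ = e;
    auto end = waiting_.upper_bound(e);
    std::vector<std::function<void()>> ready;
    for (auto it = waiting_.begin(); it != end; ++it)
      for (auto& fn : it->second)
        ready.push_back(std::move(fn));
    waiting_.erase(waiting_.begin(), end);
    for (auto& fn : ready)
      fn();  // run after erasing so callbacks may queue new waiters safely
  }

  Epoch current() const { return current_; }

 private:
  Epoch current_ = 0;
  std::map<Epoch, std::vector<std::function<void()>>> waiting_;
};
```

A request that names an epoch newer than the local map is parked with wait_for() and retried once advance() reaches that epoch, rather than being answered from stale state.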
12 years agomds: do not check for pool existence in osdmap
Sage Weil [Wed, 26 Dec 2012 19:42:04 +0000 (11:42 -0800)]
mds: do not check for pool existence in osdmap

We don't have a wait mechanism to ensure the MDSMap has the latest osdmap
here.  Just trust the MDSMap.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoqa: remove xfstests 172 and 173 from qemu testing
Josh Durgin [Wed, 26 Dec 2012 18:55:47 +0000 (10:55 -0800)]
qa: remove xfstests 172 and 173 from qemu testing

These seem to require newer xfs.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agodoc/man/8/mkcephfs: update --mkfs a bit
Sage Weil [Sat, 22 Dec 2012 01:00:48 +0000 (17:00 -0800)]
doc/man/8/mkcephfs: update --mkfs a bit

Document that 'devs' and 'osd mkfs type' must be defined.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: replace closed sessions on connect
Sage Weil [Sun, 16 Dec 2012 20:26:06 +0000 (12:26 -0800)]
mds: replace closed sessions on connect

If a connection comes and there is a closed session attached, remove it.
This is probably a failure of an old session to get cleaned up properly,
and in certain cases it may even be from a different client (if the addr
nonce is reused).  In that case this prevents further damage, although
a complete solution would also clean up the closed connection state if
there is a fault.  See #3630.

This fixes a hang that is reproduced by running the libcephfs
Caps.ReadZero test in a loop; eventually the client addr is reused and
we are linked to an ancient Session with a different client id.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: don't force in->first == dn->first
Sage Weil [Sat, 3 Mar 2012 23:39:06 +0000 (15:39 -0800)]
mds: don't force in->first == dn->first

The fullbit sets it now.  For multiversion inodes, its "first" can be in
the future, since this dentry may not have changed when the inode was
cowed in place.  (OTOH, the dentry cannot have changed without the inode
also having changed.)

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
12 years agomds: compare sessionmap version before replaying imported sessions
Yan, Zheng [Fri, 30 Nov 2012 01:46:31 +0000 (09:46 +0800)]
mds: compare sessionmap version before replaying imported sessions

Otherwise we may wrongly increase mds->sessionmap.version, which
will confuse future journal replays that involve the sessionmap.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix race between send_dentry_link() and cache expire
Yan, Zheng [Sat, 8 Dec 2012 14:43:32 +0000 (22:43 +0800)]
mds: fix race between send_dentry_link() and cache expire

The MDentryLink message can race with cache expire. When it arrives at
the target MDS, it's possible there is no corresponding dentry in
the cache. If this race happens, we should expire the replica inode
encoded in the MDentryLink message. But to expire an inode, the MDS
needs to know which subtree the inode belongs to, so modify the
MDentryLink message to include this information.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix file existing check in Server::handle_client_openc()
Yan, Zheng [Mon, 10 Dec 2012 07:43:44 +0000 (15:43 +0800)]
mds: fix file existing check in Server::handle_client_openc()

Creating a new file needs to be handled by the directory fragment's auth
MDS; opening an existing file in write mode needs to be handled by the
corresponding inode's auth MDS. If a file is a remote link, its parent
directory fragment's auth MDS can be different from the corresponding
inode's auth MDS. So which MDS handles a create file request can depend
on whether the corresponding file already exists.

handle_client_openc() calls rdlock_path_xlock_dentry() at the very
beginning. It always assumes the request needs to be handled by the
directory fragment's auth MDS. When handling a create file request,
if the file already exists and is remotely linked to a non-auth inode,
handle_client_openc() falls back to handle_client_open(), and
handle_client_open() forwards the request because the MDS is not the
inode's auth MDS. Then when the request arrives at the inode's auth MDS,
rdlock_path_xlock_dentry() is called and forwards the request
back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: delay processing cache expire when state >= EXPORT_EXPORTING
Yan, Zheng [Mon, 10 Dec 2012 02:06:28 +0000 (10:06 +0800)]
mds: delay processing cache expire when state >= EXPORT_EXPORTING

It's possible that the MDS receives a cache expire in the
EXPORT_LOGGINGFINISH and EXPORT_NOTIFYING states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't retry readdir request after issuing caps
Yan, Zheng [Sun, 9 Dec 2012 05:03:41 +0000 (13:03 +0800)]
mds: don't retry readdir request after issuing caps

If remote linkage without inode is encountered after some caps are
issued, Server::handle_client_readdir() should send the reply to
client immediately instead of retrying the request after opening
the remote dentry. This is because the MDS may want to revoke these
caps before the MDS succeeds in opening the remote dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: take export lock set before sending MExportDirDiscover
Yan, Zheng [Sat, 8 Dec 2012 16:53:28 +0000 (00:53 +0800)]
mds: take export lock set before sending MExportDirDiscover

Migrator::export_dir() only checks whether it can lock the export lock
set but does not take the lock set. So someone else can change the path to
the exporting dir and confuse Migrator::handle_export_discover().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: re-issue caps after importing caps
Yan, Zheng [Sun, 9 Dec 2012 05:06:33 +0000 (13:06 +0800)]
mds: re-issue caps after importing caps

The imported caps may prevent unstable locks from entering stable
states. So we should call Locker::eval_gather() with parameter
"first" set to true after caps are imported.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: always send discover if want_xlocked is true
Yan, Zheng [Fri, 7 Dec 2012 08:02:08 +0000 (16:02 +0800)]
mds: always send discover if want_xlocked is true

If want_xlocked is true, we cannot rely on a previously sent discover
because the previous discover is likely blocked on the xlocked
dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix error hanlding in MDCache::handle_discover_reply()
Yan, Zheng [Sat, 8 Dec 2012 07:07:53 +0000 (15:07 +0800)]
mds: fix error hanlding in MDCache::handle_discover_reply()

The error handling code in MDCache::handle_discover_reply() has two
main issues. First, MDCache::handle_discover_reply() does not wake waiters
if dir_auth_hint in the reply message is equal to its own nodeid. This
can happen if a discover races with subtree importing. Second, it
checks for the existence of a cached directory fragment to decide
whether it should take waiters from the inode or from the directory
fragment. The check is unreliable because subtree importing can add
directory fragments to the cache.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: set want_base_dir to false for MDCache::discover_ino()
Yan, Zheng [Sat, 8 Dec 2012 05:59:38 +0000 (13:59 +0800)]
mds: set want_base_dir to false for MDCache::discover_ino()

When a frozen inode is encountered, MDCache::handle_discover() sends
the reply immediately if the reply message is not empty. When handling
"discover ino" requests, the reply message always contains the base
directory fragment. But the requestor already has the base directory
fragment, so the only effect of the reply message is to wake the requestor
and make it send the same "discover ino" request again. So the requestor
keeps sending "discover ino" requests but can't make any progress.

The fix is to set want_base_dir to false for MDCache::discover_ino().
After setting want_base_dir to false, we also need to update the code that
handles "discover ino" errors.

This patch also removes unused error handling code for flag_error_dn.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: no bloom filter for replica dir
Yan, Zheng [Fri, 7 Dec 2012 07:59:56 +0000 (15:59 +0800)]
mds: no bloom filter for replica dir

We should delete the dir fragment's bloom filter after exporting the dir
fragment to another MDS. Otherwise the residual bloom filter may cause
problems if the MDS imports the dir fragment later.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: properly mark dirfrag dirty
Yan, Zheng [Thu, 6 Dec 2012 01:28:46 +0000 (09:28 +0800)]
mds: properly mark dirfrag dirty

If predirty_journal_parents() does not propagate changes in dir's
fragstat into corresponding inode's dirstat, it should mark the
inode as dirfrag dirty. This happens when we modify dir fragments
that are auth subtree roots.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: alllow handle_client_readdir() fetching freezing dir.
Yan, Zheng [Fri, 30 Nov 2012 00:53:33 +0000 (08:53 +0800)]
mds: alllow handle_client_readdir() fetching freezing dir.

At that point, the request has already auth-pinned and locked some objects,
so CDir::fetch() should ignore the can_auth_pin check and continue
to fetch the freezing dir.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agoMerge branch 'wip-create-layout'
Sage Weil [Mon, 24 Dec 2012 03:59:04 +0000 (19:59 -0800)]
Merge branch 'wip-create-layout'

Reviewed-by: Greg Farnum <greg@inktank.com>
The functional tests for the create operations should add and specify non-default
pools, but we don't have a set of library methods to do that yet (to interact with
the monitor).

12 years agomds: *_pg_pool -> *_pool
Sage Weil [Fri, 7 Dec 2012 20:45:19 +0000 (12:45 -0800)]
mds: *_pg_pool -> *_pool

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoclient, libcephfs: add method to get the pool name for an open file
Sage Weil [Fri, 7 Dec 2012 09:16:53 +0000 (01:16 -0800)]
client, libcephfs: add method to get the pool name for an open file

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoclient: specify data pool on create operations
Sage Weil [Thu, 6 Dec 2012 08:12:17 +0000 (00:12 -0800)]
client: specify data pool on create operations

Fill in the data pool field if specified by the client, or set to -1.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: verify that the pool id is valid on SET[DIR]LAYOUT
Sage Weil [Thu, 6 Dec 2012 08:11:00 +0000 (00:11 -0800)]
mds: verify that the pool id is valid on SET[DIR]LAYOUT

Make sure the data pool exists and is part of the MDSMap data pools list.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: allow data pool to be specfied on create
Sage Weil [Thu, 6 Dec 2012 08:10:29 +0000 (00:10 -0800)]
mds: allow data pool to be specfied on create

Reuse old preferred_pg field.  Only use if the new CREATEPOOLID feature
is present, and the value is >= 0.

Verify that the data pool is allowed, or return EINVAL to the client.

Signed-off-by: Sage Weil <sage@inktank.com>
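
The decision described above (use the requested pool only when the feature is present and the value is >= 0, and reject pools that are not allowed) can be sketched as a small function. The function name, parameter types, and the idea of passing a default pool are assumptions for illustration; only the CREATEPOOLID feature, the >= 0 check, and the EINVAL result come from the commit message.

```cpp
#include <cerrno>
#include <cstdint>
#include <set>

// Returns 0 and stores the chosen pool in *out, or -EINVAL if the
// client asked for a pool that is not in the allowed data pool set.
int choose_create_pool(bool has_createpoolid_feature,
                       std::int64_t requested,
                       const std::set<std::int64_t>& data_pools,
                       std::int64_t default_pool,
                       std::int64_t* out) {
  // Old clients, or clients that sent -1, get the default pool.
  if (!has_createpoolid_feature || requested < 0) {
    *out = default_pool;
    return 0;
  }
  // The pool must be one of the allowed data pools.
  if (data_pools.count(requested) == 0)
    return -EINVAL;
  *out = requested;
  return 0;
}
```

Ignoring the field when the feature bit is absent is what makes it safe to reuse an old field: a client that still fills it with its previous meaning is never misread.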
12 years agoclient: remove set_default_*() methods
Sage Weil [Thu, 6 Dec 2012 06:03:34 +0000 (22:03 -0800)]
client: remove set_default_*() methods

This is a poor interface.  The hadoop stuff is shifting to specify this
information on file creation instead.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix dup failure cancellations
Sage Weil [Sun, 23 Dec 2012 23:17:12 +0000 (15:17 -0800)]
osd: fix dup failure cancellations

If we had a pending failure report, and send a cancellation, take it
out of our pending list so that we don't keep resending cancellations.

Signed-off-by: Sage Weil <sage@inktank.com>
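
A minimal sketch of the fix, assuming a pending list keyed by OSD id: once a cancellation has been sent, the entry is erased, so a later "still alive" event finds nothing to cancel. The struct and member names are hypothetical, and the vector of sent cancellations exists only to make the behaviour observable.

```cpp
#include <map>
#include <vector>

struct FailureTracker {
  std::map<int, double> pending;   // osd id -> time the failure was reported
  std::vector<int> sent_cancels;   // stand-in for messages sent to the monitor

  void report_failure(int osd, double now) {
    pending.emplace(osd, now);     // no-op if a report is already pending
  }

  // The peer turned out to be alive: cancel the report exactly once.
  void still_alive(int osd) {
    auto it = pending.find(osd);
    if (it == pending.end())
      return;                      // nothing pending, nothing to cancel
    sent_cancels.push_back(osd);
    pending.erase(it);             // the step that stops repeated cancellations
  }
};
```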
12 years agoosd: make MOSDFailure output more sensible
Sage Weil [Sun, 23 Dec 2012 23:16:06 +0000 (15:16 -0800)]
osd: make MOSDFailure output more sensible

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: make osd failure report log msgs sensible
Sage Weil [Sun, 23 Dec 2012 23:11:39 +0000 (15:11 -0800)]
mon: make osd failure report log msgs sensible

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-scrub' into next
Sage Weil [Sun, 23 Dec 2012 22:42:51 +0000 (14:42 -0800)]
Merge branch 'wip-scrub' into next

Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
src/osd/PG.cc

12 years agomonclient: fix get_monmap_privately retry interval
Sage Weil [Sun, 23 Dec 2012 21:29:08 +0000 (13:29 -0800)]
monclient: fix get_monmap_privately retry interval

Use mon_client_hunt_interval (default 3) instead of hardcoding 1 second.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMakefile: fix 'base' rule
Sage Weil [Sun, 23 Dec 2012 04:56:45 +0000 (20:56 -0800)]
Makefile: fix 'base' rule

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'next'
Sage Weil [Sun, 23 Dec 2012 19:19:39 +0000 (11:19 -0800)]
Merge branch 'next'

12 years agoinit-ceph,mkcephfs: default inode64 for mounting xfs
Sage Weil [Sun, 23 Dec 2012 19:18:45 +0000 (11:18 -0800)]
init-ceph,mkcephfs: default inode64 for mounting xfs

According to hch this is now the default for new kernels.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoinit-ceph: default osd_data path
Sage Weil [Sat, 22 Dec 2012 19:10:03 +0000 (11:10 -0800)]
init-ceph: default osd_data path

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoOSD: always do a deep scrub when repairing
Samuel Just [Sat, 22 Dec 2012 01:21:59 +0000 (17:21 -0800)]
OSD: always do a deep scrub when repairing

Otherwise, errors turned up in a deep-scrub will be
swept under the rug without being repaired.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: don't use a self-transition for WaitRemoteRecoveryReserved
Samuel Just [Sat, 22 Dec 2012 00:51:40 +0000 (16:51 -0800)]
PG: don't use a self-transition for WaitRemoteRecoveryReserved

Previously, using the state on active worked, but now we might
go back through WaitRemoteRecoveryReserved without resetting
Active.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: Handle repair once in scrub_finish
Samuel Just [Fri, 21 Dec 2012 23:39:50 +0000 (15:39 -0800)]
PG: Handle repair once in scrub_finish

We don't want to change missing sets during a chunky
scrub since it would cause !is_clean() and derail
the rest of the scrub.  Instead, move the missing,
inconsistent, and authoritative sets into scrubber
and add to during scrub_compare_maps().  Then,
handle repairing objects all at once in scrub_finish().

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoimport_export.sh: sparse import export
Dan Mick [Fri, 21 Dec 2012 03:53:07 +0000 (19:53 -0800)]
import_export.sh: sparse import export

Add tests for:
   - sparse import makes expected sparse images
   - sparse export makes expected sparse files
   - sparse import from stdin also creates sparse images
   - import from partially-sparse file leads to partially-sparse image
   - import from stdin with zeros leads to sparse
   - export from zeros-image to file leads to sparse file

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: harder-working sparse import from stdin
Dan Mick [Sat, 8 Dec 2012 06:57:06 +0000 (22:57 -0800)]
rbd: harder-working sparse import from stdin

Try to accumulate image-sized blocks when importing from stdin, even if
each read is shorter than requested; if we get a full block, and it's
all zeroes, we can seek and make a sparse output file

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: check for all-zero buf in export, seek output if so
Dan Mick [Thu, 20 Dec 2012 22:00:12 +0000 (14:00 -0800)]
rbd: check for all-zero buf in export, seek output if so

Use buf_is_zero in common/util.cc

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
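
The idea of skipping all-zero blocks on export can be sketched as below. This is not the buf_is_zero helper from common/util.cc, whose implementation is not shown here; all_zero and extents_to_write are hypothetical names, and the sketch returns the extents that would be written rather than performing any seek or I/O.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// True if every byte in [buf, buf + len) is zero.
bool all_zero(const char* buf, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    if (buf[i] != 0)
      return false;
  return true;
}

// Split data into fixed-size blocks and return (offset, length) for each
// block that must be written; all-zero blocks are skipped, which is where
// a real exporter would seek forward and leave a hole in the output.
std::vector<std::pair<std::size_t, std::size_t>>
extents_to_write(const std::vector<char>& data, std::size_t block) {
  std::vector<std::pair<std::size_t, std::size_t>> out;
  if (block == 0)
    return out;
  for (std::size_t off = 0; off < data.size(); off += block) {
    const std::size_t len = std::min(block, data.size() - off);
    if (!all_zero(data.data() + off, len))
      out.emplace_back(off, len);
  }
  return out;
}
```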
12 years agolibrbd: move buf_is_zero() to new common/util.cc and include/util.h
Dan Mick [Thu, 20 Dec 2012 21:58:55 +0000 (13:58 -0800)]
librbd: move buf_is_zero() to new common/util.cc and include/util.h

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoosd: fix pg stat msgs vs timeout
Sage Weil [Sat, 22 Dec 2012 00:47:50 +0000 (16:47 -0800)]
osd: fix pg stat msgs vs timeout

We can get a pattern like so:

- new mon session
- after say 120 seconds, we decide to send a stats msg
- outstanding_pg_stats is finally true, we immediately time out (30 second
  grace), and reconnect to a new mon
-> repeat

The problem is that we don't reset the last_sent timestamp when we send.
Or that we do this check after sending instead of before.  Fix both.

This should resolve the issue #3661 where osds that don't have pgs
updating are not sending stats messages to the mon to check in, and are
eventually getting marked down as a result.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
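
Both halves of the fix (check for a timeout before sending, and reset the last-sent timestamp on send) can be sketched with a small state machine. The class, its tick() interface, and the use of plain seconds as doubles are assumptions for illustration; the 120 and 30 second figures in the usage note come from the commit message.

```cpp
// Hypothetical tracker for one monitor session.
class StatsSender {
 public:
  explicit StatsSender(double grace_seconds) : grace_(grace_seconds) {}

  // Called periodically. Returns true if the caller should reconnect
  // to a new monitor because an outstanding message went unacknowledged.
  bool tick(double now, bool want_send) {
    // Check before sending, so a message sent in this tick cannot
    // be judged against a stale timestamp.
    if (outstanding_ && now - last_sent_ > grace_) {
      outstanding_ = false;
      return true;
    }
    if (want_send && !outstanding_) {
      outstanding_ = true;
      last_sent_ = now;  // reset the timestamp at send time
    }
    return false;
  }

  // The monitor acknowledged the stats message.
  void ack() { outstanding_ = false; }

 private:
  double grace_;
  double last_sent_ = 0.0;
  bool outstanding_ = false;
};
```

With a 30 second grace, a first send at t=120 no longer times out immediately, because the age of the message is measured from 120 rather than from the start of the session.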
12 years agodoc: Added new journaler page to CephFS section. Needs descriptions.
John Wilkins [Sat, 22 Dec 2012 00:14:53 +0000 (16:14 -0800)]
doc: Added new journaler page to CephFS section. Needs descriptions.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added Journaler Configuration to toc tree.
John Wilkins [Sat, 22 Dec 2012 00:14:23 +0000 (16:14 -0800)]
doc: Added Journaler Configuration to toc tree.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added --mkfs options.
John Wilkins [Sat, 22 Dec 2012 00:09:09 +0000 (16:09 -0800)]
doc: Added --mkfs options.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added running multiple clusters. Per Tommi.
John Wilkins [Sat, 22 Dec 2012 00:08:05 +0000 (16:08 -0800)]
doc: Added running multiple clusters. Per Tommi.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated the Configuration File section.
John Wilkins [Sat, 22 Dec 2012 00:07:27 +0000 (16:07 -0800)]
doc: Updated the Configuration File section.

- Replaced ceph.conf with Ceph configuration to clarify
  when running multiple clusters on the same hardware.
- Added a [client] entry so people know it can be set too.
- Updated existing auth example.
- Added an authentication section with a link to the cephx guide.
- Added section for running multiple clusters. Per Tommi.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoPG::scrub_compare_maps increment scrubber.fixed for missing repairs
Samuel Just [Fri, 21 Dec 2012 23:20:22 +0000 (15:20 -0800)]
PG::scrub_compare_maps increment scrubber.fixed for missing repairs

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG::_compare_scrubmaps: increment scrubber.errors on missing object
Samuel Just [Fri, 21 Dec 2012 23:16:19 +0000 (15:16 -0800)]
PG::_compare_scrubmaps: increment scrubber.errors on missing object

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agodoc: Added sudo the ceph health for when cephx is on.
John Wilkins [Fri, 21 Dec 2012 22:54:18 +0000 (14:54 -0800)]
doc: Added sudo the ceph health for when cephx is on.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: minor fix to syntax.
John Wilkins [Fri, 21 Dec 2012 22:53:28 +0000 (14:53 -0800)]
doc: minor fix to syntax.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agomkcephfs: error out if 'devs' defined but 'osd fs type' not defined
Sage Weil [Fri, 21 Dec 2012 22:23:14 +0000 (14:23 -0800)]
mkcephfs: error out if 'devs' defined but 'osd fs type' not defined

We can infer btrfs if they use btrfs devs, but if they use devs there is
no default fs.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: update ceph.conf examples about btrfs default
Sage Weil [Fri, 21 Dec 2012 22:04:30 +0000 (14:04 -0800)]
doc: update ceph.conf examples about btrfs default

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-scrub' into next
Sage Weil [Fri, 21 Dec 2012 21:56:16 +0000 (13:56 -0800)]
Merge remote-tracking branch 'gh/wip-scrub' into next

12 years agoMerge remote-tracking branch 'gh/wip-3643' into next
Sage Weil [Fri, 21 Dec 2012 21:45:39 +0000 (13:45 -0800)]
Merge remote-tracking branch 'gh/wip-3643' into next

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agomonc: only warn about missing keyring if we fail to authenticate
Sage Weil [Fri, 21 Dec 2012 21:44:19 +0000 (13:44 -0800)]
monc: only warn about missing keyring if we fail to authenticate

This avoids the situation where a librados or other user with the default
of 'cephx,none' and no keyring is authenticating against a cluster with
required of 'none' and an annoying warning is generated every time.  Now
we only print a helpful message if we actually failed.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: clear CLEAN on exit from Clean state
Sage Weil [Fri, 21 Dec 2012 19:44:35 +0000 (11:44 -0800)]
osd: clear CLEAN on exit from Clean state

This means we can drop the scrub repair state_clear() call.  We probably
can drop others, but let's leave that for another day.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoauth: use none auth if keyring not found
Yehuda Sadeh [Fri, 21 Dec 2012 20:14:40 +0000 (12:14 -0800)]
auth: use none auth if keyring not found

If both cephx and none are accepted auth methods, and
cephx keyring cannot be found then resort to using
none, instead of failing.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
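
The fallback rule can be written as a small selection function. The enum, the function name, and the representation of accepted methods as strings are assumptions for illustration, not the actual auth code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class Auth { Cephx, None, Fail };

// Pick an auth method from the accepted list, falling back to "none"
// when cephx is accepted but no keyring could be found.
Auth pick_auth(const std::vector<std::string>& accepted, bool keyring_found) {
  const bool cephx =
      std::find(accepted.begin(), accepted.end(), "cephx") != accepted.end();
  const bool none =
      std::find(accepted.begin(), accepted.end(), "none") != accepted.end();
  if (cephx && keyring_found)
    return Auth::Cephx;
  if (none)
    return Auth::None;   // fallback instead of failing
  return Auth::Fail;     // cephx required but no keyring, or nothing accepted
}
```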
12 years agoPG::sched_scrub: only set PG_STATE_DEEP_SCRUB once reserved
Samuel Just [Fri, 21 Dec 2012 19:36:04 +0000 (11:36 -0800)]
PG::sched_scrub: only set PG_STATE_DEEP_SCRUB once reserved

Otherwise we would have +DEEP before we have +SCRUB.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG::sched_scrub: return true if scrub newly kicked off
Samuel Just [Fri, 21 Dec 2012 19:33:45 +0000 (11:33 -0800)]
PG::sched_scrub: return true if scrub newly kicked off

The previous return value wasn't really what OSD::sched_scrub
wanted to know.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: allow transition from Clean -> WaitLocalRecoveryReserved for repair
Sage Weil [Fri, 21 Dec 2012 19:37:48 +0000 (11:37 -0800)]
osd: allow transition from Clean -> WaitLocalRecoveryReserved for repair

If we do a scrub repair, we need to go from clean to recovery again to
copy objects around.

This fixes a simple repair of a missing object, either on the primary or
replica.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoPG: in sched_scrub() set PG_STATE_DEEP_SCRUB not scrubber.deep
Samuel Just [Fri, 21 Dec 2012 19:17:23 +0000 (11:17 -0800)]
PG: in sched_scrub() set PG_STATE_DEEP_SCRUB not scrubber.deep

scrubber.deep gets reset in scrub() to match
state_test(PG_STATE_DEEP_SCRUB).

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: clear scrub state if queued scrub doesn't start
Sage Weil [Fri, 21 Dec 2012 06:01:34 +0000 (22:01 -0800)]
osd: clear scrub state if queued scrub doesn't start

We set SCRUBBING when we queue a pg for scrub.  If we dequeue and
call scrub() but abort for some reason (!active, degraded, etc.), clear
that state bit.

Bug is easily reproduced with 'ceph osd scrub N' during cluster startup
when PGs are peering; some PGs can get left in the scrubbing state.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: Moved path to individual OSD entires.
John Wilkins [Fri, 21 Dec 2012 18:15:38 +0000 (10:15 -0800)]
doc: Moved path to individual OSD entires.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoosd: only dec_scrubs_active if we were active
Sage Weil [Fri, 21 Dec 2012 05:45:09 +0000 (21:45 -0800)]
osd: only dec_scrubs_active if we were active

This fixes a bug that puts scrubs_active negative.

Signed-off-by: Sage Weil <sage@inktank.com>
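
One way to keep such a counter from going negative is to remember per PG whether it actually took a slot, and only decrement in that case. The sketch below assumes hypothetical ScrubSlots and PgScrub types; only the helper names inc_scrubs_active and dec_scrubs_active echo the commit subjects.

```cpp
// Hypothetical OSD-wide counter of active scrubs.
class ScrubSlots {
 public:
  void inc_scrubs_active() { ++active_; }
  void dec_scrubs_active() { --active_; }
  int active() const { return active_; }

 private:
  int active_ = 0;
};

// Hypothetical per-PG scrub state.
struct PgScrub {
  bool active = false;

  void start(ScrubSlots& slots) {
    if (!active) {
      slots.inc_scrubs_active();
      active = true;
    }
  }

  // Safe to call on any exit path, including ones where the scrub
  // never became active.
  void clear(ScrubSlots& slots) {
    if (active) {
      slots.dec_scrubs_active();
      active = false;
    }
  }
};
```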
12 years agoosd: reintroduce inc_scrubs_active helper
Sage Weil [Fri, 21 Dec 2012 05:44:34 +0000 (21:44 -0800)]
osd: reintroduce inc_scrubs_active helper

This mostly generates nice debug output.  It also slightly simplifies
code and makes things symmetric.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Fri, 21 Dec 2012 01:43:51 +0000 (17:43 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agoMerge remote-tracking branch 'upstream/wip_notify' into next
Samuel Just [Fri, 21 Dec 2012 00:23:23 +0000 (16:23 -0800)]
Merge remote-tracking branch 'upstream/wip_notify' into next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agocephtool: mention ceph osd ls, fix ceph osd tell N bench
Dan Mick [Thu, 20 Dec 2012 23:31:21 +0000 (15:31 -0800)]
cephtool: mention ceph osd ls, fix ceph osd tell N bench

Add ceph osd ls to help; make help for ceph osd tell N bench look
more like injectargs, which says <osd-id or *> to make it clear you
can benchmark all osds simultaneously

Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agorgw: remove noisy log message
Yehuda Sadeh [Thu, 20 Dec 2012 23:32:59 +0000 (15:32 -0800)]
rgw: remove noisy log message

No need for that log message.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>