git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
11 years agoFix some style and checking issue 2152/head
Xiaoxi Chen [Mon, 28 Jul 2014 16:42:10 +0000 (00:42 +0800)]
Fix some style and checking issue

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
11 years agoPGMonitor: fix bug in caculating pool avail space
Xiaoxi Chen [Mon, 28 Jul 2014 08:54:48 +0000 (16:54 +0800)]
PGMonitor: fix bug in caculating pool avail space

Currently, for pools that use different rules, "ceph df" cannot report
the correct available space for each pool. For a detailed assessment
of the bug, please refer to bug report #8943.

This patch fixes the bug and makes "ceph df" work correctly.

Fixes Bug #8943

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
11 years agoRevert "Merge pull request #2129 from ceph/wip-librbd-oc"
Sage Weil [Sun, 27 Jul 2014 04:19:34 +0000 (21:19 -0700)]
Revert "Merge pull request #2129 from ceph/wip-librbd-oc"

This reverts commit 74b386f03e4ca9970256db72c575589aea077534, reversing
changes made to 36265d0db0d7c0eb31d25a0f77ac233b3fd198f8.

The dirty_or_tx list is used by flush_set, which means we can
resubmit new IOs for writes that are already in progress.  This
has a compounding effect that overwhelms the OSDs with dup IOs
and stalls out the client.

See, for example, the failures in this run:
  /a/sage-2014-07-25_17:14:20-fs-wip-msgr-testing-basic-plana

The fix is probably pretty simple, but reverting for now to make
the tests pass.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sat, 26 Jul 2014 04:42:35 +0000 (21:42 -0700)]
Merge remote-tracking branch 'gh/next'

Conflicts:
src/osdc/Journaler.h

11 years agomds: fix journal reformat failure in standbyreplay
John Spray [Thu, 17 Jul 2014 12:15:45 +0000 (13:15 +0100)]
mds: fix journal reformat failure in standbyreplay

In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, where this should have only been done by the active
MDS for the rank.  Depending on timing, this could cause
fatal corruption of the journal.

This change handles the following cases:
* only do reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide

Fixes: #8811
Signed-off-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 5438500af8979fda32e61714ae40b71c7ffdfd15)

11 years agoMerge pull request #2112 from ceph/wip-rbd-defaults
Sage Weil [Fri, 25 Jul 2014 22:23:25 +0000 (15:23 -0700)]
Merge pull request #2112 from ceph/wip-rbd-defaults

respect rbd_default_* parameters in /usr/bin/rbd

Reviewed-by: Sage Weil <sage@redhat.com>
11 years agoMerge pull request #2145 from ceph/wip-ref-put
Dan Mick [Fri, 25 Jul 2014 20:19:42 +0000 (13:19 -0700)]
Merge pull request #2145 from ceph/wip-ref-put

common/RefCountedObject: fix use-after-free in debug print

Reviewed-by: Dan Mick <dan.mick@inktank.com>
11 years agocommon/RefCountedObject: fix use-after-free in debug print 2145/head
Sage Weil [Fri, 25 Jul 2014 20:17:32 +0000 (13:17 -0700)]
common/RefCountedObject: fix use-after-free in debug print

We could race with another thread that deletes this right after we call
dec().  Our access of cct would then become a use-after-free.  Valgrind
managed to turn this up.

Copy it into a local variable before the dec() to be safe, and move the
dout line below to make this possibility explicit and obvious in the code.

Signed-off-by: Sage Weil <sage@redhat.com>
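
The copy-before-dec() pattern described in this commit can be sketched as follows. This is a minimal illustration, not the actual RefCountedObject code: the `Context` and `RefCounted` types are hypothetical stand-ins, and the log vector is only there to make the ordering visible (it is not thread-safe).

```cpp
#include <atomic>
#include <string>
#include <vector>

// Hypothetical stand-in for the context object used for debug output.
struct Context {
  std::vector<std::string> log;
};

// Hypothetical ref-counted base class (an assumption, not Ceph's class).
class RefCounted {
  std::atomic<int> nref{1};
  Context *cct;

public:
  explicit RefCounted(Context *c) : cct(c) {}
  virtual ~RefCounted() {}

  RefCounted *get() {
    ++nref;
    return this;
  }

  void put() {
    // Copy the member into a local *before* dropping our reference.
    Context *local_cct = cct;
    int v = --nref;
    if (v == 0)
      delete this;
    // From here on 'this' may already be freed (by us, or by another
    // thread that dropped the last reference), so only locals are used.
    if (local_cct)
      local_cct->log.push_back("put -> " + std::to_string(v));
  }
};
```

Placing the debug line after the decrement makes it obvious in the code that no member may be read at that point.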
11 years agoMerge pull request #2143 from ceph/wip-rgw-align
Josh Durgin [Fri, 25 Jul 2014 18:36:29 +0000 (11:36 -0700)]
Merge pull request #2143 from ceph/wip-rgw-align

Wip rgw align

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorgw: object write should not exceed part size 2143/head
Yehuda Sadeh [Thu, 24 Jul 2014 22:30:27 +0000 (15:30 -0700)]
rgw: object write should not exceed part size

Fixes: #8928
This can happen if the stripe size is not a multiple of the chunk size.

Backport: firefly

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
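
The overrun described in this commit can be illustrated with a small sketch. The helper names `next_write_len` and `plan_writes` and their parameters are hypothetical, not taken from the rgw code; the sketch only shows clamping each write to what is left in the current part when the part size is not a multiple of the chunk size.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the rgw implementation): length of the next
// write starting at offset `ofs` within a part of `part_size` bytes.
uint64_t next_write_len(uint64_t ofs, uint64_t part_size, uint64_t chunk_size) {
  if (ofs >= part_size)
    return 0;  // the part is already full
  return std::min(chunk_size, part_size - ofs);
}

// Split one part into writes. The last write is shortened when part_size
// is not a multiple of chunk_size, so no write crosses the part boundary.
std::vector<uint64_t> plan_writes(uint64_t part_size, uint64_t chunk_size) {
  std::vector<uint64_t> lens;
  uint64_t ofs = 0;
  while (uint64_t len = next_write_len(ofs, part_size, chunk_size)) {
    lens.push_back(len);
    ofs += len;
  }
  return lens;
}
```

With a part size of 10 and a chunk size of 4, the planned writes are 4, 4 and 2 bytes rather than three full chunks.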
11 years agorgw: align object chunk size with pool alignment
Yehuda Sadeh [Tue, 22 Jul 2014 22:30:11 +0000 (15:30 -0700)]
rgw: align object chunk size with pool alignment

Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
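
One way to derive an aligned max_chunk_size is sketched below. The function name and the rounding policy (round down to a multiple of the alignment, but never below one alignment unit) are assumptions made for illustration; the commit message does not spell out the exact rule.

```cpp
#include <cstdint>

// Hypothetical helper: choose a max chunk size compatible with a pool's
// required write alignment. alignment == 0 means "no requirement".
uint64_t aligned_max_chunk_size(uint64_t configured, uint64_t alignment) {
  if (alignment == 0)
    return configured;
  if (configured <= alignment)
    return alignment;  // smallest chunk that satisfies the alignment
  // Assumed policy: round down to the nearest multiple of the alignment.
  return configured - (configured % alignment);
}
```

For example, a configured 4194304-byte chunk with a 1310720-byte alignment becomes 3932160 bytes (three alignment units).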
11 years agoMerge pull request #2141 from ceph/wip-8882
Sage Weil [Fri, 25 Jul 2014 17:34:33 +0000 (10:34 -0700)]
Merge pull request #2141 from ceph/wip-8882

osd: set pg flag INCOMPLETE_CLONES when turning off cache pool

Reviewed-by: Greg Farnum <greg@inktank.com>
First patch Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>

11 years agodoc: Add additional hyperlink to Cache Tiering defaults.
John Wilkins [Fri, 25 Jul 2014 16:55:52 +0000 (09:55 -0700)]
doc: Add additional hyperlink to Cache Tiering defaults.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
11 years agodoc: Update doc from user feedback.
John Wilkins [Fri, 25 Jul 2014 16:55:28 +0000 (09:55 -0700)]
doc: Update doc from user feedback.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
11 years agoMerge pull request #2142 from ceph/wip-data-pool
Sage Weil [Fri, 25 Jul 2014 16:03:34 +0000 (09:03 -0700)]
Merge pull request #2142 from ceph/wip-data-pool

test: catch a straggler still using 'data' pool

Reviewed-by: Sage Weil <sage@redhat.com>
11 years agotest: catch a straggler still using 'data' pool 2142/head
John Spray [Fri, 25 Jul 2014 16:01:39 +0000 (17:01 +0100)]
test: catch a straggler still using 'data' pool

Used rbd pool instead, which is still created by default.

Signed-off-by: John Spray <john.spray@redhat.com>
11 years agodoc: Updated mon doc per feedback. Fixed hyperlinks.
John Wilkins [Thu, 24 Jul 2014 23:00:52 +0000 (16:00 -0700)]
doc: Updated mon doc per feedback. Fixed hyperlinks.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
11 years agoMerge pull request #2079 from nereocystis/seq_read_bench-args
Gregory Farnum [Thu, 24 Jul 2014 21:36:21 +0000 (14:36 -0700)]
Merge pull request #2079 from nereocystis/seq_read_bench-args

Make the declaration argument names match those in the implementation (as used by callers).

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agodoc: update radosgw man page with available opts
Abhishek Lekshmanan [Thu, 24 Jul 2014 15:00:43 +0000 (20:30 +0530)]
doc: update radosgw man page with available opts

Fixes: #8112

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
11 years agorgw: list all available options during help()
Abhishek Lekshmanan [Thu, 24 Jul 2014 15:00:42 +0000 (20:30 +0530)]
rgw: list all available options during help()

Adding the available help arguments from the man page

Fixes: #8112
Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
11 years agorgw: format help options to align with the rest
Abhishek Lekshmanan [Thu, 24 Jul 2014 15:00:41 +0000 (20:30 +0530)]
rgw: format help options to align with the rest

Whitespace removal to make all help options align in a similar fashion

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
11 years agoosd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone() 2141/head
Sage Weil [Thu, 24 Jul 2014 01:25:53 +0000 (18:25 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone()

We cannot assume that just because cache_mode is NONE that we will have
all clones present; check for the absence of the INCOMPLETE_CLONES flag
here too.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd/ReplicatedPG: observed INCOMPLETE_CLONES when doing clone subsets
Sage Weil [Thu, 24 Jul 2014 01:24:51 +0000 (18:24 -0700)]
osd/ReplicatedPG: observed INCOMPLETE_CLONES when doing clone subsets

During recovery, we can clone subsets if we know that all clones will be
present.  We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
Sage Weil [Thu, 24 Jul 2014 01:23:56 +0000 (18:23 -0700)]
osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set

When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set.  Both are
indicators that we may be missing clones and that that is okay.

Fixes: #8882
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd/osd_types: add pg_pool_t FLAG_COMPLETE_CLONES
Sage Weil [Thu, 24 Jul 2014 01:21:38 +0000 (18:21 -0700)]
osd/osd_types: add pg_pool_t FLAG_COMPLETE_CLONES

Set a flag on the pg_pool_t when we change cache_mode to or from NONE.  This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain.  Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agomon/OSDMonitor: improve no-op cache_mode set check
Sage Weil [Thu, 24 Jul 2014 17:06:31 +0000 (10:06 -0700)]
mon/OSDMonitor: improve no-op cache_mode set check

If we have a pending pool value but the cache_mode hasn't changed, this is
still a no-op (and we don't need to block).

Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Thu, 24 Jul 2014 02:14:52 +0000 (19:14 -0700)]
Merge remote-tracking branch 'gh/next'

11 years agoMerge pull request #2127 from ceph/wip-8701
Sage Weil [Thu, 24 Jul 2014 02:13:55 +0000 (19:13 -0700)]
Merge pull request #2127 from ceph/wip-8701

filestore: fix collection_move behavior

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #2140 from ceph/wip-8889
Sage Weil [Thu, 24 Jul 2014 02:13:11 +0000 (19:13 -0700)]
Merge pull request #2140 from ceph/wip-8889

osd: greedily get obc write lock in some cases

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoceph_test_objectstore: clean up on finish of MoveRename 2127/head
Sage Weil [Tue, 22 Jul 2014 13:53:41 +0000 (06:53 -0700)]
ceph_test_objectstore: clean up on finish of MoveRename

Otherwise, we leave collections around, and the next test fails.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/LFNIndex: use FDCloser for fsync_dir
Sage Weil [Mon, 21 Jul 2014 20:45:21 +0000 (13:45 -0700)]
os/LFNIndex: use FDCloser for fsync_dir

This prevents an fd leak when maybe_inject_failure() throws an exception.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/LFNIndex: only consider alt xattr if nlink > 1
Sage Weil [Sat, 19 Jul 2014 06:16:09 +0000 (23:16 -0700)]
os/LFNIndex: only consider alt xattr if nlink > 1

If we are doing a lookup and the main xattr fails to match, we'll check if there is an
alt xattr.  If it exists, but the nlink on the inode is only 1, we will
kill the xattr.  This cleans up the mess left over by an incomplete
lfn_unlink operation.

This resolves the problem with an lfn_link to a second long name that
hashes to the same short_name: we will ignore the old name the moment the
old link goes away.

Fixes: #8701
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/LFNIndex: remove alt xattr after unlink
Sage Weil [Sat, 19 Jul 2014 00:28:18 +0000 (17:28 -0700)]
os/LFNIndex: remove alt xattr after unlink

After we unlink, if the nlink on the inode is still non-zero, remove the
alt xattr.  We can *only* do this after the rename or unlink operation
because we don't want to leave a file system link in place without the
matching xattr; hence the fsync_dir() call.

Note that this might leak an alt xattr if we happen to fail after the
rename/unlink but before the removexattr is committed.  We'll fix that
next.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/LFNIndex: FDCloser helper
Sage Weil [Mon, 21 Jul 2014 20:43:42 +0000 (13:43 -0700)]
os/LFNIndex: FDCloser helper

Add a helper to close fd's when we leave scope.  This is important when
injecting failures by throwing exceptions.

Signed-off-by: Sage Weil <sage@redhat.com>
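
A scope-based fd closer of the kind described here might look like the sketch below. Only the name FDCloser comes from the commit; the body, the `fsync_dir_sketch` function, and its `inject_failure` and `used_fd` parameters are assumptions added for illustration.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>

// Closes the wrapped fd when the guard leaves scope, including when the
// scope is left because an exception was thrown.
struct FDCloser {
  int fd;
  explicit FDCloser(int f) : fd(f) {}
  ~FDCloser() {
    if (fd >= 0)
      ::close(fd);
  }
  FDCloser(const FDCloser &) = delete;
  FDCloser &operator=(const FDCloser &) = delete;
};

// Hypothetical caller: fsync a directory, optionally simulating an
// injected failure. `used_fd` reports the fd that was opened so a caller
// can check afterwards that it was closed.
int fsync_dir_sketch(const char *path, bool inject_failure, int *used_fd) {
  int fd = ::open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  if (used_fd)
    *used_fd = fd;
  FDCloser closer(fd);
  if (inject_failure)
    throw std::runtime_error("injected failure");  // fd is still closed
  return ::fsync(fd);
}
```

Without the guard, the throw would skip any explicit close() call placed at the end of the function and leak the descriptor.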
11 years agoos/LFNIndex: handle long object names with multiple links (i.e., rename)
Sage Weil [Sat, 19 Jul 2014 00:09:07 +0000 (17:09 -0700)]
os/LFNIndex: handle long object names with multiple links (i.e., rename)

When we rename an object (collection_move_rename) to a different name, and
the name is long, we run into problems because the lfn xattr can only track
a single long name linking to the inode.  For example, suppose we have

foobar -> foo_123_0 (attr: foobar) where foobar hashes to 123.

At first, collection_add could only link a file to another file in a
different collection with the same name. Allowing collection_move_rename
to rename the file, however, means that we have to convert:

col1/foobar -> foo_123_0 (attr: foobar)

to

col1/foobaz -> foo_234_0 (attr: foobaz)

This is a problem because if we link, reset xattr, unlink we end up with

col1/foobar -> foo_123_0 (attr: foobaz)

if we restart after we reset the attr.  This will cause the initial foobar
lookup to fail since the attr doesn't match, and the file won't be able to be
looked up.

Fix this by allowing *two* (long) names to link to the same inode.  If we
lfn_link a second (different) name, move the previous name to the "alt"
xattr and set the new name.  (This works because link is always followed
by unlink.)  On lookup, check either xattr.

Don't even bother to remove the alt xattr on unlink.  This works as long
as the old name and new name don't hash to the same shortname and end up
in the same LFN chain.  (Don't worry, we'll fix that next.)

Fixes part of #8701
Signed-off-by: Sage Weil <sage@redhat.com>
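
The two-name scheme described in this commit can be modelled in a few lines. This is a simplified sketch, not the LFNIndex code: `InodeAttrs`, `lfn_link_sketch` and `lfn_matches` are hypothetical names, and the xattrs are represented as plain strings.

```cpp
#include <string>

// Hypothetical model of the two xattrs stored on one inode.
struct InodeAttrs {
  std::string lfn;  // current long name
  std::string alt;  // previous long name, kept while both links exist
};

// Linking a second, different long name moves the previous name into the
// alt slot and records the new one.
void lfn_link_sketch(InodeAttrs &attrs, const std::string &long_name) {
  if (!attrs.lfn.empty() && attrs.lfn != long_name)
    attrs.alt = attrs.lfn;
  attrs.lfn = long_name;
}

// A lookup succeeds if either xattr matches the requested long name.
bool lfn_matches(const InodeAttrs &attrs, const std::string &long_name) {
  return attrs.lfn == long_name ||
         (!attrs.alt.empty() && attrs.alt == long_name);
}
```

In this model, a restart between the link and the unlink leaves both foobar and foobaz resolvable, which is the property the commit is after.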
11 years agoceph_test_objectstore: fix warning
Sage Weil [Fri, 18 Jul 2014 22:46:58 +0000 (15:46 -0700)]
ceph_test_objectstore: fix warning

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agostore_test: add long name collection_move_rename tests
Samuel Just [Tue, 15 Jul 2014 21:50:33 +0000 (14:50 -0700)]
store_test: add long name collection_move_rename tests

Currently fails.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoceph.spec.in: add bash completion file for radosgw-admin
Dan Mick [Thu, 3 Jul 2014 23:11:24 +0000 (16:11 -0700)]
ceph.spec.in: add bash completion file for radosgw-admin

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit b70096307130bcbac176704493a63c5d039d3edc)

11 years agoceph.spec.in: rhel7-related changes:
Dan Mick [Thu, 3 Jul 2014 23:10:55 +0000 (16:10 -0700)]
ceph.spec.in: rhel7-related changes:

udev rules: /lib -> /usr/lib
/sbin binaries move to /usr/sbin or %{_sbindir}

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit 235e4c7de8f8efe491edefbdde8e5da4dfc44034)

11 years agoFix/add missing dependencies:
Dan Mick [Thu, 3 Jul 2014 23:08:44 +0000 (16:08 -0700)]
Fix/add missing dependencies:

- rbd-fuse depends on librados2/librbd1
- ceph-devel depends on specific releases of libs and libcephfs_jni1
- librbd1 depends on librados2
- python-ceph does not depend on libcephfs1

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit 7cf81322391b629b241da90181800ca1f138ce78)

11 years agoceph.spec.in: whitespace fixes
Dan Mick [Thu, 3 Jul 2014 23:05:00 +0000 (16:05 -0700)]
ceph.spec.in: whitespace fixes

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit ec8af52a5ede78511423a1455a496d46d580c644)

11 years agoceph.spec.in: split out ceph-common as in Debian
Dan Mick [Thu, 3 Jul 2014 23:04:10 +0000 (16:04 -0700)]
ceph.spec.in: split out ceph-common as in Debian

Move files, postun scriptlet, and add dependencies on ceph-common
where appropriate

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit e131b9d5a5e90e87d8a8346cb96cb5a26135c144)

11 years agocommon/random_cache: fix typo
Sage Weil [Wed, 23 Jul 2014 17:11:59 +0000 (10:11 -0700)]
common/random_cache: fix typo

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoMerge pull request #2136 from yuyuyu101/fix-randomcache
Sage Weil [Wed, 23 Jul 2014 16:57:59 +0000 (09:57 -0700)]
Merge pull request #2136 from yuyuyu101/fix-randomcache

common/RandomCache: Fix inconsistence between contents and count

Reviewed-by: Sage Weil <sage@redhat.com>
11 years agocommon/RandomCache: Fix inconsistence between contents and count 2136/head
Haomai Wang [Wed, 23 Jul 2014 03:26:18 +0000 (11:26 +0800)]
common/RandomCache: Fix inconsistence between contents and count

The add/clear methods may leave count inconsistent with the actual size of
contents.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
11 years agoosd/ReplicatedPG: debug obc locks 2140/head
Sage Weil [Wed, 23 Jul 2014 01:01:14 +0000 (18:01 -0700)]
osd/ReplicatedPG: debug obc locks

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
Sage Weil [Tue, 22 Jul 2014 20:16:11 +0000 (13:16 -0700)]
osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir

In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e., we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation.  We don't care about starvation here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.

Fixes: #8889
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd: allow greedy get_write() for ObjectContext locks
Sage Weil [Tue, 22 Jul 2014 20:11:42 +0000 (13:11 -0700)]
osd: allow greedy get_write() for ObjectContext locks

There are several lockers that need to take a write lock
because an operation is already in progress and they
know it is safe to do so.  In particular, they need to skip
the starvation checks (op waiters, backfill waiting).

Signed-off-by: Sage Weil <sage@redhat.com>
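
The difference between the normal and the greedy write-lock path can be sketched with a simplified lock state. This is an illustrative model only: the `RWState` struct below, its fields, and the integer op ids are assumptions, not the ObjectContext implementation.

```cpp
#include <deque>

// Simplified, hypothetical model of a per-object read/write lock state.
struct RWState {
  enum State { NONE, READ, WRITE } state = NONE;
  int count = 0;
  std::deque<int> waiters;        // ids of ops queued behind the lock
  bool backfill_waiting = false;  // backfill wants to read the object

  bool get_write(bool greedy = false) {
    if (!greedy) {
      // Normal path: back off so queued ops and backfill are not starved.
      if (!waiters.empty() || backfill_waiting)
        return false;
    }
    if (state == READ)
      return false;  // readers still hold the lock
    state = WRITE;
    ++count;
    return true;
  }

  void put_write() {
    if (--count == 0)
      state = NONE;
  }
};
```

A greedy caller still cannot take the lock away from active readers; it only skips the fairness checks, which is acceptable when its op is already in progress.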
11 years agoMerge pull request #2120 from ceph/wip-8858
Josh Durgin [Tue, 22 Jul 2014 23:58:25 +0000 (16:58 -0700)]
Merge pull request #2120 from ceph/wip-8858

Wip 8858

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agoMerge pull request #2133 from ceph/wip-8897
Gregory Farnum [Tue, 22 Jul 2014 22:36:40 +0000 (15:36 -0700)]
Merge pull request #2133 from ceph/wip-8897

os: fix build warnings with name/attr len checks (fixes 8889)

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #2128 from ceph/wip-8851
João Eduardo Luís [Tue, 22 Jul 2014 21:10:17 +0000 (22:10 +0100)]
Merge pull request #2128 from ceph/wip-8851

mon: AuthMonitor: always encode full regardless of keyserver having keys

Reviewed-by: Gregory Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
11 years agoos: make name/attr max methods unsigned 2133/head
Sage Weil [Tue, 22 Jul 2014 20:38:32 +0000 (13:38 -0700)]
os: make name/attr max methods unsigned

This fixes warnings when we use these in MIN/MAX macros against
other unsigned values.

Fixes: #8897
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/KeyValueStore: make get_max_object_name_length() sane
Sage Weil [Tue, 22 Jul 2014 20:37:20 +0000 (13:37 -0700)]
os/KeyValueStore: make get_max_object_name_length() sane

This is getting the NAME_MAX from the OS, but in reality the backend
KV store is the limiter.  And for leveldb, there is no real limit.
Return 4096 for now.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoMerge pull request #2129 from ceph/wip-librbd-oc
Josh Durgin [Tue, 22 Jul 2014 20:33:24 +0000 (13:33 -0700)]
Merge pull request #2129 from ceph/wip-librbd-oc

librbd: reduce cache flush overhead

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
11 years agoMerge pull request #2125 from ceph/wip-memstore
Sage Weil [Tue, 22 Jul 2014 17:52:40 +0000 (10:52 -0700)]
Merge pull request #2125 from ceph/wip-memstore

memstore: a few fixes, and enable the tests!

Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
11 years agoMerge pull request #2105 from rootfs/wip-qa-hadoop-wordcount
Sage Weil [Tue, 22 Jul 2014 15:42:03 +0000 (08:42 -0700)]
Merge pull request #2105 from rootfs/wip-qa-hadoop-wordcount

update hadoop-wordcount test to be able to run on hadoop 2.x.

Reviewed-by: Sage Weil <sage@redhat.com>
11 years agouncomment cleanup command 2105/head
rootfs [Tue, 22 Jul 2014 15:31:37 +0000 (11:31 -0400)]
uncomment cleanup command

11 years agopowerdns: RADOS Gateway backend for bucket directioning
Wido den Hollander [Tue, 14 Jan 2014 15:39:00 +0000 (16:39 +0100)]
powerdns: RADOS Gateway backend for bucket directioning

This backend can be used to create one global namespace for multiple
RGW regions.

Using a CNAME DNS response the traffic is directed towards the RGW region
without using HTTP redirects.

11 years agomon: AuthMonitor: always encode full regardless of keyserver having keys 2128/head
Joao Eduardo Luis [Mon, 21 Jul 2014 23:25:37 +0000 (00:25 +0100)]
mon: AuthMonitor: always encode full regardless of keyserver having keys

On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers.  A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.

As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals.  This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk.  This is performed in update_from_paxos(), and the
AuthMonitor will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.

Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first.  If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals with and keeps track
of auth global ids.  As such, we expect to read version 1, then version 2,
and so on.  If we trim at some point however this will not be possible,
as version 1 will not exist -- and we will assert because of that.

This is fixed by ensuring the AuthMonitor keeps track of full versions
of the key server, even if it's of an empty key server -- it will still
keep track of the key server's version, which is incremented each time
we update from paxos even if it is empty.

Fixes: #8851
Backport: dumpling, firefly

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
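
The failure mode described in this commit can be reproduced with a toy model of the on-disk store. Everything below is a hypothetical sketch (the `PaxosStore` type, integer state, and `update_from_store` are assumptions), meant only to show why a restart after trimming needs a full version to start from.

```cpp
#include <map>
#include <stdexcept>

// Hypothetical on-disk store: incrementals and full snapshots by version.
struct PaxosStore {
  std::map<unsigned, int> incrementals;  // version -> delta
  std::map<unsigned, int> fulls;         // version -> full state
  unsigned last_committed = 0;

  // Drop every version older than `first`, as trimming would.
  void trim(unsigned first) {
    incrementals.erase(incrementals.begin(), incrementals.lower_bound(first));
    fulls.erase(fulls.begin(), fulls.lower_bound(first));
  }
};

// Rebuild the in-memory state after a restart: start from the latest full
// version if there is one (else from version 0), then apply incrementals.
int update_from_store(const PaxosStore &s) {
  unsigned version = 0;
  int state = 0;
  if (!s.fulls.empty()) {
    auto latest = s.fulls.rbegin();
    version = latest->first;
    state = latest->second;
  }
  while (version < s.last_committed) {
    auto it = s.incrementals.find(version + 1);
    if (it == s.incrementals.end())
      throw std::runtime_error("missing incremental");  // the assert case
    state += it->second;
    ++version;
  }
  return state;
}
```

With no full version stored, trimming versions 1 to 3 makes the replay fail at version 1; storing a full at version 4 lets the replay start there instead.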
11 years agoosd: init local_connection for fast_dispatch in _send_boot()
Ma, Jianpeng [Mon, 14 Jul 2014 03:17:14 +0000 (03:17 +0000)]
osd: init local_connection for fast_dispatch in _send_boot()

We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.

That led to errors like the following:

When doing an ec-read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9061988ec7eaa922e2b303d9eece86e7c8ee0fa1)

11 years agoObjectCacher: fix bh_{add,remove} dirty_or_tx_bh accounting 2129/head
Josh Durgin [Mon, 21 Jul 2014 21:09:48 +0000 (14:09 -0700)]
ObjectCacher: fix bh_{add,remove} dirty_or_tx_bh accounting

tx buffers need to go on the bh_lru_rest as well, and removing a buffer
erases it from (rather than inserts it into) dirty_or_tx_bh.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
11 years agoObjectCacher: fix dirty_or_tx_bh logic in bh_set_state()
Josh Durgin [Mon, 21 Jul 2014 21:08:44 +0000 (14:08 -0700)]
ObjectCacher: fix dirty_or_tx_bh logic in bh_set_state()

The else-if chain here was wrong. Handling dirty or tx buffers and
errors should be in independent conditions.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
11 years agoWait tx state buffer in flush_set
Haomai Wang [Wed, 16 Jul 2014 06:34:22 +0000 (14:34 +0800)]
Wait tx state buffer in flush_set

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
11 years agoAdd rbdcache max dirty object option
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option

Librbd calculates the maximum number of dirty objects from rbd_cache_max_size,
which is not suitable for every case. If the user sets image order 24, the
calculated result is too small in practice. This increases the overhead of the
trim call, which runs on each read/write op.

Now we make it an option for tuning; by default this value is still calculated.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
11 years agoReduce ObjectCacher flush overhead
Haomai Wang [Mon, 14 Jul 2014 06:32:57 +0000 (14:32 +0800)]
Reduce ObjectCacher flush overhead

A flush op in ObjectCacher iterates over the whole active object set, and each
dirty object may own several BufferHeads. If the object set is large,
this consumes too much time.

Use dirty_bh instead to reduce overhead. Now only dirty BufferHeads are
checked.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
11 years agoosd: init local_connection for fast_dispatch in _send_boot()
Ma, Jianpeng [Mon, 14 Jul 2014 03:17:14 +0000 (03:17 +0000)]
osd: init local_connection for fast_dispatch in _send_boot()

We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.

That led to errors like the following:

When doing an ec-read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #2121 from ceph/wip-dencoder
Sage Weil [Mon, 21 Jul 2014 20:10:02 +0000 (13:10 -0700)]
Merge pull request #2121 from ceph/wip-dencoder

limit leveldb linkage; move ceph-dencoder back into ceph-common

Reviewed-by: Dan Mick <dan.mick@inktank.com>
RGW patch Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

11 years agoMerge pull request #2067 from thorstenb/wip-janitorial-clang-3
Sage Weil [Mon, 21 Jul 2014 16:08:31 +0000 (09:08 -0700)]
Merge pull request #2067 from thorstenb/wip-janitorial-clang-3

[werror] Fix mismatched tags (struct vs. class) inconsistence

Reviewed-by: Sage Weil <sage@redhat.com>
11 years agoFix mismatched tags (struct vs. class) inconsistency 2067/head
Thorsten Behrens [Mon, 21 Jul 2014 15:07:21 +0000 (17:07 +0200)]
Fix mismatched tags (struct vs. class) inconsistency

Signed-off-by: Thorsten Behrens <tbehrens@suse.com>
11 years agoMerge pull request #2111 from ceph/wip-8174
Sage Weil [Sun, 20 Jul 2014 21:21:09 +0000 (14:21 -0700)]
Merge pull request #2111 from ceph/wip-8174

osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)

Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
and the first patch was
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoos/FileStore: fix max object name limit 2111/head
Sage Weil [Sun, 20 Jul 2014 14:48:47 +0000 (07:48 -0700)]
os/FileStore: fix max object name limit

Our max object name is not limited by file name size, but by the length of
the name we can stuff in an xattr.  That will vary from file system to
file system, so just make this 4096.  In practice, it should be limited
via the global tunable, if it is adjusted at all.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoceph_test_objectstore: test memstore 2125/head
Sage Weil [Sat, 19 Jul 2014 20:55:22 +0000 (13:55 -0700)]
ceph_test_objectstore: test memstore

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/MemStore: copy attrs on clone
Sage Weil [Sat, 19 Jul 2014 20:54:23 +0000 (13:54 -0700)]
os/MemStore: copy attrs on clone

Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoos/MemStore: fix wrlock ordering checks
Sage Weil [Sat, 19 Jul 2014 20:24:21 +0000 (13:24 -0700)]
os/MemStore: fix wrlock ordering checks

We can't compare the shared_ptrs themselves; we need to compare the
addresses of the actual objects.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd/MemStore: handle collection_move_rename within the same collection
Sage Weil [Fri, 18 Jul 2014 23:24:07 +0000 (16:24 -0700)]
osd/MemStore: handle collection_move_rename within the same collection

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoceph-dencoder: don't link librgw.la (and rados, etc.) 2121/head
Sage Weil [Sat, 19 Jul 2014 05:44:51 +0000 (22:44 -0700)]
ceph-dencoder: don't link librgw.la (and rados, etc.)

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agorgw: move a bunch of stuff into rgw_dencoder
Sage Weil [Sat, 19 Jul 2014 05:27:25 +0000 (22:27 -0700)]
rgw: move a bunch of stuff into rgw_dencoder

This will help out ceph-dencoder ...

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agolibosd_types, libos_types, libmon_types
Sage Weil [Sat, 19 Jul 2014 04:58:29 +0000 (21:58 -0700)]
libosd_types, libos_types, libmon_types

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoRevert "ceph.spec: move ceph-dencoder to ceph from ceph-common"
Sage Weil [Sat, 19 Jul 2014 03:55:39 +0000 (20:55 -0700)]
Revert "ceph.spec: move ceph-dencoder to ceph from ceph-common"

This reverts commit 95f5a448b52db545a2b9bbad47fdb287254f93ea.

11 years agoRevert "debian: move ceph-dencoder to ceph from ceph-common"
Sage Weil [Sat, 19 Jul 2014 03:55:35 +0000 (20:55 -0700)]
Revert "debian: move ceph-dencoder to ceph from ceph-common"

This reverts commit b37e3bde3bd31287b11c069062280258666df7c5.

11 years agounittest_osdmap: revert a few broken changes
Sage Weil [Fri, 18 Jul 2014 23:49:46 +0000 (16:49 -0700)]
unittest_osdmap: revert a few broken changes

From commit 80ea6067f790b9431ae6744c38a034833e8ad4ab.

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agorgw: dump prefix unconditionally 2120/head
Yehuda Sadeh [Fri, 18 Jul 2014 21:52:48 +0000 (14:52 -0700)]
rgw: dump prefix unconditionally

As part of issue #8858, and to be more in line with S3, dump the Prefix
field when listing a bucket even if the bucket is empty.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
11 years agorgw: list extra objects to set truncation flag correctly
Yehuda Sadeh [Thu, 17 Jul 2014 22:48:26 +0000 (15:48 -0700)]
rgw: list extra objects to set truncation flag correctly

Otherwise we end up returning the wrong truncated value, and no data on the
next iteration.
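
A minimal sketch of the idea, assuming an in-memory sorted key list instead of the real bucket index: ask for one entry more than the caller wants, and use the presence of that extra entry to decide the truncation flag. The function name and signature are hypothetical.

```python
def list_page(sorted_keys, marker, max_keys):
    """Return (page, truncated) for the keys that sort after marker."""
    candidates = [k for k in sorted_keys if k > marker]
    fetched = candidates[:max_keys + 1]   # list one extra object
    truncated = len(fetched) > max_keys   # the extra one proves more data exists
    return fetched[:max_keys], truncated
```

With exactly max_keys entries remaining, the flag comes back false, so the client does not issue a follow-up request that returns nothing.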

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agorgw: account common prefixes for MaxKeys in bucket listing
Yehuda Sadeh [Thu, 17 Jul 2014 18:45:44 +0000 (11:45 -0700)]
rgw: account common prefixes for MaxKeys in bucket listing

To be more in line with the S3 API. Previously we didn't count the
common prefixes towards MaxKeys (a single common prefix counts as a
single key). We also need to adjust the marker now if it is pointing at a
common prefix.
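
A small sketch of the accounting described above, assuming an in-memory key list and a hypothetical function name; it is not the rgw code. Each distinct common prefix consumes one slot of max_keys, exactly like a plain object.

```python
def list_delimited(keys, prefix, delimiter, max_keys):
    """Return (objects, common_prefixes); a common prefix counts as one key."""
    objects, common = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut >= 0:
            entry = prefix + rest[:cut + len(delimiter)]
            if common and common[-1] == entry:
                continue  # this common prefix was already counted once
            target = common
        else:
            entry = key
            target = objects
        if len(objects) + len(common) >= max_keys:
            break  # objects and common prefixes share the same budget
        target.append(entry)
    return objects, common
```

Keys under the same common prefix are adjacent in sorted order, which is why comparing against the last recorded prefix is enough to avoid counting it twice.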

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agorgw: add NextMarker param for bucket listing
Yehuda Sadeh [Thu, 17 Jul 2014 18:24:51 +0000 (11:24 -0700)]
rgw: add NextMarker param for bucket listing

Partially fixes #8858.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agorgw: fix decoding + characters in URL
devicenull [Fri, 18 Jul 2014 14:25:51 +0000 (10:25 -0400)]
rgw: fix decoding + characters in URL

Fixes: #8702
Backport: firefly

Only decode + characters to spaces if we're in a query argument.
The + => ' ' translation is not correct for file/directory names.
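
The same distinction exists in Python's standard library, which makes for a compact illustration (this is not the rgw code; the two wrapper names are hypothetical): `unquote` leaves + alone, while `unquote_plus` also turns it into a space.

```python
from urllib.parse import unquote, unquote_plus


def decode_path(raw):
    """Decode %XX escapes only; a literal + in a path stays a +."""
    return unquote(raw)


def decode_query_value(raw):
    """Decode %XX escapes and translate + to a space (query arguments only)."""
    return unquote_plus(raw)
```

For example, `decode_path("/bucket/a+b%20c")` keeps the plus sign, whereas `decode_query_value("a+b%20c")` yields `a b c`.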

Resolves http://tracker.ceph.com/issues/8702

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Brian Rak <dn@devicenull.org>
11 years agocrushtool: Send output to stdout instead of stderr
Wido den Hollander [Thu, 17 Jul 2014 21:19:27 +0000 (23:19 +0200)]
crushtool: Send output to stdout instead of stderr

A lot of output was sent to stderr instead of stdout and vice versa.

Error messages should go to stderr, but all other output to stdout.
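
The convention can be sketched in a few lines of Python. The helper name is hypothetical and this is not crushtool's code; the stream parameters exist only so the behaviour can be checked without touching the real streams.

```python
import sys


def report(message, error=False, out=None, err=None):
    """Print errors to stderr and every other message to stdout."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    print(message, file=err if error else out)
```

Keeping the two streams apart means regular output can be piped or redirected to a file without error text mixed into it.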

11 years agoMerge pull request #2115 from ceph/wip-8811
Gregory Farnum [Fri, 18 Jul 2014 18:17:52 +0000 (11:17 -0700)]
Merge pull request #2115 from ceph/wip-8811

Make standby-replay MDSes much more careful about journal formats; both changing them and generally being aware.

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agorgw: improve delimited listing of bucket
Yehuda Sadeh [Wed, 16 Jul 2014 22:21:09 +0000 (15:21 -0700)]
rgw: improve delimited listing of bucket

If we find a prefix, calculate a string greater than it so that the next
request can skip to it. This is still not the most efficient way to
do it. It would be better to push it down to the objclass, but that would
require a much bigger change.
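
One way to compute such a string, sketched in Python over code points (an assumption; the actual rgw change works on its own string encoding): bump the last character of the prefix, so the result sorts after every key that starts with the prefix. Both function names are hypothetical.

```python
import bisect


def marker_after_prefix(prefix):
    """Return a string that sorts after every key starting with prefix."""
    chars = list(prefix)
    while chars:
        last = ord(chars[-1])
        if last < 0x10FFFF:
            chars[-1] = chr(last + 1)  # smallest bump that leaves the prefix
            return "".join(chars)
        chars.pop()  # last char is already the maximum; carry to the left
    return None  # empty prefix: no key sorts after all keys


def index_after_prefix(sorted_keys, prefix):
    """Index of the first key that is not under prefix, found by seeking."""
    marker = marker_after_prefix(prefix)
    if marker is None:
        return len(sorted_keys)
    return bisect.bisect_left(sorted_keys, marker)
```

Seeking to the computed string skips all keys under the prefix in one step instead of iterating over each of them.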

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agoutf8: export encode_utf8() and decode_utf8()
Yehuda Sadeh [Wed, 16 Jul 2014 23:05:58 +0000 (16:05 -0700)]
utf8: export encode_utf8() and decode_utf8()

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agoosd: add config for osd_max_attr_name_len = 100
Sage Weil [Fri, 18 Jul 2014 17:44:49 +0000 (10:44 -0700)]
osd: add config for osd_max_attr_name_len = 100

Set a limit on the length of an attr name.  The fs can only take 128
bytes, but we were not imposing any limit.

Add a test.

Reported-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoos: add ObjectStore::get_max_attr_name_length()
Sage Weil [Fri, 18 Jul 2014 17:42:11 +0000 (10:42 -0700)]
os: add ObjectStore::get_max_attr_name_length()

Most importantly, capture that attrs on FileStore can't be more than about
100 chars.  The Linux xattrs can only be 128 chars, but we also have some
prefixing we do.
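
A sketch of how a store-reported maximum and a configured maximum might be combined, assuming the effective limit is the smaller of the two (the commit message does not say how they interact). The function and parameter names are hypothetical; 100 mirrors the default in the osd_max_attr_name_len commit title.

```python
def attr_name_allowed(name, store_max_len, configured_max_len=100):
    """Check an attr name against the tighter of the two limits, in bytes."""
    limit = min(store_max_len, configured_max_len)
    return len(name.encode("utf-8")) <= limit
```

Measuring the encoded byte length rather than the character count keeps the check conservative for non-ASCII names.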

Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)
Sage Weil [Wed, 16 Jul 2014 21:17:27 +0000 (14:17 -0700)]
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)

Previously we had a hard-coded limit of 4096.  Object names > 3k crash the OSD
when running on ext4, although they probably work on xfs.  But rgw only
generates object names a bit over 1024 bytes (maybe 1200 tops?), so let's set a
more reasonable limit here.  2048 is a nice round number and should be
safe.

Add a test.

Fixes: #8174
Signed-off-by: Sage Weil <sage@redhat.com>
11 years agoosdc: refactor JOURNAL_FORMAT_* constants to enum 2115/head
John Spray [Fri, 18 Jul 2014 17:39:37 +0000 (18:39 +0100)]
osdc: refactor JOURNAL_FORMAT_* constants to enum

...so that the upper limit doesn't have to be updated
by hand.
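
The benefit can be illustrated with a Python enum, where the upper limit is derived from the members rather than written out separately. The member names and values here are hypothetical placeholders, not the constants from the Ceph source.

```python
from enum import IntEnum


class JournalFormat(IntEnum):
    """Hypothetical journal format versions, oldest first."""
    LEGACY = 0
    RESILIENT = 1


# Derived from the enum, so adding a member updates the limit automatically.
JOURNAL_FORMAT_MAX = max(JournalFormat)


def format_supported(value):
    """True if value is a format version this code knows about."""
    return 0 <= value <= JOURNAL_FORMAT_MAX
```

Adding a new member is then a one-line change, with no separate maximum to forget.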

Signed-off-by: John Spray <john.spray@redhat.com>
11 years agodoc: fix example s/inspect/journal inspect/
John Spray [Thu, 17 Jul 2014 12:26:55 +0000 (13:26 +0100)]
doc: fix example s/inspect/journal inspect/

Signed-off-by: John Spray <john.spray@redhat.com>
11 years agomds: fix journal reformat failure in standbyreplay
John Spray [Thu, 17 Jul 2014 12:15:45 +0000 (13:15 +0100)]
mds: fix journal reformat failure in standbyreplay

In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, where this should have only been done by the active
MDS for the rank.  Depending on timing, this could cause
fatal corruption of the journal.

This change handles the following cases:
* only do reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide
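
The three cases above can be read as a small decision function. This is a sketch with hypothetical names and string results, not the MDS code; the behaviour for a missing header on an active MDS is not covered by the message, so the sketch just reports it as an error.

```python
def journal_action(header_present, on_disk_format, wanted_format,
                   max_supported_format, standby_replay):
    """Return the action for a journal header seen at replay time."""
    if not header_present:
        # Header may be mid-rewrite by the active MDS: retry in standbyreplay.
        return "EAGAIN" if standby_replay else "error"
    if on_disk_format > max_supported_format:
        return "suicide"  # written by newer code than this daemon understands
    if on_disk_format < wanted_format:
        # Only the active MDS for the rank may reformat; a standby waits.
        return "EAGAIN" if standby_replay else "reformat"
    return "replay"
```

The key property is that no path lets a standbyreplay daemon write to the journal.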

Fixes: #8811
Signed-off-by: John Spray <john.spray@redhat.com>
11 years agoosdc/Journaler: validate header on load and save
John Spray [Thu, 17 Jul 2014 12:15:10 +0000 (13:15 +0100)]
osdc/Journaler: validate header on load and save

Previously if the journal header contained invalid
write, expire or trimmed offsets, we would end up
hitting a hard-to-understand assertion much later.

Instead, raise the error right away if the fields
are identifiably bad at load time, and assert that
they're valid before persisting them.
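
A sketch of the load/save split, under the assumption that a sane header satisfies trimmed <= expire <= write (the message names the three offsets but not the exact rule). The dict layout and function names are hypothetical.

```python
def header_is_valid(header):
    """Assumed invariant: 0 <= trimmed_pos <= expire_pos <= write_pos."""
    return 0 <= header["trimmed_pos"] <= header["expire_pos"] <= header["write_pos"]


def load_header(header):
    """On load, bad on-disk data is reported as an error right away."""
    if not header_is_valid(header):
        raise ValueError("invalid journal header: %r" % (header,))
    return header


def save_header(header, sink):
    """On save, a bad header is a bug in our own code, so assert."""
    assert header_is_valid(header), "refusing to persist invalid header"
    sink.append(dict(header))
```

Load-time problems come from external data and surface as errors, while save-time problems indicate an internal bug and trip an assertion at the point of the mistake.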

Signed-off-by: John Spray <john.spray@redhat.com>
11 years agoMerge pull request #2104 from ceph/wip-dencoder
Sage Weil [Fri, 18 Jul 2014 17:29:50 +0000 (10:29 -0700)]
Merge pull request #2104 from ceph/wip-dencoder

move ceph-dencoder to ceph from ceph-common

Reviewed-by: Dan Mick <dan.mick@inktank.com>
11 years agoMerge pull request #2114 from ceph/wip-vstart
Sage Weil [Fri, 18 Jul 2014 17:27:51 +0000 (10:27 -0700)]
Merge pull request #2114 from ceph/wip-vstart

vstart.sh: default to 3 osds

Not-NAKed-by: John Spray <john.spray@inktank.com>
11 years agotest: add a missing semicolon
John Spray [Fri, 18 Jul 2014 17:00:44 +0000 (18:00 +0100)]
test: add a missing semicolon

Broke in df8f48628.

Signed-off-by: John Spray <john.spray@redhat.com>