git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
11 years agoosd: Add hit_set_flushing to track current flushes and prevent races 1405/head
David Zafman [Fri, 7 Mar 2014 02:08:46 +0000 (18:08 -0800)]
osd: Add hit_set_flushing to track current flushes and prevent races

When flushing a HitSet, track it in the hit_set_flushing map so that
agent_load_hit_sets() doesn't try to read it too soon.

Fixes: #7575
Signed-off-by: David Zafman <david.zafman@inktank.com>
11 years agoMerge pull request #1386 from ceph/wip-7624
Sage Weil [Thu, 6 Mar 2014 19:01:30 +0000 (11:01 -0800)]
Merge pull request #1386 from ceph/wip-7624

ReplicatedPG: ensure clones are readable after find_object_context

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1387 from ceph/wip-7618
Sage Weil [Thu, 6 Mar 2014 18:59:47 +0000 (10:59 -0800)]
Merge pull request #1387 from ceph/wip-7618

ReplicatedPG::wait_for_degraded_object: only recover if found

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1379 from ceph/wip-7562
Gregory Farnum [Thu, 6 Mar 2014 18:44:42 +0000 (10:44 -0800)]
Merge pull request #1379 from ceph/wip-7562

mon: make quorum list (by name) be in quorum order

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoReplicatedPG: ensure clones are readable after find_object_context 1386/head
Samuel Just [Thu, 6 Mar 2014 01:39:42 +0000 (17:39 -0800)]
ReplicatedPG: ensure clones are readable after find_object_context

We only get EAGAIN if the object is missing.  We also need the
clone to be readable if we are reading it.

The other find_object_context callers already require !degraded.

Fixes: #7624
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1378 from ceph/wip-7487
Gregory Farnum [Thu, 6 Mar 2014 04:58:40 +0000 (20:58 -0800)]
Merge pull request #1378 from ceph/wip-7487

mon: no crush buckets with type 0 (#7487)

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #1380 from ceph/wip-pool-delete
João Eduardo Luís [Thu, 6 Mar 2014 01:38:33 +0000 (01:38 +0000)]
Merge pull request #1380 from ceph/wip-pool-delete

mon/OSDMonitor: fix pool deletion races

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
11 years agomon/OSDMonitor: fix pool deletion checks, races 1380/head
Sage Weil [Wed, 5 Mar 2014 23:58:52 +0000 (15:58 -0800)]
mon/OSDMonitor: fix pool deletion checks, races

Unify the pool deletion safety checks into a single set of functions.
Make sure we check the committed state and error out if there is a problem.
Also check the pending state, if any, and delay+retry if there is a
problem there.

This ensures that we correctly verify that a pool is not in use when it
is deleted (by another tier or by cephfs).  These checks are also now
applied to librados calls.

Fixes: #7590
Signed-off-by: Sage Weil <sage@inktank.com>
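
The two-stage check described in this commit (committed state first, then pending state) can be sketched roughly as follows. This is only an illustration: the function name, the string return values, and the use of plain sets to represent "pools in use" are assumptions made for the example and do not mirror the OSDMonitor code.

```python
def check_pool_delete(pool, committed_in_use, pending_in_use=None):
    """Decide how to handle a pool deletion request.

    committed_in_use: hypothetical set of pool names that the committed
        map says are still in use (by another tier or by cephfs).
    pending_in_use: the same for the pending (not yet committed) map,
        or None if there is no pending state.
    """
    # A problem in the committed state is a hard error.
    if pool in committed_in_use:
        return "error"
    # A problem only in the pending state means: wait for the commit, retry.
    if pending_in_use is not None and pool in pending_in_use:
        return "retry"
    return "ok"


print(check_pool_delete("cache", committed_in_use={"cache"}))
print(check_pool_delete("cache", committed_in_use=set(), pending_in_use={"cache"}))
print(check_pool_delete("cache", committed_in_use=set()))
```

Keeping the committed check ahead of the pending check means a request that is already known to be invalid fails immediately instead of being delayed.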
11 years agoReplicatedPG::wait_for_degraded_object: only recover if found 1387/head
Samuel Just [Wed, 5 Mar 2014 23:51:10 +0000 (15:51 -0800)]
ReplicatedPG::wait_for_degraded_object: only recover if found

Fixes: #7618
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1381 from ceph/wip-7618
Samuel Just [Wed, 5 Mar 2014 23:31:57 +0000 (15:31 -0800)]
Merge pull request #1381 from ceph/wip-7618

ReplicatedPG::recover_replicas: do not assume that missing objects are u...

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1382 from ceph/wip-7616
Samuel Just [Wed, 5 Mar 2014 23:18:19 +0000 (15:18 -0800)]
Merge pull request #1382 from ceph/wip-7616

Wip 7616

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoReplicatedPG::recover_replicas: do not assume that missing objects are unfound 1381/head
Samuel Just [Wed, 5 Mar 2014 21:16:18 +0000 (13:16 -0800)]
ReplicatedPG::recover_replicas: do not assume that missing objects are unfound

Fixes: #7618
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoRevert "c_read_operations.cc: fix resource leak"
David Zafman [Wed, 5 Mar 2014 23:08:41 +0000 (15:08 -0800)]
Revert "c_read_operations.cc: fix resource leak"

This reverts commit 3cd751b0a280909510c3e633cc8cd4b9f5e3b2d9.

A rados_release_read_op() has already been performed, but coverity
didn't recognize that as releasing memory.

Fixes: #7621
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agomon: make quorum list (by name) be in quorum order 1379/head
Sage Weil [Wed, 5 Mar 2014 22:28:49 +0000 (14:28 -0800)]
mon: make quorum list (by name) be in quorum order

Fixes: #7562
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1343 from ceph/wip-cache-warn-full
Gregory Farnum [Wed, 5 Mar 2014 22:22:34 +0000 (14:22 -0800)]
Merge pull request #1343 from ceph/wip-cache-warn-full

mon: warn when cache tier is full

Reviewed-by: Loic Dachary <loic@dachary.org>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #1372 from ceph/wip-7607
Sage Weil [Wed, 5 Mar 2014 22:21:45 +0000 (14:21 -0800)]
Merge pull request #1372 from ceph/wip-7607

wip 7607

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1351 from ceph/wip-7248
Sage Weil [Wed, 5 Mar 2014 22:18:28 +0000 (14:18 -0800)]
Merge pull request #1351 from ceph/wip-7248

osd: OSD: limit the value of 'size' and 'count' on 'osd bench'

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agotest: merge unittest_crushwrapper and unittest_crush_wrapper 1378/head
Sage Weil [Wed, 5 Mar 2014 21:17:45 +0000 (13:17 -0800)]
test: merge unittest_crushwrapper and unittest_crush_wrapper

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomon/OSDMonitor: disallow crush buckets of type 0
Sage Weil [Wed, 5 Mar 2014 21:15:58 +0000 (13:15 -0800)]
mon/OSDMonitor: disallow crush buckets of type 0

Prevent creation of buckets of type 0 ('osd', 'device', etc.), as they
will confuse the mapping algorithm.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoPGBackend::rollback_stash: remove the correct shard 1382/head
Samuel Just [Wed, 5 Mar 2014 20:51:08 +0000 (12:51 -0800)]
PGBackend::rollback_stash: remove the correct shard

Fixes: #7616
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoFileStore::_collection_move_rename: propagate EEXIST
Samuel Just [Wed, 5 Mar 2014 20:50:43 +0000 (12:50 -0800)]
FileStore::_collection_move_rename: propagate EEXIST

Previously, an EEXIST would get masked by the subsequent clone
operation.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoqa/workunits/mon/crush_ops: use expect_false
Sage Weil [Wed, 5 Mar 2014 20:52:08 +0000 (12:52 -0800)]
qa/workunits/mon/crush_ops: use expect_false

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1376 from ceph/wip-7608
Sage Weil [Wed, 5 Mar 2014 20:35:56 +0000 (12:35 -0800)]
Merge pull request #1376 from ceph/wip-7608

test: Fix tiering test cases to use --force-nonempty

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agotest: Fix tiering test cases to use --force-nonempty 1376/head
David Zafman [Wed, 5 Mar 2014 20:31:29 +0000 (12:31 -0800)]
test: Fix tiering test cases to use --force-nonempty

Fixes: #7608
Signed-off-by: David Zafman <david.zafman@inktank.com>
11 years agomon: warn when pool nears target max objects/bytes 1343/head
Sage Weil [Wed, 5 Mar 2014 18:58:37 +0000 (10:58 -0800)]
mon: warn when pool nears target max objects/bytes

The cache pools will throttle when they reach the target max size, so it
is important to make the administrator aware when they approach that point.
Unfortunately it is not particularly easy to efficiently keep track of
which PGs have hit their limit and use that for reporting.  However, it
is easy to raise a flag when we start to approach the target for the
entire pool, and that sort of early warning is arguably more useful
anyway.

Trigger the warning based on the target full ratio.  Not when we hit the
target, but when we are 2/3 between it and completely full.

Implements: #7442
Signed-off-by: Sage Weil <sage@inktank.com>
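
As a rough worked example of the threshold described in this commit, the sketch below computes the point two thirds of the way between the target full ratio and completely full. The function name and the example ratio of 0.8 are assumptions chosen for illustration; the monitor code may represent ratios differently.

```python
def warn_ratio(target_full_ratio: float) -> float:
    """Return the usage ratio at which the warning would trigger.

    Hypothetical helper: the point 2/3 of the way from the target
    full ratio to completely full (1.0).
    """
    return target_full_ratio + (1.0 - target_full_ratio) * 2.0 / 3.0


# With a target full ratio of 0.8 the warning point is roughly 0.933.
print(round(warn_ratio(0.8), 3))
```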
11 years agoMerge pull request #1375 from ceph/wip-pgmap-stat
Sage Weil [Wed, 5 Mar 2014 19:07:03 +0000 (11:07 -0800)]
Merge pull request #1375 from ceph/wip-pgmap-stat

mon/PGMap: return empty stats if pool is not in sum

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agomon/PGMap: return empty stats if pool is not in sum 1375/head
Sage Weil [Wed, 5 Mar 2014 18:44:41 +0000 (10:44 -0800)]
mon/PGMap: return empty stats if pool is not in sum

Greg was right!

When a pool is created, the PGs are not added to the PGMap until the *next*
proposal.  Weaken the assert here and return empty stats for non-existent
(new) pools so that a pool create + tier add sequence does not crash.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1373 from ceph/wip-crush-json
Sage Weil [Wed, 5 Mar 2014 16:52:45 +0000 (08:52 -0800)]
Merge pull request #1373 from ceph/wip-crush-json

crush: revise JSON format for 'item' type

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agocrush: revise JSON format for 'item' type 1373/head
John Spray [Wed, 5 Mar 2014 15:50:53 +0000 (15:50 +0000)]
crush: revise JSON format for 'item' type

Commit a7e9a7b648 changed the JSON format of CRUSH rules
such that the 'item' attribute on a step was sometimes
an integer and sometimes a string.

This commit separates the integer and string representations
so that tools which rely on a 'item' consistently being an
integer ID will work.

Signed-off-by: John Spray <john.spray@inktank.com>
11 years agoReplicatedPG::fill_in_copy_get: fix omap loop conditions 1372/head
Samuel Just [Wed, 5 Mar 2014 01:05:36 +0000 (17:05 -0800)]
ReplicatedPG::fill_in_copy_get: fix omap loop conditions

cursor.omap_offset indicates the most recently recovered key; we continue
filling in at the smallest key k | k > cursor.omap_offset.  If the loop
as written terminates due to !(left > 0), iter points at the next key to
copy, rather than the last key copied, resulting in the next copy
operation skipping that key.

Now, iter, if valid, must point to the last key copied once the loop has
completed since we check left <= 0 prior to advancing iter.  We can
therefore use it to fill in cursor.omap_offset.

Signed-off-by: Samuel Just <sam.just@inktank.com>
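
The cursor invariant described in this commit (the cursor holds the last key copied, and the next chunk resumes at the smallest key greater than it) can be illustrated with a simplified sketch. It uses a key-count budget rather than a byte budget, and every name in it is hypothetical rather than taken from ReplicatedPG.

```python
def copy_chunk(omap, cursor, max_keys):
    """Copy up to max_keys entries whose key is strictly greater than cursor.

    Returns (chunk, new_cursor), where new_cursor is the last key copied,
    or the old cursor if nothing was copied.
    """
    chunk = []
    last = cursor
    for key in sorted(omap):
        if cursor is not None and key <= cursor:
            continue  # already copied by an earlier chunk
        if len(chunk) >= max_keys:
            break  # check the budget before taking the next key
        chunk.append((key, omap[key]))
        last = key  # the cursor always names a key that was copied
    return chunk, last


omap = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
cursor, copied = None, []
while True:
    chunk, cursor = copy_chunk(omap, cursor, max_keys=2)
    if not chunk:
        break
    copied.extend(key for key, _ in chunk)
print(copied)
```

Because the budget is checked before a key is taken, the cursor never points at a key that was not copied, so no key is skipped when the next chunk resumes.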
11 years agoReplicatedPG::fill_in_copy_get: remove extraneous if statement
Samuel Just [Wed, 5 Mar 2014 01:03:37 +0000 (17:03 -0800)]
ReplicatedPG::fill_in_copy_get: remove extraneous if statement

This should leave the behavior unchanged.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoReplicatedPG::fill_in_copy_get: fix early return bug
Samuel Just [Tue, 4 Mar 2014 23:21:09 +0000 (15:21 -0800)]
ReplicatedPG::fill_in_copy_get: fix early return bug

This is not a leak: we are in an else block where cb must
be NULL.  The fix as introduced did not include braces on
the if causing the method to return unconditionally.

Fixes: #7604
Introduced in: 500206d809f0cd85cd99e4f0ec164bbf74f92c28
Reviewed-by: David Zafman <david.zafman@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge remote-tracking branch 'upstream/wip-7447' into firefly
Samuel Just [Wed, 5 Mar 2014 03:22:08 +0000 (19:22 -0800)]
Merge remote-tracking branch 'upstream/wip-7447' into firefly

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoECBackend,ReplicatedPG: delete temp if we didn't get the transaction 1367/head
Samuel Just [Mon, 3 Mar 2014 23:33:51 +0000 (15:33 -0800)]
ECBackend,ReplicatedPG: delete temp if we didn't get the transaction

We always send the transaction for operations on temp objects,
but if we didn't get the final transaction on the actual object,
we might end up failing to remove the temp object.  Thus, if
we get a sub op and don't have the transaction, just remove the
named temp objects.

Fixes: #7447
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoPGBackend/ECBackend: handle temp objects correctly
Samuel Just [Tue, 4 Mar 2014 01:08:10 +0000 (17:08 -0800)]
PGBackend/ECBackend: handle temp objects correctly

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoECMsgTypes: fix constructor temp_added/temp_removed ordering to match users
Samuel Just [Tue, 4 Mar 2014 17:16:08 +0000 (09:16 -0800)]
ECMsgTypes: fix constructor temp_added/temp_removed ordering to match users

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoReplicatedPG::finish_ctx: use correct snapdir prior version in events
Samuel Just [Tue, 4 Mar 2014 06:22:39 +0000 (22:22 -0800)]
ReplicatedPG::finish_ctx: use correct snapdir prior version in events

Fixes: #7595
Reviewed-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1360 from enovance/wip-brag
Loic Dachary [Tue, 4 Mar 2014 11:41:54 +0000 (12:41 +0100)]
Merge pull request #1360 from enovance/wip-brag

Fixes for ceph-brag

Reviewed-by: Loic Dachary <loic@dachary.org>
11 years agoMerge remote-tracking branch 'brag/master' into firefly 1360/head
Babu Shanmugam [Tue, 4 Mar 2014 08:46:38 +0000 (14:16 +0530)]
Merge remote-tracking branch 'brag/master' into firefly

Signed-off-by: Babu Shanmugam <anbu@enovance.com>
11 years agoMerge pull request #1352 from dachary/wip-7578
Sage Weil [Tue, 4 Mar 2014 05:43:39 +0000 (21:43 -0800)]
Merge pull request #1352 from dachary/wip-7578

common: -- support for env_to_vec

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1342 from ceph/wip-cache-add
Sage Weil [Tue, 4 Mar 2014 05:37:56 +0000 (21:37 -0800)]
Merge pull request #1342 from ceph/wip-cache-add

mon: add 'osd tier add-cache ...' command (DNM until after wip-tier-add)

Reviewed-by: Loic Dachary <loic@dachary.org>
11 years agoMerge pull request #1335 from ceph/wip-tier-add
Sage Weil [Tue, 4 Mar 2014 05:36:22 +0000 (21:36 -0800)]
Merge pull request #1335 from ceph/wip-tier-add

mon: prevent non-empty pools from being added as tiers

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #1358 from ceph/wip-2288
Gregory Farnum [Tue, 4 Mar 2014 05:19:40 +0000 (21:19 -0800)]
Merge pull request #1358 from ceph/wip-2288

mds: check projected xattr when handling setxattr

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agomon/OSDMonitor: fix race in 'osd tier remove ...' 1342/head
Sage Weil [Mon, 3 Mar 2014 19:35:28 +0000 (11:35 -0800)]
mon/OSDMonitor: fix race in 'osd tier remove ...'

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomon/OSDMonitor: fix some whitespace
Sage Weil [Mon, 3 Mar 2014 19:33:21 +0000 (11:33 -0800)]
mon/OSDMonitor: fix some whitespace

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomon/OSDMonitor: add 'osd tier add-cache <pool> <size>' command
Sage Weil [Mon, 3 Mar 2014 19:33:13 +0000 (11:33 -0800)]
mon/OSDMonitor: add 'osd tier add-cache <pool> <size>' command

This is a friendlier interface for setting up a cache tier with some
reasonable defaults (defined via config options).  This will simplify
the user experience and documentation.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomon/OSDMonitor: handle 'osd tier add ...' race/corner case
Sage Weil [Mon, 3 Mar 2014 19:32:48 +0000 (11:32 -0800)]
mon/OSDMonitor: handle 'osd tier add ...' race/corner case

If you have two racing requests to add two different pools as a tier, the
committed checks will pass but the proposals will conflict.  Recheck the
pending pools for the same conditions and wait for a commit if they
occur.

Reported-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosd: make default bloom hit set fpp configurable
Sage Weil [Mon, 3 Mar 2014 16:51:25 +0000 (08:51 -0800)]
osd: make default bloom hit set fpp configurable

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosd/ReplicatedPG: fix agent division by zero
Sage Weil [Sat, 1 Mar 2014 10:29:38 +0000 (02:29 -0800)]
osd/ReplicatedPG: fix agent division by zero

If the pool is empty we cannot divide by the object count.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoOSDMonitor: do not add non-empty tier pool unless forced 1335/head
Sage Weil [Tue, 4 Mar 2014 05:11:17 +0000 (21:11 -0800)]
OSDMonitor: do not add non-empty tier pool unless forced

In general, users should not use non-empty pools as new tiers or else
things can behave strangely:

 - the data sets are unrelated, so behavior will be... strange.
 - if the cache pool is not "new" and does not do the OMAP flag, the OSD
   will not know not to flush omap objects to an EC base tier
 - probably other random stuff I'm forgetting

Allow a user to shoot themselves in the foot with --force-nonempty.

Implements: #7457
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomds: check projected xattr when handling setxattr 1358/head
Yan, Zheng [Tue, 4 Mar 2014 02:26:58 +0000 (10:26 +0800)]
mds: check projected xattr when handling setxattr

Fixes: #2288
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agoMerge pull request #1354 from ceph/wip-7563
Samuel Just [Tue, 4 Mar 2014 01:05:53 +0000 (17:05 -0800)]
Merge pull request #1354 from ceph/wip-7563

Wip 7563

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #1355 from ceph/wip-osd-verbosity
Samuel Just [Tue, 4 Mar 2014 00:38:54 +0000 (16:38 -0800)]
Merge pull request #1355 from ceph/wip-osd-verbosity

osd: be a bit more verbose on startup

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog: tests for proc_replica_log/merge_log equivalence 1354/head
Samuel Just [Mon, 3 Mar 2014 01:31:38 +0000 (17:31 -0800)]
TestPGLog: tests for proc_replica_log/merge_log equivalence

We need the merge_log and proc_replica_log paths to result in the
same missing set.  This patch adds some machinery for specifying
a log merge scenario and comparing both paths to the same correct
result.  This machinery also makes it a bit easier to read and add
new tests.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog::proc_replica_log: adjust wonky test
Samuel Just [Sun, 2 Mar 2014 21:45:27 +0000 (13:45 -0800)]
TestPGLog::proc_replica_log: adjust wonky test

This test didn't quite make sense since the divergent entry
cannot be from a newer epoch.  It also didn't quite match the
diagram.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog::proc_replica_log: adjust to corrected proc_replica_log behavior
Samuel Just [Sun, 2 Mar 2014 21:44:03 +0000 (13:44 -0800)]
TestPGLog::proc_replica_log: adjust to corrected proc_replica_log behavior

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog::proc_replica_log: add prior_version to some entries
Samuel Just [Sun, 2 Mar 2014 21:43:15 +0000 (13:43 -0800)]
TestPGLog::proc_replica_log: add prior_version to some entries

Otherwise, the test logs are invalid.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoPGLog::proc_replica_log: _merge_divergent_entries based on truncated olog
Samuel Just [Sun, 2 Mar 2014 21:42:16 +0000 (13:42 -0800)]
PGLog::proc_replica_log: _merge_divergent_entries based on truncated olog

We can't merge using the primary's log since we haven't decided whether
to send them a complete log yet.  Thus, merge based on the truncated olog
rather than the primary's log.  This is a consequence of the division
between trimming divergent entries in peering/unfound search and sending
a complete log to actual members of the actingbackfill set in activate().
_merge_divergent_entries on the truncated log and add_next_event() on the
newer entries result in the same missing/log regardless of the order in
which they are performed.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoPG.h:PGLogEntryHandler: remove silly cant_rollback logic
Samuel Just [Sun, 2 Mar 2014 21:39:07 +0000 (13:39 -0800)]
PG.h:PGLogEntryHandler: remove silly cant_rollback logic

Also, we now call rollback in a reverse order, so there is no
need to reverse the entries again.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoPG,PGLog: replace _merge_old_entry with _merge_object_divergent_entries
Samuel Just [Sun, 2 Mar 2014 21:38:12 +0000 (13:38 -0800)]
PG,PGLog: replace _merge_old_entry with _merge_object_divergent_entries

The _merge_old_entry structure had trouble distinguishing between the
following cases:

missing: foo, 1,1
merge_old_entry modify 1,1 0,0
merge_old_entry modify 1,2 1,1

and
merge_old_entry modify 1,2 1,1

In the first case, we should end up with foo removed from missing
at the end.  In the second, we need foo added to missing at 1,1.
It's far simpler to present all of the divergent entries for a single
object at once.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog::merge_old_entry: ne.version cannot be oe.version
Samuel Just [Sun, 2 Mar 2014 05:41:55 +0000 (21:41 -0800)]
TestPGLog::merge_old_entry: ne.version cannot be oe.version

Otherwise, it would not be divergent!

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog::merge_old_entry: we no longer use merge_old_entry this way
Samuel Just [Sun, 2 Mar 2014 05:45:18 +0000 (21:45 -0800)]
TestPGLog::merge_old_entry: we no longer use merge_old_entry this way

This needs to be replaced with an equivalent test of
_merge_object_divergent_entries.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog:rewind_divergent_log: set prior_version for delete
Samuel Just [Sun, 2 Mar 2014 05:21:34 +0000 (21:21 -0800)]
TestPGLog:rewind_divergent_log: set prior_version for delete

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog: ignore merge_old_entry return value
Samuel Just [Sat, 1 Mar 2014 23:24:14 +0000 (15:24 -0800)]
TestPGLog: ignore merge_old_entry return value

No callers use the merge_old_entry return value.  _merge_divergent_entries
won't have one.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTestPGLog: not worth maintaining tests of assert behavior
Samuel Just [Sat, 1 Mar 2014 23:22:53 +0000 (15:22 -0800)]
TestPGLog: not worth maintaining tests of assert behavior

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1356 from ceph/wip-7458
David Zafman [Mon, 3 Mar 2014 22:47:38 +0000 (14:47 -0800)]
Merge pull request #1356 from ceph/wip-7458

osd: stray pg ref on shutdown

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoOSD,config_opts: log osd state changes at level 0 instead 1355/head
Samuel Just [Mon, 3 Mar 2014 21:53:54 +0000 (13:53 -0800)]
OSD,config_opts: log osd state changes at level 0 instead

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1341 from ceph/wip-osd-status
Sage Weil [Mon, 3 Mar 2014 19:21:11 +0000 (11:21 -0800)]
Merge pull request #1341 from ceph/wip-osd-status

osd: 'status' admin socket command

Reviewed-by: Loic Dachary <loic@dachary.org>
11 years agoosd: be a bit more verbose on startup
Sage Weil [Sat, 1 Mar 2014 09:32:29 +0000 (01:32 -0800)]
osd: be a bit more verbose on startup

load_pgs can take a while and it is nice to know what ceph-osd is doing
without cranking up logging.

Did a quick audit of dout(1)'s and made this the default.  This lets us
see basic OSD state changes (load_pgs, boot, active) at the default level.

At this point all osd state changes should be logged at level 1.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge branch 'wip-hint' into firefly
Ilya Dryomov [Mon, 3 Mar 2014 18:37:24 +0000 (20:37 +0200)]
Merge branch 'wip-hint' into firefly

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agolibrbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
Ilya Dryomov [Fri, 21 Feb 2014 14:34:14 +0000 (16:34 +0200)]
librbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op

In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order).  Backwards compatibility is taken care
of on the osd side.

"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once.  The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into.  To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
11 years agoFileStore: add option to cap alloc hint size
Ilya Dryomov [Fri, 21 Feb 2014 14:34:14 +0000 (16:34 +0200)]
FileStore: add option to cap alloc hint size

Add a new config option, filestore_max_alloc_hint_size, to cap
SETALLOCHINT hint size.  The unit is a byte, the default value is
1 megabyte.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
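
To make the interaction between the librbd hint (object size, 1 << order) and the FileStore cap concrete, here is a small sketch. It assumes that "cap" means taking the minimum of the requested hint and filestore_max_alloc_hint_size; the function name is hypothetical.

```python
def effective_alloc_hint(order, max_alloc_hint_size=1 << 20):
    """Return the allocation hint in bytes after capping.

    order: rbd object order, so the object size is 1 << order bytes.
    max_alloc_hint_size: assumed stand-in for filestore_max_alloc_hint_size
        (default 1 megabyte, as stated in the commit message).
    """
    expected_write_size = 1 << order
    return min(expected_write_size, max_alloc_hint_size)


# A 4 MiB object (order 22) is capped to the 1 MiB default.
print(effective_alloc_hint(22))
# A 512 KiB object (order 19) is below the cap and passes through.
print(effective_alloc_hint(19))
```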
11 years agoFileStore: introduce XfsFileStoreBackend class
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
FileStore: introduce XfsFileStoreBackend class

Introduce XfsFileStoreBackend class, currently the only filestore
backend implementing SETALLOCHINT op.  This commit adds a build-time
dependency on libxfs as xfs-specific ioctl (XFS_IOC_FSSETXATTR /
XFS_XFLAG_EXTSIZE) is used to implement the new set_alloc_hint()
method.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
11 years agoFileStore: refactor FS detection checks a bit
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
FileStore: refactor FS detection checks a bit

Refactor FS detection checks in FileStore::_detect_fs() so that they
look the same as the ones in FileStore::mkfs().  This is in preparation
for adding XfsFileStoreBackend class.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
11 years agoosd: add SETALLOCHINT operation
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
osd: add SETALLOCHINT operation

This is primarily for librbd/krbd's benefit and is supposed to combat
fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks.  We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."

SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.

xfs is hooked up in the subsequent commits.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
11 years agocommon: -- support for env_to_vec 1352/head
Loic Dachary [Mon, 3 Mar 2014 17:17:01 +0000 (18:17 +0100)]
common: -- support for env_to_vec

When CEPH_ARGS is parsed each side of the -- must be appended to the
corresponding side of the existing argument list. For instance when

   -a -b -- foo bar

is merged with a CEPH_ARGS containing

   -c -d -- frob nitz

it must become

   -a -b -c -d -- foo bar frob nitz

http://tracker.ceph.com/issues/7578 fixes #7578

Signed-off-by: Loic Dachary <loic@dachary.org>
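
The merge rule in this commit can be sketched in a few lines. This is an illustrative model of the behaviour described in the message, not the env_to_vec implementation; the helper names are hypothetical.

```python
def split_args(args):
    """Split an argument list at the first '--' into (options, positionals)."""
    if "--" in args:
        i = args.index("--")
        return args[:i], args[i + 1:]
    return list(args), []


def merge_args(argv, env_args):
    """Append each side of env_args to the matching side of argv."""
    opts, pos = split_args(argv)
    env_opts, env_pos = split_args(env_args)
    merged = opts + env_opts
    # Keep the separator if either input had one.
    if "--" in argv or "--" in env_args:
        merged += ["--"] + pos + env_pos
    return merged


argv = "-a -b -- foo bar".split()
ceph_args = "-c -d -- frob nitz".split()
print(" ".join(merge_args(argv, ceph_args)))
```

Run on the example from the commit message, this prints `-a -b -c -d -- foo bar frob nitz`.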
11 years agoRevert "ObjectCacher: remove unused target/max setters"
Josh Durgin [Mon, 3 Mar 2014 17:07:30 +0000 (09:07 -0800)]
Revert "ObjectCacher: remove unused target/max setters"

This reverts commit e1a49e5386c3ed4a6bc4870f01630349cb04d749.

11 years agoRevert "librbd: remove limit on number of objects in the cache"
Josh Durgin [Mon, 3 Mar 2014 17:03:29 +0000 (09:03 -0800)]
Revert "librbd: remove limit on number of objects in the cache"

Disabling this limit causes too much memory usage in some
workloads.

This reverts commit 0559d31db29ea83bdb6cec72b830d16b44e3cd35.

11 years agorgw: off-by-one in rgw_trim_whitespace()
Ray Lv [Wed, 26 Feb 2014 13:17:32 +0000 (21:17 +0800)]
rgw: off-by-one in rgw_trim_whitespace()

Fixes: #7543
Backport: dumpling

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Signed-off-by: Ray Lv <raylv@yahoo-inc.com>
11 years agoIn database delete Session.flush() has to be called appropriately, to avoid foreign...
Babu Shanmugam [Wed, 19 Feb 2014 12:54:46 +0000 (12:54 +0000)]
In database delete Session.flush() has to be called appropriately, to avoid foreign key conflicts in delete() request to the database

Signed-off-by: Babu Shanmugam <anbu@enovance.com>
11 years agoFollowing changes are made
Babu Shanmugam [Wed, 19 Feb 2014 12:43:53 +0000 (12:43 +0000)]
Following changes are made
1. Increased the String length for distro, version and os_desc columns in osds_info table
2. Corrected version information extraction in client/ceph-brag
3. Removed the version_id json entry when version list returned for UUID
4. Updated the README to reflect point 3

Signed-off-by: Babu Shanmugam <anbu@enovance.com>
11 years agoModified the String variables in db.py to be of fixed length to support databases...
Babu Shanmugam [Wed, 19 Feb 2014 12:01:11 +0000 (12:01 +0000)]
Modified the String variables in db.py to be of fixed length to support databases which don't have VARCHAR support

Signed-off-by: Babu Shanmugam <anbu@enovance.com>
11 years agoAdded an instruction in 'How to deploy' field in README.md
Babu Shanmugam [Mon, 17 Feb 2014 05:01:19 +0000 (05:01 +0000)]
Added an instruction in 'How to deploy' field in README.md

Signed-off-by: Babu Shanmugam <anbu@enovance.com>
11 years agoqa: workunits: cephtool: test 'osd bench' limits 1324/head 1351/head
Joao Eduardo Luis [Mon, 3 Mar 2014 15:28:04 +0000 (15:28 +0000)]
qa: workunits: cephtool: test 'osd bench' limits

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
11 years agoosd: 'status' admin socket command 1341/head
Sage Weil [Mon, 3 Mar 2014 15:03:01 +0000 (07:03 -0800)]
osd: 'status' admin socket command

Basic stuff, like what state is the OSD in, and what osdmap epoch are
we on.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1327 from dachary/wip-7423
Sage Weil [Mon, 3 Mar 2014 14:59:22 +0000 (06:59 -0800)]
Merge pull request #1327 from dachary/wip-7423

osd: do not attempt to read past the object size

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoosd: OSD: limit the value of 'size' and 'count' on 'osd bench'
Joao Eduardo Luis [Mon, 3 Mar 2014 14:40:07 +0000 (14:40 +0000)]
osd: OSD: limit the value of 'size' and 'count' on 'osd bench'

Otherwise, a high enough 'count' value will trigger all sorts of timeouts
on the OSD; a low enough 'size' value will have the same effect for a
high enough value of 'count' (even the default value may have ill effects
on the osd's behaviour).  Limiting these values does not fix how 'osd bench'
should behave, but keeps someone from inadvertently borking an OSD.

Four options have been added and the user may adjust them if he so
desires to play with the OSD's fate:

 - 'osd_bench_small_size_max_iops' [default: 100] defines the amount of
   expected IOPS for a small block size (i.e., <1MB).
 - 'osd_bench_large_size_max_throughput' [default: 100<<20] defines
   the expected throughput in B/s.  We assume 100MB/s.
 - 'osd_bench_max_block_size' [default: 64 << 20] caps the block size
   allowed.  We have defined 64 MB.
 - 'osd_bench_duration' [default: 30] caps the expected duration.  This
   value is used when calculating the maximum allowed 'count', and is
   not enforced as the maximum duration of the operation.  If other IO
   is under way, or 'osd bench' is somehow slowed down, 'osd bench' may
   go over this duration.  Adjusting this option does however allow the
   user to specify higher 'count' values for (e.g.) a small block size,
   as the operation is assumed to run over a longer time span.

These options attempt to avoid combinations of dangerous parameters.  For
instance, we limit the block size to 64 MB (by default) so that there is
no temptation to specify a large enough block size, along with a very small
'count', such that the end result is similar to specifying a big count with
a sane block size.

Fixes: 7248
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
11 years agoerasure-code: test rados put and get 1327/head
Loic Dachary [Sun, 2 Mar 2014 16:54:39 +0000 (17:54 +0100)]
erasure-code: test rados put and get

Check that rados put immediately followed by rados get retrieves exactly
the same content.

http://tracker.ceph.com/issues/7423 refs #7423

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agomon: prepend current directory to PATH for tests
Loic Dachary [Sun, 2 Mar 2014 16:53:08 +0000 (17:53 +0100)]
mon: prepend current directory to PATH for tests

So that binaries found in the source directory are always preferred to
installed binaries or scripts.

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoosd: helper to create an OSD for functional tests
Loic Dachary [Sun, 2 Mar 2014 16:48:25 +0000 (17:48 +0100)]
osd: helper to create an OSD for functional tests

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agomon: add mon-test-helpers.sh to EXTRA_DIST
Loic Dachary [Sun, 2 Mar 2014 16:47:15 +0000 (17:47 +0100)]
mon: add mon-test-helpers.sh to EXTRA_DIST

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoosd: do not attempt to read past the object size
Loic Dachary [Fri, 28 Feb 2014 12:57:20 +0000 (13:57 +0100)]
osd: do not attempt to read past the object size

When reading from a replicated pool, trying to read more than the object
size results in a short read that does not go beyond the object size. In
erasure coded pools, objects are padded and the read will return more
bytes than the object actually contains.
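
A minimal sketch of the idea, clamping the requested length to the
logical object size before the read is issued.  The function name and
signature are hypothetical and are not taken from the patch itself.

```cpp
#include <cstdint>

// Hypothetical helper: returns how many bytes a read of 'length' bytes at
// 'offset' may return for an object whose logical size is 'object_size'.
// A read starting at or past the end of the object yields zero bytes.
constexpr std::uint64_t clamp_read_length(std::uint64_t object_size,
                                          std::uint64_t offset,
                                          std::uint64_t length) {
  return offset >= object_size
             ? 0
             : (length > object_size - offset ? object_size - offset
                                              : length);
}
```

With the length clamped this way, padding added to an erasure coded
object cannot leak into the result, and the behaviour matches the short
read described for replicated pools.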

http://tracker.ceph.com/issues/7423 fixes #7423

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoMerge pull request #1344 from ceph/wip-7539
Sage Weil [Mon, 3 Mar 2014 05:06:01 +0000 (21:06 -0800)]
Merge pull request #1344 from ceph/wip-7539

Wip 7539

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1322 from ceph/wip-librados-end-iterator
Sage Weil [Sun, 2 Mar 2014 20:51:30 +0000 (12:51 -0800)]
Merge pull request #1322 from ceph/wip-librados-end-iterator

librados: fix ObjectIterator::operator= for the end iterator

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1337 from ceph/wip-fix-coverity-20140228
Sage Weil [Sun, 2 Mar 2014 03:56:45 +0000 (19:56 -0800)]
Merge pull request #1337 from ceph/wip-fix-coverity-20140228

Fix different issues found by Coverity

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1336 from ceph/wip-nfs-export
Sage Weil [Sun, 2 Mar 2014 03:54:23 +0000 (19:54 -0800)]
Merge pull request #1336 from ceph/wip-nfs-export

Wip nfs export

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoosd_types,PG: trim mod_desc for log entries to min size 1344/head
Samuel Just [Sat, 1 Mar 2014 22:33:11 +0000 (14:33 -0800)]
osd_types,PG: trim mod_desc for log entries to min size

In the event that mod_desc.bl contains pointers into a large
message buffer, we'd otherwise end up keeping around the entire
MOSDECSubOpWrite which created each log entry.
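
The retention problem can be illustrated with a generic sketch that does
not use Ceph's buffer classes: a small view that shares ownership of a
large backing buffer keeps the whole buffer alive until the viewed bytes
are copied into an allocation of their own.  The Slice type and
trim_to_min_size() below are hypothetical names used only for this
illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for a buffer pointer: a view of 'len' bytes at
// 'off' that keeps its entire backing store alive.
struct Slice {
  std::shared_ptr<const std::string> backing;
  std::size_t off;
  std::size_t len;
};

// Copy only the viewed bytes into a right-sized allocation, so the result
// no longer references the (possibly much larger) original buffer.
Slice trim_to_min_size(const Slice& s) {
  assert(s.backing && s.off + s.len <= s.backing->size());
  std::shared_ptr<const std::string> copy =
      std::make_shared<std::string>(s.backing->substr(s.off, s.len));
  return Slice{copy, 0, s.len};
}
```

Once every long-lived holder stores the trimmed slice instead of the
original view, the large buffer can be freed as soon as the message that
owns it goes away.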

Fixes: #7539
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoMOSDECSubOpWrite: drop transaction, log_entries in clear_buffers
Samuel Just [Sat, 1 Mar 2014 22:12:09 +0000 (14:12 -0800)]
MOSDECSubOpWrite: drop transaction, log_entries in clear_buffers

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoTrackedOp: clear_payload as well in unregister_inflight_op
Samuel Just [Sat, 1 Mar 2014 21:55:24 +0000 (13:55 -0800)]
TrackedOp: clear_payload as well in unregister_inflight_op

We want to minimize the cost of maintaining the historic ops.

Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoOpTracker: clarify that unregister_inflight_op is only called if enabled
Samuel Just [Sat, 1 Mar 2014 21:54:53 +0000 (13:54 -0800)]
OpTracker: clarify that unregister_inflight_op is only called if enabled

The !tracking_enabled branch actually had a leak which was unreachable
since the caller does the check for tracking_enabled.

Signed-off-by: Samuel Just <sam.just@inktank.com>