git.apps.os.sepia.ceph.com Git - ceph.git/log
13 years agoMerge remote-tracking branch 'upstream/wip-leveldb-iterators'
Samuel Just [Tue, 31 Jul 2012 20:51:49 +0000 (13:51 -0700)]
Merge remote-tracking branch 'upstream/wip-leveldb-iterators'

13 years agoPG,ReplicatedPG: clarify scrub state clearing
Samuel Just [Mon, 30 Jul 2012 20:43:51 +0000 (13:43 -0700)]
PG,ReplicatedPG: clarify scrub state clearing

scrub_clear_state takes care of clearing the SCRUB and REPAIR
flags.  Thus, PG::scrub() needn't clear them again since
any change that would have caused that if block to occur
would have triggered ReplicatedPG::on_change(), which also
clears the scrub reservations.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG::mark_clean(): queue_snap_trim if snap_trimq is not empty
Samuel Just [Mon, 30 Jul 2012 20:38:08 +0000 (13:38 -0700)]
PG::mark_clean(): queue_snap_trim if snap_trimq is not empty

Currently, we won't queue for snap trim until the next map
update.

Noticed while reviewing another patch.  This would result in
snaps not being trimmed until the next map update.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoReplicatedPG::snap_trimmer: requeue if scrub_block_writes
Samuel Just [Mon, 30 Jul 2012 20:36:39 +0000 (13:36 -0700)]
ReplicatedPG::snap_trimmer: requeue if scrub_block_writes

Otherwise, we do not continue snap_trimming once scrub is
complete.

Noticed while reviewing another patch.  This would result
in snaps not being trimmed again until the next map
update.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge branch 'wip-osd'
Sage Weil [Mon, 30 Jul 2012 17:49:44 +0000 (10:49 -0700)]
Merge branch 'wip-osd'

Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agoosd: initialize send_notify on pg load
Sage Weil [Sat, 28 Jul 2012 16:19:03 +0000 (09:19 -0700)]
osd: initialize send_notify on pg load

When the PG is loaded, we need to set send_notify if we are not the
primary.  Otherwise, if the PG does not go through
start_peering_interval() or experience a role change, we will never set
the flag, and so never tell the primary that we exist.  This can cause
problems if, for example, we have unfound objects that the primary needs,
although I'm sure there are other bad implications as well.

Fixes: #2866
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: replace STRAY bit with bool
Sage Weil [Sat, 28 Jul 2012 16:17:34 +0000 (09:17 -0700)]
osd: replace STRAY bit with bool

We were setting a bit in pg->state that is private to the non-primary
PG.  The other bits get shared with the mon etc, but this one didn't.

Replace it with a simple bool.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agotest: test_keyvaluedb_iterators: Test KeyValueDB implementations iterators
Joao Eduardo Luis [Wed, 18 Jul 2012 21:26:29 +0000 (22:26 +0100)]
test: test_keyvaluedb_iterators: Test KeyValueDB implementations iterators

This set of tests focuses on the expected behavior of LevelDBStore's
and KeyValueDBMemory's iterators.

We test a grand total of six use cases, each with several test units,
run against both the LevelDBStore and the in-memory mock (totalling 48
test units, plus two disabled by default):

 * Removing keys:
  - Using both the whole-space iterator and the whole-space snapshot
    iterator
  - Tests key removal while iterating the store, either by prefix or by
    removing specific (prefix,key) pairs

 * Setting keys:
  - Using both the whole-space iterator and the whole-space snapshot
    iterator
  - Tests key insertion while iterating the store
  - Tests value update while iterating the store
  - This use case has two disabled tests: one when setting keys, the
    other when updating values, both on LevelDBStore and using the
    whole-space iterator. They fail there, unlike with the in-memory
    mock implementation, because leveldb implicitly creates an
    iterator that reads from a snapshot instead of directly from the
    underlying store.

 * Using Upper/Lower Bounds:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests upper/lower bounds when the key, the prefix or both are empty
  - Tests upper/lower bounds when both the key and the prefix are set

 * Seeking:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests seeking to first and to last
  - Tests seeking to first and to last using a prefix

 * Key-Space Iteration:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests forward and backward iteration over the key-space

 * Empty Store:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests seeking and using bounds functions when the store is empty

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
13 years agoos: KeyValueDB: implement snapshot iterators
Joao Eduardo Luis [Mon, 23 Jul 2012 10:56:50 +0000 (11:56 +0100)]
os: KeyValueDB: implement snapshot iterators

Create a set of functions, to be implemented by derivative classes of
KeyValueDB, responsible for returning an iterator with strong
read-consistency guarantees. How this iterator is implemented, or what
backs it, is implementation specific, but it must guarantee that all
reads made using this iterator behave as if there had been no writes to
the store since the iterator was created.

For instance, LevelDBStore will back this iterator with a leveldb Snapshot,
while KeyValueDBMemory will perform a copy of its in-memory map.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
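
The copy-based approach described for KeyValueDBMemory can be sketched
roughly as follows. This is only an illustration, not the code from the
patch: MemStoreSketch, Snapshot and get_snapshot_iterator() are
hypothetical names, and the store is assumed to be a single ordered map
keyed by (prefix,key).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the in-memory mock store; not Ceph's class.
class MemStoreSketch {
public:
  typedef std::pair<std::string, std::string> RawKey;  // (prefix, key)
  typedef std::map<RawKey, std::string> Map;

  void set(const std::string& prefix, const std::string& key,
           const std::string& value) {
    db_[RawKey(prefix, key)] = value;
  }

  void rmkey(const std::string& prefix, const std::string& key) {
    db_.erase(RawKey(prefix, key));
  }

  // Snapshot iterator: owns a private copy of the map taken at creation
  // time, so writes made to the store afterwards are invisible to it.
  class Snapshot {
  public:
    explicit Snapshot(const Map& src) : copy_(src), it_(copy_.begin()) {}
    Snapshot(const Snapshot&) = delete;             // it_ points into copy_
    Snapshot& operator=(const Snapshot&) = delete;

    bool valid() const { return it_ != copy_.end(); }
    void next() { assert(valid()); ++it_; }
    const RawKey& raw_key() const { assert(valid()); return it_->first; }
    const std::string& value() const { assert(valid()); return it_->second; }

  private:
    Map copy_;
    Map::const_iterator it_;
  };

  std::shared_ptr<Snapshot> get_snapshot_iterator() const {
    return std::make_shared<Snapshot>(db_);
  }

  std::size_t size() const { return db_.size(); }

private:
  Map db_;
};
```

Copying the whole map is the simplest way to get the read-consistency
guarantee in a mock; a real backend would presumably rely on a cheaper
mechanism, such as the leveldb Snapshot mentioned above.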
13 years agoos: KeyValueDB: re-implement (prefix) iter in terms of whole-space iter
Joao Eduardo Luis [Mon, 23 Jul 2012 10:47:00 +0000 (11:47 +0100)]
os: KeyValueDB: re-implement (prefix) iter in terms of whole-space iter

In-a-nutshell-version: Create a whole-space iterator interface, and
implement the already existing, prefix-based iterator in terms of the
new whole-space iterator;

This patch introduces a significant change to the architecture of
KeyValueDB's iterator, although its interface remains the same.

Before this patch, KeyValueDB simply defined an interface for a
prefix-based iterator, to be implemented by derivative classes. Being
constrained to a prefix-based approach only makes sense when we know
beforehand which prefixes we want to iterate over. This approach didn't
work when one wanted to iterate over the whole key space without any
prior knowledge of the keys and their prefixes.

This patch introduces a new interface for a whole-space iterator, to be
implemented by derivative classes, which is prefix-independent. We also
define an abstract function to obtain this iterator, which must also be
implemented by the derivative class. With this interface in place, we are
then able to implement a prefix-dependent iterator in terms of the
whole-space iterator, which will be offered by the KeyValueDB class itself.

Furthermore, we implement these changes on LevelDBStore and KeyValueDBMemory,
the in-memory mock store, which leads to significant changes on both:

  * LevelDBStore
    - Substitute the previously existing LevelDBIteratorImpl, which
      followed a prefix-based iteration, for
      LevelDBWholeSpaceIteratorImpl, which now iterates over the whole
      key space of the store;

  * KeyValueDBMemory:
    - Substitute the previously existing MemIterator, which followed a
      prefix-based iteration, for WholeSpaceMemIterator, which now
      iterates over the whole key space of the in-memory mock store;
    - Change the in-memory mock store data structure. Previously, we
      used a map-of-maps, mapping prefixes to a key/value map; now we
      keep a single map, mapping (prefix,key) pairs to values.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
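
A minimal sketch of how prefix iteration can be layered on a whole-space
ordering, assuming the single (prefix,key) map described for the
in-memory mock. The type and function names (RawKeySketch,
WholeSpaceSketch, keys_with_prefix) are made up for illustration and do
not come from the patch.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> RawKeySketch;  // (prefix, key)
typedef std::map<RawKeySketch, std::string> WholeSpaceSketch;

// Collect the keys stored under one prefix by walking the whole-space map.
std::vector<std::string> keys_with_prefix(const WholeSpaceSketch& db,
                                          const std::string& prefix) {
  std::vector<std::string> out;
  // Pairs compare lexicographically, so (prefix, "") sorts at or before
  // every other entry that shares this prefix.
  WholeSpaceSketch::const_iterator it =
      db.lower_bound(RawKeySketch(prefix, std::string()));
  // Stop as soon as the whole-space walk leaves the prefix.
  for (; it != db.end() && it->first.first == prefix; ++it)
    out.push_back(it->first.second);
  return out;
}
```

The prefix-based view is then just "seek to the lower bound of the
prefix, and treat the iterator as invalid once the prefix changes".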
13 years agotest: workloadgen: Don't linearly iterate over a map to obtain a collection
Joao Eduardo Luis [Tue, 24 Jul 2012 20:53:20 +0000 (21:53 +0100)]
test: workloadgen: Don't linearly iterate over a map to obtain a collection

We were iterating over the collections map a certain number of times in
order to obtain the collection in that position. To avoid this kind of
behavior in a function that may be called a large number of times, and
that may iterate over a rather large map, we now keep the collection ids
in a vector. To obtain the collection at position X, we simply look up
the collection id at position X of the vector, and then obtain the
collection from the map using that id.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
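
The id-vector lookup described above might look something like this.
CollectionSketch and CollectionTable are hypothetical names, and the
sketch assumes positions follow insertion order, which the commit
message does not specify.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct CollectionSketch {
  std::string name;
};

class CollectionTable {
public:
  void add(int id, const std::string& name) {
    if (by_id_.find(id) == by_id_.end())
      ids_.push_back(id);  // remember the position of each new id
    by_id_[id].name = name;
  }

  // Position -> id is a constant-time vector access; id -> collection is
  // a regular map lookup. No linear walk over the map is needed.
  const CollectionSketch* get_at(std::size_t pos) const {
    if (pos >= ids_.size())
      return nullptr;
    std::map<int, CollectionSketch>::const_iterator it =
        by_id_.find(ids_[pos]);
    return it == by_id_.end() ? nullptr : &it->second;
  }

private:
  std::map<int, CollectionSketch> by_id_;
  std::vector<int> ids_;
};
```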
13 years agoosd: peering: make Incomplete a Peering substate
Sage Weil [Fri, 27 Jul 2012 23:03:26 +0000 (16:03 -0700)]
osd: peering: make Incomplete a Peering substate

This allows us to still catch changes in the prior set that would affect
our conclusions (that we are incomplete) and, when they happen, restart
peering.

Consider:
 - calc prior set, osd A is down
 - query everyone else, no good info
 - set down, go to Incomplete (previously WaitActingChange) state.
 - osd A comes back up (we do nothing)
 - osd A sends notify message with good info (we ignore)

By making this a Peering substate, we catch the Peering AdvMap reaction,
which will notice a prior set down osd is now up and move to Reset.

Fixes: #2860
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: peering: move to Incomplete when.. incomplete
Sage Weil [Fri, 27 Jul 2012 22:39:40 +0000 (15:39 -0700)]
osd: peering: move to Incomplete when.. incomplete

PG::choose_acting() may return false and *not* request an acting set change
if it can't find any suitable peers with enough info to recover.  In that
case, we should move to Incomplete, not WaitActingChange, just like we do
a bit lower in GetLog() if we have non-contiguous logs.  The state name is
more accurate, and this is also needed to fix bug #2860.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge remote-tracking branch 'gh/wip-msgr-masterbits'
Sage Weil [Sat, 28 Jul 2012 14:21:05 +0000 (07:21 -0700)]
Merge remote-tracking branch 'gh/wip-msgr-masterbits'

Reviewed-by: Greg Farnum <greg@inktank.com>
13 years agoconfig: send warnings to a ostream* argument
Sage Weil [Sat, 28 Jul 2012 14:39:27 +0000 (07:39 -0700)]
config: send warnings to a ostream* argument

We shouldn't always send these to stderr.  (Among other things, the
warning: prefix breaks the gitbuilder error detection.)

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agovstart.sh: apply extra conf after the defaults
Sage Weil [Fri, 27 Jul 2012 21:28:04 +0000 (14:28 -0700)]
vstart.sh: apply extra conf after the defaults

This lets you do e.g., -o 'debug ms = 100' and it will apply after
the default logging levels.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoconf: make dup lines override previous value
Sage Weil [Fri, 20 Jul 2012 15:55:21 +0000 (08:55 -0700)]
conf: make dup lines override previous value

If you put

[some section]
 foo = 1
 ...
 foo = 2

in a .conf file, make the second key override the first.

Generate a warning if a value is overridden to sidestep some user headbanging.

Signed-off-by: Sage Weil <sage@inktank.com>
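
The override-with-warning behavior can be sketched as below. This is not
Ceph's config parser: parse_section() and trim() are hypothetical
helpers, the input is assumed to be the body of one section, and the
optional std::ostream* for warnings is an assumption modelled on the
"send warnings to a ostream* argument" change.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>

static std::string trim(const std::string& s) {
  const char* ws = " \t\r\n";
  std::string::size_type b = s.find_first_not_of(ws);
  if (b == std::string::npos)
    return std::string();
  std::string::size_type e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Parse "key = value" lines; a later duplicate overrides the earlier
// value, and a warning is written to *warnings when that pointer is set.
std::map<std::string, std::string> parse_section(const std::string& text,
                                                 std::ostream* warnings) {
  std::map<std::string, std::string> out;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    std::string::size_type eq = line.find('=');
    if (eq == std::string::npos)
      continue;  // not a key/value line
    std::string key = trim(line.substr(0, eq));
    std::string val = trim(line.substr(eq + 1));
    if (key.empty())
      continue;
    std::map<std::string, std::string>::iterator it = out.find(key);
    if (it != out.end()) {
      if (warnings)
        *warnings << "warning: '" << key << "' overridden: '" << it->second
                  << "' -> '" << val << "'\n";
      it->second = val;
    } else {
      out[key] = val;
    }
  }
  return out;
}
```

Passing a null pointer silences the warnings, which is one way a caller
could keep them off stderr.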
13 years agomon: remove superfluous "can't delete except on master" comments
Sage Weil [Mon, 23 Jul 2012 21:39:10 +0000 (14:39 -0700)]
mon: remove superfluous "can't delete except on master" comments

That's what 'return false' means for preprocess_*().

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: make pool snap creation ops idempotent
Sage Weil [Mon, 23 Jul 2012 23:04:00 +0000 (16:04 -0700)]
mon: make pool snap creation ops idempotent

Return 0 if the snap already exists, or is already deleted.

Also, avoid updating the pg_pool if we are just waiting for the current
round to commit.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoobjecter: return ENOENT/EEXIST on pool snap delete/create
Sage Weil [Mon, 23 Jul 2012 23:01:14 +0000 (16:01 -0700)]
objecter: return ENOENT/EEXIST on pool snap delete/create

Do these checks on the client to mask monitor idempotency from the user.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolibrados: make snap create/destroy handle client-side errors
Sage Weil [Mon, 23 Jul 2012 23:29:07 +0000 (16:29 -0700)]
librados: make snap create/destroy handle client-side errors

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: check for invalid pool snap creates in preprocess_op, too
Sage Weil [Fri, 20 Jul 2012 00:35:57 +0000 (17:35 -0700)]
mon: check for invalid pool snap creates in preprocess_op, too

This avoids waiting for a paxos commit just to return an error.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoqa: simple tests for 'ceph osd create|rm' commands
Sage Weil [Fri, 20 Jul 2012 05:04:29 +0000 (22:04 -0700)]
qa: simple tests for 'ceph osd create|rm' commands

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: make 'osd rm ...' idempotent
Sage Weil [Fri, 20 Jul 2012 00:10:57 +0000 (17:10 -0700)]
mon: make 'osd rm ...' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoqa: simple test for pool create/delete commands
Sage Weil [Fri, 20 Jul 2012 04:56:56 +0000 (21:56 -0700)]
qa: simple test for pool create/delete commands

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: make pool creation idempotent
Sage Weil [Mon, 23 Jul 2012 23:03:46 +0000 (16:03 -0700)]
mon: make pool creation idempotent

Return success if the pool already exists.  Part of #2638.

Also, fix this so we wait until a creating pool is created before we reply.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: make pool removal idempotent
Sage Weil [Mon, 23 Jul 2012 23:03:36 +0000 (16:03 -0700)]
mon: make pool removal idempotent

Return success if pool does not exist.  Part of #2638.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoobjecter: make pool create/delete return EEXIST/ENOENT
Sage Weil [Mon, 23 Jul 2012 22:37:06 +0000 (15:37 -0700)]
objecter: make pool create/delete return EEXIST/ENOENT

Do these checks on the client side to mask monitor idempotency from
the user.

Signed-off-by: Sage Weil <sage@inktank.com>
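
A rough sketch of the split between an idempotent monitor and a client
that still reports EEXIST/ENOENT. All function names here are
hypothetical, and the "client view" is assumed to be the client's local
copy of the pool list; the real objecter works against the osdmap.

```cpp
#include <cerrno>
#include <set>
#include <string>

// Monitor side (sketch): idempotent, so a resent request still succeeds.
int mon_create_pool(std::set<std::string>& pools, const std::string& name) {
  pools.insert(name);  // no-op if the pool already exists
  return 0;
}

int mon_delete_pool(std::set<std::string>& pools, const std::string& name) {
  pools.erase(name);  // no-op if the pool does not exist
  return 0;
}

// Client side (sketch): check the local view first, so the user still
// sees the usual error codes despite the idempotent monitor.
int client_create_pool(const std::set<std::string>& client_view,
                       std::set<std::string>& mon_pools,
                       const std::string& name) {
  if (client_view.count(name))
    return -EEXIST;
  return mon_create_pool(mon_pools, name);
}

int client_delete_pool(const std::set<std::string>& client_view,
                       std::set<std::string>& mon_pools,
                       const std::string& name) {
  if (!client_view.count(name))
    return -ENOENT;
  return mon_delete_pool(mon_pools, name);
}
```

With this split, a request that is resent after a lost reply succeeds
on the monitor, while a user who asks for a pool that already exists in
their current view still gets EEXIST.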
13 years agolibrados: make pool create/destroy handle client-side errors
Sage Weil [Mon, 23 Jul 2012 23:23:20 +0000 (16:23 -0700)]
librados: make pool create/destroy handle client-side errors

Add tests!

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoobjecter: fix mon command resends
Sage Weil [Mon, 23 Jul 2012 21:41:17 +0000 (14:41 -0700)]
objecter: fix mon command resends

The monitor session is lossy.  Send these when the op is initiated, or
when we reconnect.  The timeout/cutoff was preventing ops from getting
resent if there was an ill-timed mon reset.

Backport: testing, stable/argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomutex: assert we are unlocked by the same thread that locked
Sage Weil [Tue, 17 Jul 2012 20:52:57 +0000 (13:52 -0700)]
mutex: assert we are unlocked by the same thread that locked

This only works for non-recursive locks.  (Which is probably all of them?)

Signed-off-by: Sage Weil <sage@inktank.com>
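
The ownership check can be illustrated with a small wrapper around
std::mutex. This is a sketch under assumptions, not Ceph's Mutex class:
OwnedMutexSketch and is_locked_by_me() are hypothetical, and like the
patch it only makes sense for non-recursive locks.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

class OwnedMutexSketch {
public:
  void lock() {
    m_.lock();
    // Record the owner only after the lock is actually held.
    owner_ = std::this_thread::get_id();
    locked_ = true;
  }

  void unlock() {
    // The thread releasing the lock must be the one that took it.
    assert(locked_ && owner_ == std::this_thread::get_id());
    locked_ = false;
    owner_ = std::thread::id();
    m_.unlock();
  }

  // Intended for assertions made by the calling thread about itself;
  // it is not synchronized against concurrent lock()/unlock() calls.
  bool is_locked_by_me() const {
    return locked_ && owner_ == std::this_thread::get_id();
  }

private:
  std::mutex m_;
  std::thread::id owner_;
  bool locked_ = false;
};
```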
13 years agocond: reorder asserts
Sage Weil [Tue, 17 Jul 2012 20:52:00 +0000 (13:52 -0700)]
cond: reorder asserts

Make the more specific checks assert before the less specific ones, so we
are more likely to crash with useful information.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: fixing sharing of past_intervals on backfill restart
Sage Weil [Thu, 26 Jul 2012 23:35:00 +0000 (16:35 -0700)]
osd: fixing sharing of past_intervals on backfill restart

We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne().  However, the backfill code was updating the
peer info (history) in the block preceding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.

Fix by checking dne() prior to the backfill block.  We still need to fill
in the message later because it isn't yet instantiated.

Fixes: #2849
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agofilestore: check for EIO in read path
Sage Weil [Fri, 27 Jul 2012 04:55:00 +0000 (21:55 -0700)]
filestore: check for EIO in read path

Check for EIO in read methods and helpers.  Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.

The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agofilestore: add 'filestore fail eio' option, default true
Sage Weil [Thu, 26 Jul 2012 16:07:46 +0000 (09:07 -0700)]
filestore: add 'filestore fail eio' option, default true

By default we will assert/fail/crash on EIO from the underlying fs.  We
already do this in the write path, but not the read path, or in various
internal infrastructure.

Signed-off-by: Sage Weil <sage@inktank.com>
Conflicts:

src/os/FileStore.cc

13 years agolibrbd: fix id initialization in new format
Josh Durgin [Thu, 26 Jul 2012 22:27:49 +0000 (15:27 -0700)]
librbd: fix id initialization in new format

48bd839b1e25b063c675416a8f6233463f1af115 should have included this.
I misread it due to the use of bid instead of id when generating
the object prefix.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agomon: set a configurable max osd cap
Yehuda Sadeh [Thu, 26 Jul 2012 20:14:40 +0000 (13:14 -0700)]
mon: set a configurable max osd cap

Don't allow setting a higher osd num through the
ceph control util.

Fixes: #2752
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agodoc: updates to fix problem with ceph-cookbooks appearing in chef-server.
John Wilkins [Wed, 25 Jul 2012 22:57:58 +0000 (15:57 -0700)]
doc: updates to fix problem with ceph-cookbooks appearing in chef-server.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoosd: generate past intervals in parallel on boot
Sage Weil [Wed, 25 Jul 2012 17:57:35 +0000 (10:57 -0700)]
osd: generate past intervals in parallel on boot

Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while.  This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.

On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range.  The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoosd: move calculation of past_interval range into helper
Sage Weil [Wed, 25 Jul 2012 17:58:07 +0000 (10:58 -0700)]
osd: move calculation of past_interval range into helper

PG::generate_past_intervals() first calculates the range over which it
needs to generate past intervals.  Do this in a helper function.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Conflicts:

src/osd/PG.cc

13 years agoosd: fix map epoch boot condition
Sage Weil [Wed, 25 Jul 2012 17:58:28 +0000 (10:58 -0700)]
osd: fix map epoch boot condition

We only want to join the cluster if we can catch up to the latest
osdmap with a small number of maps, in this case a single map message.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Conflicts:

src/osd/OSD.cc

13 years agoosd: avoid misc work before we're active
Sage Weil [Wed, 25 Jul 2012 03:54:11 +0000 (20:54 -0700)]
osd: avoid misc work before we're active

If we're booting, we shouldn't scrub, or send reports to the monitor,
or send heartbeats, or any of that.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: ignore pgtemp messages from down osds
Sage Weil [Wed, 25 Jul 2012 03:18:01 +0000 (20:18 -0700)]
mon: ignore pgtemp messages from down osds

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: ignore osd_alive messages from down osds
Sage Weil [Wed, 25 Jul 2012 03:16:04 +0000 (20:16 -0700)]
mon: ignore osd_alive messages from down osds

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoadmin_socket: json output, always
Sage Weil [Tue, 24 Jul 2012 21:53:06 +0000 (14:53 -0700)]
admin_socket: json output, always

If the perfcounters stuff were refactored to use the Formatter, we could
put the JSONFormatter in the admin_socket code and make this a bit less
annoying.  Later.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoadmin_socket: dump config in json; add test
Sage Weil [Wed, 25 Jul 2012 00:23:03 +0000 (17:23 -0700)]
admin_socket: dump config in json; add test

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Wed, 25 Jul 2012 00:22:50 +0000 (17:22 -0700)]
Merge branch 'next'

13 years agoconfig: fix 'config set' admin socket command
Sage Weil [Tue, 24 Jul 2012 20:53:03 +0000 (13:53 -0700)]
config: fix 'config set' admin socket command

Fixes: #2832
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Tue, 24 Jul 2012 18:49:41 +0000 (11:49 -0700)]
Merge branch 'next'

13 years agoosd: fix pg log zeroing
Sage Weil [Tue, 24 Jul 2012 18:02:37 +0000 (11:02 -0700)]
osd: fix pg log zeroing

Zero the right number of bytes.  Fixes a bug where we clobber legit log
data.  Fortunately this is only triggered with osd preserve pg log = false,
which was not the default until recently in master.

Fixes: #2799
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
13 years agoMerge branch 'wip-2763'
Yehuda Sadeh [Tue, 24 Jul 2012 17:10:22 +0000 (10:10 -0700)]
Merge branch 'wip-2763'

13 years agoWireshark dissector updated, work with the current development tree of wireshark...
Pierre Rognant [Wed, 25 Apr 2012 14:23:50 +0000 (16:23 +0200)]
Wireshark dissector updated; it works with the current development tree of wireshark. The way I patched it is not really clean, but it can be useful if some people quickly need to inspect ceph network flows.

13 years agowireshar/ceph/packet-ceph.c: fix eol
Yehuda Sadeh [Tue, 17 Jul 2012 17:15:27 +0000 (10:15 -0700)]
wireshar/ceph/packet-ceph.c: fix eol

Removing extra char from dos eol format.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoos: KeyValueDB: Add virtual raw_key() function to return (prefix,key) pair
Joao Eduardo Luis [Mon, 23 Jul 2012 18:48:43 +0000 (19:48 +0100)]
os: KeyValueDB: Add virtual raw_key() function to return (prefix,key) pair

If we were to use solely the key() function, whenever we had a key with,
say, prefix 'Foo' and key 'Bar', the key() function would return something
similar to 'Foo<separator>Bar'. Therefore, obtaining the prefix and the key
would require one to be aware of the separator used, and, since that is
implementation specific, we can't rely on such prior knowledge.

This new function must then be implemented by any derivative class of
KeyValueDB, and is expected to return a pair (prefix,key) for the
current iterator's position -- the key() function should behave as
previously, returning only the 'key' component of the pair.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
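
To illustrate why callers should not have to know the separator, here is
a small sketch of a backend that stores combined keys and exposes both
views. The '\0' separator and the helper names are assumptions made for
this example only; the commit message deliberately leaves the separator
implementation specific.

```cpp
#include <string>
#include <utility>

// Hypothetical separator, private to this sketch of a backend.
static const char kSep = '\0';

// How the backend might store a (prefix, key) pair as a single string.
std::string combine(const std::string& prefix, const std::string& key) {
  std::string out(prefix);
  out.push_back(kSep);
  out.append(key);
  return out;
}

// raw_key(): recover the (prefix, key) pair from the stored form.
// Assumes the prefix itself never contains the separator.
std::pair<std::string, std::string> split_raw_key(const std::string& stored) {
  std::string::size_type pos = stored.find(kSep);
  if (pos == std::string::npos)
    return std::make_pair(std::string(), stored);
  return std::make_pair(stored.substr(0, pos), stored.substr(pos + 1));
}

// key(): only the 'key' component of the pair.
std::string key_only(const std::string& stored) {
  return split_raw_key(stored).second;
}
```

Only the backend touches kSep; code that consumes the iterator sees the
pair and never needs to parse the stored string.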
13 years agoos: KeyValueDB: allow finer-grained control of transaction operations
Joao Eduardo Luis [Tue, 24 Jul 2012 01:23:01 +0000 (02:23 +0100)]
os: KeyValueDB: allow finer-grained control of transaction operations

This patch introduces the possibility of using single key/value
modification operations into the transaction interface.

Until now, any 'set' or 'rmkeys' operations required a map of keys to be
provided to the function, which made the task of removing or setting a
bunch of keys easier. Doing these same operations for a single key,
however, would entail creating a map with a single key.

Instead, this patch adds two new virtual abstract functions, to be
implemented by derivative classes, which set or remove one single
key/value, and we then implement the map-based, existing functions in
terms of these new functions.

We also update the derivative classes of KeyValueDB in order to reflect
these changes (i.e., LevelDBStore and KeyValueDBMemory).

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
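
The layering of the map-based operations on top of the single-key ones
can be sketched like this. TransactionSketch and MemTransactionSketch
are hypothetical classes, and for brevity the sketch applies each
operation immediately instead of batching it until submission as a real
transaction would.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

class TransactionSketch {
public:
  virtual ~TransactionSketch() {}

  // Single-key primitives that every backend must implement.
  virtual void set(const std::string& prefix, const std::string& key,
                   const std::string& value) = 0;
  virtual void rmkey(const std::string& prefix, const std::string& key) = 0;

  // Map-based helpers expressed in terms of the primitives above.
  void set(const std::string& prefix,
           const std::map<std::string, std::string>& kv) {
    for (std::map<std::string, std::string>::const_iterator it = kv.begin();
         it != kv.end(); ++it)
      set(prefix, it->first, it->second);
  }

  void rmkeys(const std::string& prefix, const std::set<std::string>& keys) {
    for (std::set<std::string>::const_iterator it = keys.begin();
         it != keys.end(); ++it)
      rmkey(prefix, *it);
  }
};

class MemTransactionSketch : public TransactionSketch {
public:
  using TransactionSketch::set;  // keep the map-based overload visible

  void set(const std::string& prefix, const std::string& key,
           const std::string& value) override {
    db[std::make_pair(prefix, key)] = value;
  }

  void rmkey(const std::string& prefix, const std::string& key) override {
    db.erase(std::make_pair(prefix, key));
  }

  std::map<std::pair<std::string, std::string>, std::string> db;
};
```

A backend then only has to get the two primitives right; the bulk
variants come for free from the base class.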
13 years agodoc: update information about stable vs development releases
Sage Weil [Tue, 24 Jul 2012 00:39:12 +0000 (17:39 -0700)]
doc: update information about stable vs development releases

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolibrbd: replace assign_bid with client id and random number
Josh Durgin [Mon, 23 Jul 2012 21:05:53 +0000 (14:05 -0700)]
librbd: replace assign_bid with client id and random number

The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.

This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.

Keep the server side assign_bid around in case there are old clients
still using it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
13 years agoosd: fix ACK ordering on resent ops
Sage Weil [Mon, 23 Jul 2012 23:51:03 +0000 (16:51 -0700)]
osd: fix ACK ordering on resent ops

The wait_for_ondisk handling fixed COMMIT ordering, but the ACKs need to
go back in the same order too.  For example:

 - op A is queued
 - client disconnects, both ACK and COMMIT replies are lost
 - client reconnects
 - op A and B are sent
 - op A is queued
 - op B is applied, ACK is sent
 - op A and B COMMITs are sent
 -> client's ack callbacks will see B and then A.

Fix this by creating a waiting_for_ack queue as well, and sending ACK
responses as needed.  Also handle the case where the ACK should be sent
immediately when the retry event is received.

Fixes: #2823
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
13 years agorados::cls::lock: move api types into namespace
Yehuda Sadeh [Mon, 23 Jul 2012 23:01:32 +0000 (16:01 -0700)]
rados::cls::lock: move api types into namespace

By popular demand, moved public api into namespace. This
required some changes to ceph_dencoder to get some template
annoyance working.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoMerge tag 'v0.49'
Sage Weil [Mon, 23 Jul 2012 19:43:19 +0000 (12:43 -0700)]
Merge tag 'v0.49'

v0.49

13 years agov0.49 v0.49
Sage Weil [Sat, 21 Jul 2012 06:26:56 +0000 (23:26 -0700)]
v0.49

13 years agomon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS
Sage Weil [Mon, 23 Jul 2012 17:47:10 +0000 (10:47 -0700)]
mon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS

This ensures that when a new osd reclaims that id it behaves as if it were
really new.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomkcephfs: add sync between btrfs scan and mount
Sage Weil [Mon, 23 Jul 2012 16:21:09 +0000 (09:21 -0700)]
mkcephfs: add sync between btrfs scan and mount

This appears to fix problems with mount failing for at least one user.

Reported-by: Paul Pettigrew <Paul.Pettigrew@mach.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agocrush: fix name map encoding
Sage Weil [Sat, 21 Jul 2012 16:15:06 +0000 (09:15 -0700)]
crush: fix name map encoding

We screwed up and encoded using the native 'int' type instead of int32_t.
That means people have systems encoding this as both 32 and 64 bit,
depending on their architecture.  This could be worse: x86_64 still has a
32-bit int (at least in my environment).

In any case, mixing both word sizes in their clusters is broken as a
result, with the exception of the kernel code, which doesn't decode this
part of the map and will tolerate differently-sized servers.

Fix this by:

 * encoding using int32_t now
 * decoding either 32-bit or 64-bit values, by assuming that the strings
   will always be non-empty.  This appears to be the case.

However:

 * any cluster with 64-bit ints must upgrade all at once, or else the new
   code will start encoding 32-bit values and the old code will be
   confused.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
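
The "decode either width by assuming non-empty strings" trick can be
sketched as follows. The wire layout used here (little-endian count,
then key, string length and bytes per entry) and all function names are
assumptions for illustration, and the wide encoder only handles
non-negative keys. The point is the heuristic: if the word read as the
string length is zero, it must have been the upper half of a 64-bit key.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::vector<uint8_t> Buf;

void put_u32(Buf& b, uint32_t v) {
  for (int i = 0; i < 4; ++i)
    b.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xff));
}

uint32_t get_u32(const Buf& b, std::size_t& off) {
  if (off > b.size() || b.size() - off < 4)
    throw std::out_of_range("short buffer");
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v |= static_cast<uint32_t>(b[off + i]) << (8 * i);
  off += 4;
  return v;
}

// wide_keys == true mimics the old 64-bit encoding (non-negative keys
// only in this sketch); false is the fixed int32_t encoding.
void encode_name_map(const std::map<int32_t, std::string>& m, bool wide_keys,
                     Buf& out) {
  put_u32(out, static_cast<uint32_t>(m.size()));
  for (std::map<int32_t, std::string>::const_iterator it = m.begin();
       it != m.end(); ++it) {
    put_u32(out, static_cast<uint32_t>(it->first));
    if (wide_keys)
      put_u32(out, 0);  // upper 32 bits of a non-negative 64-bit key
    put_u32(out, static_cast<uint32_t>(it->second.size()));
    out.insert(out.end(), it->second.begin(), it->second.end());
  }
}

std::map<int32_t, std::string> decode_name_map(const Buf& in) {
  std::map<int32_t, std::string> m;
  std::size_t off = 0;
  uint32_t n = get_u32(in, off);
  while (n--) {
    int32_t key = static_cast<int32_t>(get_u32(in, off));
    uint32_t len = get_u32(in, off);
    if (len == 0)
      len = get_u32(in, off);  // that was the key's upper half; reread
    if (off > in.size() || in.size() - off < len)
      throw std::out_of_range("short buffer");
    m[key].assign(in.begin() + static_cast<std::ptrdiff_t>(off),
                  in.begin() + static_cast<std::ptrdiff_t>(off + len));
    off += len;
  }
  return m;
}
```

The heuristic would misread an entry whose string really is empty,
which is why the fix depends on names always being non-empty.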
13 years agoosd/OpTracker: fix use-after-free
Sage Weil [Sat, 21 Jul 2012 15:24:37 +0000 (08:24 -0700)]
osd/OpTracker: fix use-after-free

And formatting.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOpRequest,OSD: track recent slow ops
Samuel Just [Fri, 20 Jul 2012 00:43:17 +0000 (17:43 -0700)]
OpRequest,OSD: track recent slow ops

This should be helpful while investigating slow performance.

OpRequests now track events with timestamp in addition
to dumping them to the log.  OpHistory keeps up to a
configurable number of the slowest ops over a configurable
recent time interval.  The admin socket interface for the OSD
now has a dump_historic_ops command which dumps the stored
slow ops.

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
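
One way to keep "up to N of the slowest ops within a recent window" is
sketched below. OpRecord and OpHistorySketch are hypothetical, times are
plain doubles in seconds, and the linear scans are chosen for brevity
rather than taken from the patch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct OpRecord {
  double arrived;   // when the op arrived
  double duration;  // how long it took
  std::string desc;
};

class OpHistorySketch {
public:
  OpHistorySketch(std::size_t max_ops, double max_age)
      : max_ops_(max_ops), max_age_(max_age) {}

  void insert(double now, const OpRecord& op) {
    ops_.push_back(op);
    // Drop ops that have fallen out of the recent time window.
    ops_.erase(std::remove_if(ops_.begin(), ops_.end(),
                              [&](const OpRecord& o) {
                                return now - o.arrived > max_age_;
                              }),
               ops_.end());
    // Over the cap: evict the fastest op so the slowest ones survive.
    while (ops_.size() > max_ops_) {
      ops_.erase(std::min_element(
          ops_.begin(), ops_.end(),
          [](const OpRecord& a, const OpRecord& b) {
            return a.duration < b.duration;
          }));
    }
  }

  // Roughly what a dump_historic_ops-style command could report.
  std::vector<OpRecord> dump_slowest_first() const {
    std::vector<OpRecord> out(ops_);
    std::sort(out.begin(), out.end(),
              [](const OpRecord& a, const OpRecord& b) {
                return a.duration > b.duration;
              });
    return out;
  }

private:
  std::size_t max_ops_;
  double max_age_;
  std::vector<OpRecord> ops_;
};
```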
13 years agoMerge branch 'next'
Samuel Just [Fri, 20 Jul 2012 21:32:44 +0000 (14:32 -0700)]
Merge branch 'next'

13 years agotest/store_test.cc: verify collection_list_partial results are sorted
Samuel Just [Fri, 20 Jul 2012 20:09:39 +0000 (13:09 -0700)]
test/store_test.cc: verify collection_list_partial results are sorted

Synthetic test now also varies snapshots and uses a small variety of
hashes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agocls_lock: cls_lock_id_t -> cls_lock_locker_id_t
Yehuda Sadeh [Fri, 20 Jul 2012 20:41:51 +0000 (13:41 -0700)]
cls_lock: cls_lock_id_t -> cls_lock_locker_id_t

Renamed type to make more sense.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agocls_lock: document lock properties
Yehuda Sadeh [Fri, 20 Jul 2012 20:28:19 +0000 (13:28 -0700)]
cls_lock: document lock properties

Added some comments about different lock properties.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agocls_log: update a comment
Yehuda Sadeh [Fri, 20 Jul 2012 20:16:05 +0000 (13:16 -0700)]
cls_log: update a comment

Was missing output param description.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agorados: lock info keeps expiration, not duration
Yehuda Sadeh [Fri, 20 Jul 2012 20:11:54 +0000 (13:11 -0700)]
rados: lock info keeps expiration, not duration

We pass duration in the request, but internally we keep
the expiration.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agorados tool: add advisory lock control commands
Yehuda Sadeh [Fri, 20 Jul 2012 20:00:43 +0000 (13:00 -0700)]
rados tool: add advisory lock control commands

Can now lock, break lock, list locks and show lock
info.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agocls_lock: objclass for advisory locking
Yehuda Sadeh [Fri, 20 Jul 2012 19:59:07 +0000 (12:59 -0700)]
cls_lock: objclass for advisory locking

This provides an objclass to create and manipulate advisory
locks, along with a client api to control it. A lock
may either be exclusively locked or shared among multiple
lockers. A locker is identified by the rados client name and
by a cookie string.
A lock may be assigned a tag that every operation on that
lock should use. A lock can be unlocked by the client that locked
it, or may be broken by other clients.
When a non-zero lock duration is assigned to a lock by a locker,
that locker expires after that time duration.
A lock may have a description.
Locks on a specific object can be listed. Lockers of a specific
lock can be enumerated (by get_info).
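
The rules above can be modelled with a small in-memory sketch. This is only an illustration of the semantics, not the objclass itself: `AdvisoryLock`, `LockType`, `LockerId`, and the method names are hypothetical, descriptions and expiration are left out, and rejecting a second lock request from a locker that already holds the lock is an assumption.

```cpp
#include <set>
#include <string>
#include <utility>

enum class LockType { None, Exclusive, Shared };

// A locker is identified by (rados client name, cookie string).
using LockerId = std::pair<std::string, std::string>;

struct AdvisoryLock {
  LockType type = LockType::None;
  std::string tag;
  std::set<LockerId> lockers;

  // Take the lock. Fails if the tag does not match the one in use, if the
  // lock is held exclusively, or if an exclusive lock is requested while
  // other lockers exist.
  bool lock(const LockerId& who, LockType want, const std::string& t) {
    if (want == LockType::None) return false;
    if (!lockers.empty()) {
      if (t != tag) return false;
      if (type == LockType::Exclusive || want == LockType::Exclusive) return false;
      if (lockers.count(who)) return false;  // assumption: no re-lock
    }
    type = want;
    tag = t;
    lockers.insert(who);
    return true;
  }

  // Release by the locker itself; only an existing locker can unlock.
  bool unlock(const LockerId& who) {
    if (lockers.erase(who) == 0) return false;
    if (lockers.empty()) type = LockType::None;
    return true;
  }

  // Breaking removes some other client's locker entry.
  bool break_lock(const LockerId& victim) { return unlock(victim); }
};
```

In this model, enumerating `lockers` plays the role of get_info.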

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoobjclass: add api calls to get/set xattrs
Yehuda Sadeh [Fri, 20 Jul 2012 19:55:55 +0000 (12:55 -0700)]
objclass: add api calls to get/set xattrs

Added the following functions:
  cls_cxx_getxattr
  cls_cxx_getxattrs
  cls_cxx_setxattr

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoos/HashIndex: use set<pair<string, hobject_t>> rather than multimap
Samuel Just [Fri, 20 Jul 2012 19:00:42 +0000 (12:00 -0700)]
os/HashIndex: use set<pair<string, hobject_t>> rather than multimap

Multimap does not make any guarantees about ordering of different
values with the same key.  list_by_hash, however, assumes that
the iterator order matches hobject_t order.  Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.
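
The ordering difference can be seen in a small sketch. It uses `int` as a stand-in for hobject_t, and the helper names are hypothetical; both are assumptions made for illustration only. Among equal keys a multimap never sorts by value (C++11 specifies insertion order; earlier standards leave it unspecified), while a set of pairs compares the second element whenever the first elements are equal.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stand-in for hobject_t: any totally ordered type works for the sketch.
using Obj = int;
using Entry = std::pair<std::string, Obj>;

// Values in multimap iteration order: sorted by key only.
std::vector<Obj> values_in_multimap_order(const std::vector<Entry>& entries) {
  std::multimap<std::string, Obj> m;
  for (const auto& e : entries) m.insert(e);
  std::vector<Obj> out;
  for (const auto& kv : m) out.push_back(kv.second);
  return out;
}

// Values in set<pair> iteration order: sorted by key, then by value.
std::vector<Obj> values_in_set_order(const std::vector<Entry>& entries) {
  std::set<Entry> s(entries.begin(), entries.end());
  std::vector<Obj> out;
  for (const auto& e : s) out.push_back(e.second);
  return out;
}
```

Feeding both helpers the same entries with a repeated key shows that only the set version returns the values for that key in sorted order regardless of insertion order.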

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agomon: shut up about sessionless MPGStats messages
Sage Weil [Fri, 20 Jul 2012 01:00:25 +0000 (18:00 -0700)]
mon: shut up about sessionless MPGStats messages

If the mon gets a reset on the client connection, it clears the session
on the connection.  This is perfectly normal to see.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: clean up boot method names
Sage Weil [Fri, 20 Jul 2012 04:27:20 +0000 (21:27 -0700)]
osd: clean up boot method names

Prefix subsequent steps with _.  Better names.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoosd: defer boot if heartbeatmap indicates we are unhealthy
Sage Weil [Fri, 20 Jul 2012 04:27:37 +0000 (21:27 -0700)]
osd: defer boot if heartbeatmap indicates we are unhealthy

If the OSD is bogged down or unresponsive, we should not try to join
the cluster.  This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).

Fixes: #2502
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Fri, 20 Jul 2012 03:22:35 +0000 (20:22 -0700)]
Merge branch 'next'

Conflicts:
src/include/ceph_features.h

13 years agoosd/mon: subscribe (onetime) to pg creations on connect
Sage Weil [Thu, 19 Jul 2012 23:47:23 +0000 (16:47 -0700)]
osd/mon: subscribe (onetime) to pg creations on connect

Ask the monitor for pending pg creations each time we connect.

Normally, this is a freebie check.  If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it.  Specifically:

 - osd is hunting for a monitor, but isn't yet connected
 - new pgs are created
 - send_pg_creates() sends out create messages, but osd does not get it
 - osd finally connects to a mon

Fixes: #2151 (though the bug description is bad)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agomon: track pg creations by osd
Sage Weil [Wed, 18 Jul 2012 21:54:11 +0000 (14:54 -0700)]
mon: track pg creations by osd

Track the pending pg creations by osd, and use a helper to send out those
messages.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoRevert "rbd: fix usage for snap commands"
Sage Weil [Thu, 19 Jul 2012 23:45:07 +0000 (16:45 -0700)]
Revert "rbd: fix usage for snap commands"

This reverts commit 42de6873f9ca33fc20e70176d9a422635a6f0152.

Actually, these are fine!  Dan made them all kinds of fancy.

13 years agorbd: fix usage for snap commands
Sage Weil [Thu, 19 Jul 2012 23:48:18 +0000 (16:48 -0700)]
rbd: fix usage for snap commands

Snap commands take '--snap <snapname> <imagename>'.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodoc: add missing dependencies to README
Mike Ryan [Thu, 19 Jul 2012 18:18:19 +0000 (11:18 -0700)]
doc: add missing dependencies to README

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
13 years agoadd CRUSH_TUNABLES feature bit
Sage Weil [Thu, 19 Jul 2012 02:49:58 +0000 (19:49 -0700)]
add CRUSH_TUNABLES feature bit

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOSD::handle_osd_map: don't lock pgs while advancing maps
Samuel Just [Wed, 18 Jul 2012 22:37:28 +0000 (15:37 -0700)]
OSD::handle_osd_map: don't lock pgs while advancing maps

We no longer do anything with the pgs here.  PG map
advancing is now handled in OSD::advance_pg asynchronously.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoosd: add osd_debug_drop_pg_create_{probability,duration} options
Sage Weil [Wed, 18 Jul 2012 19:20:24 +0000 (12:20 -0700)]
osd: add osd_debug_drop_pg_create_{probability,duration} options

This will let us exercise more of the pg creation code.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOSD: write_if_dirty during get_or_create_pg after handle_create
Samuel Just [Wed, 18 Jul 2012 19:48:09 +0000 (12:48 -0700)]
OSD: write_if_dirty during get_or_create_pg after handle_create

In the case that the pg is newly created, we will activate during
that call, so the info and log will be dirty.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoOSD: actually send queries during handle_pg_create
Samuel Just [Wed, 18 Jul 2012 18:31:09 +0000 (11:31 -0700)]
OSD: actually send queries during handle_pg_create

During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context.  However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.

Part of #2771.  Without the queries being sent,
can_create_pg will never become true.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge branch 'next'
Josh Durgin [Wed, 18 Jul 2012 19:58:47 +0000 (12:58 -0700)]
Merge branch 'next'

13 years agoobjecter: always resend linger registrations
Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]
objecter: always resend linger registrations

If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request.  The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch.  This in turn will break the watch (i.e., notifies won't
get delivered).

Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.

 * track the tid of the registration op for each LingerOp
 * mark registration ops as should_resend=false; cancel as needed
 * when we send a new registration op, cancel the old one to ensure we
   ignore the reply.  This is needed because we resend linger ops on any
   pg change, not just a primary change.
 * drop the first_send arg to send_linger(), as we can now infer that
   from register_tid == 0.
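
The bookkeeping in the list above can be sketched roughly as follows. This is a toy model rather than the Objecter code: `ToyObjecter`, its members, and `handle_reply` are hypothetical, and only `LingerOp`, `send_linger`, and `register_tid` are names taken from the description.

```cpp
#include <cstdint>
#include <set>

struct LingerOp {
  uint64_t register_tid = 0;  // tid of the current registration op, 0 if none
};

struct ToyObjecter {
  uint64_t last_tid = 0;
  std::set<uint64_t> in_flight;  // tids whose replies we still care about

  // (Re)send the registration. Returns true if this was the first send,
  // which is inferred from register_tid == 0 rather than passed in.
  bool send_linger(LingerOp& op) {
    bool first_send = (op.register_tid == 0);
    if (!first_send) {
      in_flight.erase(op.register_tid);  // cancel old op; its reply is ignored
    }
    op.register_tid = ++last_tid;  // every send gets a fresh, unique tid
    in_flight.insert(op.register_tid);
    return first_send;
  }

  // Returns true only for replies to ops that have not been cancelled.
  bool handle_reply(uint64_t tid) { return in_flight.erase(tid) > 0; }
};
```

Because each resend carries a new tid, the receiver cannot mistake it for a dup, and a late reply to a cancelled registration is dropped.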

The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.

Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoOSD: publish_map in init to initialize OSDService map
Samuel Just [Wed, 18 Jul 2012 16:26:11 +0000 (09:26 -0700)]
OSD: publish_map in init to initialize OSDService map

Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called.  In particular, with handle_osd_ping,
not initializing the map member results in:

ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
 2: (()+0xfcb0) [0x7fcd8243dcb0]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
 4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
 5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
 6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
 8: (()+0x7e9a) [0x7fcd82435e9a]
 9: (clone()+0x6d) [0x7fcd809ea4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoqa/workunits/suites/pjd.sh: bash -x
Sage Weil [Wed, 18 Jul 2012 17:52:33 +0000 (10:52 -0700)]
qa/workunits/suites/pjd.sh: bash -x

This will let us see what test is failing, exactly, and what its inputs
were.  Hoping to help find #2187.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoObjectCacher: fix cache_bytes_hit accounting
Josh Durgin [Wed, 18 Jul 2012 17:24:58 +0000 (10:24 -0700)]
ObjectCacher: fix cache_bytes_hit accounting

Misses are not hits!

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agodoc: Fixed heading text.
John Wilkins [Wed, 18 Jul 2012 14:35:35 +0000 (07:35 -0700)]
doc: Fixed heading text.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: favicon.ico should be new Ceph icon.
John Wilkins [Wed, 18 Jul 2012 14:35:00 +0000 (07:35 -0700)]
doc: favicon.ico should be new Ceph icon.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: Overhauled Swift API documentation.
John Wilkins [Wed, 18 Jul 2012 04:28:59 +0000 (21:28 -0700)]
doc: Overhauled Swift API documentation.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Wed, 18 Jul 2012 02:20:06 +0000 (19:20 -0700)]
Merge branch 'next'

13 years agoclient: fix readdir locking
Sage Weil [Wed, 18 Jul 2012 02:19:39 +0000 (19:19 -0700)]
client: fix readdir locking

Several of the readdir-related methods were not taking client_lock.

Fixes: #1737
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>