ceph.git
13 years agoosd: fix pg log zeroing
Sage Weil [Tue, 24 Jul 2012 18:02:37 +0000 (11:02 -0700)]
osd: fix pg log zeroing

Zero the right number of bytes.  Fixes a bug where we clobber legit log
data.  Fortunately this is only triggered with osd preserve pg log = false,
which was not the default until recently in master.

Fixes: #2799
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
13 years agolibrbd: replace assign_bid with client id and random number
Josh Durgin [Mon, 23 Jul 2012 21:05:53 +0000 (14:05 -0700)]
librbd: replace assign_bid with client id and random number

The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.

This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.

Keep the server side assign_bid around in case there are old clients
still using it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
13 years agoosd: fix ACK ordering on resent ops
Sage Weil [Mon, 23 Jul 2012 23:51:03 +0000 (16:51 -0700)]
osd: fix ACK ordering on resent ops

The wait_for_ondisk handling fixed COMMIT ordering, but the ACKs need to
go back in the same order too.  For example:

 - op A is queued
 - client disconnects, both ACK and COMMIT replies are lost
 - client reconnects
 - op A and B are sent
 - op A is queued
 - op B is applied, ACK is sent
 - op A and B COMMITs are sent
 -> client's ack callbacks will see B and then A.

Fix this by creating a waiting_for_ack queue as well, and sending ACK
responses as needed.  Also handle the case where the ACK should be sent
immediately when the retry event is received.

Fixes: #2823
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
13 years agoMerge tag 'v0.49'
Sage Weil [Mon, 23 Jul 2012 19:43:19 +0000 (12:43 -0700)]
Merge tag 'v0.49'

v0.49

13 years agov0.49 v0.49
Sage Weil [Sat, 21 Jul 2012 06:26:56 +0000 (23:26 -0700)]
v0.49

13 years agomon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS
Sage Weil [Mon, 23 Jul 2012 17:47:10 +0000 (10:47 -0700)]
mon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS

This ensures that when a new osd reclaims that id it behaves as if it were
really new.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomkcephfs: add sync between btrfs scan and mount
Sage Weil [Mon, 23 Jul 2012 16:21:09 +0000 (09:21 -0700)]
mkcephfs: add sync between btrfs scan and mount

This appears to fix problems with mount failing for at least one user.

Reported-by: Paul Pettigrew <Paul.Pettigrew@mach.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agocrush: fix name map encoding
Sage Weil [Sat, 21 Jul 2012 16:15:06 +0000 (09:15 -0700)]
crush: fix name map encoding

We screwed up and encoded using the native 'int' type instead of int32_t.
That means people have systems encoding this as both 32 and 64 bit,
depending on their architecture.  This could be worse: x86_64 still has a
32-bit int (at least in my environment).

In any case, mixing both word sizes in their clusters is broken as a
result, with the exception of the kernel code, which doesn't decode this
part of the map and will tolerate differently-sized servers.

Fix this by:

 * encoding using int32_t now
 * decoding either 32-bit or 64-bit values, by assuming that the strings
   will always be non-empty.  This appears to be the case.

However:

 * any cluster with 64-bit ints must upgrade all at once, or else the new
   code will start encoding 32-bit values and the old code will be
   confused.
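
A minimal sketch of the decode heuristic described above, in Python rather
than the actual C++ encoder.  The wire layout (little-endian u32 count, then
per entry a key, a u32 length, and the string bytes) and the helper names are
assumptions for illustration.  It also assumes non-negative keys, so that the
upper half of a 64-bit key reads as zero.

```python
import struct


def encode_name_map(names, key_fmt="<i"):
    """Encode {int: str}; key_fmt "<i" is a 32-bit key, "<q" a 64-bit key."""
    out = struct.pack("<I", len(names))
    for key, name in sorted(names.items()):
        raw = name.encode("utf-8")
        out += struct.pack(key_fmt, key)
        out += struct.pack("<I", len(raw)) + raw
    return out


def decode_name_map(buf):
    """Decode a map written with either 32-bit or 64-bit keys."""
    (count,) = struct.unpack_from("<I", buf, 0)
    off = 4
    names = {}
    for _ in range(count):
        (key,) = struct.unpack_from("<i", buf, off)
        off += 4
        (length,) = struct.unpack_from("<I", buf, off)
        off += 4
        if length == 0:
            # Names are assumed non-empty, so a zero "length" must really be
            # the upper half of a 64-bit key; the real length follows it.
            (length,) = struct.unpack_from("<I", buf, off)
            off += 4
        names[key] = buf[off:off + length].decode("utf-8")
        off += length
    return names
```

Encoding with "<q" models an old 64-bit writer.  The decoder accepts both
forms, but an old reader that expects 64-bit keys has no such fallback, which
is why mixed upgrades are a problem.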

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
13 years agoosd/OpTracker: fix use-after-free
Sage Weil [Sat, 21 Jul 2012 15:24:37 +0000 (08:24 -0700)]
osd/OpTracker: fix use-after-free

And formatting.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOpRequest,OSD: track recent slow ops
Samuel Just [Fri, 20 Jul 2012 00:43:17 +0000 (17:43 -0700)]
OpRequest,OSD: track recent slow ops

This should be helpful while investigating slow performance.

OpRequests now track events with timestamps in addition
to dumping them to the log.  OpHistory keeps up to a
configurable number of the slowest ops over a configurable
recent time interval.  The admin socket interface for the OSD
now has a dump_historic_ops command which dumps the stored
slow ops.

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge branch 'next'
Samuel Just [Fri, 20 Jul 2012 21:32:44 +0000 (14:32 -0700)]
Merge branch 'next'

13 years agotest/store_test.cc: verify collection_list_partial results are sorted
Samuel Just [Fri, 20 Jul 2012 20:09:39 +0000 (13:09 -0700)]
test/store_test.cc: verify collection_list_partial results are sorted

Synthetic test now also varies snapshots and uses a small variety of
hashes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoos/HashIndex: use set<pair<string, hobject_t>> rather than multimap
Samuel Just [Fri, 20 Jul 2012 19:00:42 +0000 (12:00 -0700)]
os/HashIndex: use set<pair<string, hobject_t>> rather than multimap

Multimap does not make any guarantees about ordering of different
values with the same key.  list_by_hash, however, assumes that
the iterator order matches hobject_t order.  Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.
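
To illustrate the ordering issue in Python terms (a sketch, not the HashIndex
code): a stable sort on the key alone leaves equal-key entries in insertion
order, which is one ordering a multimap may produce, while sorting whole
(key, value) pairs orders ties by value, as a set of pairs does.  The strings
below are stand-ins for hash keys and hobject_t values.

```python
def order_by_key_only(entries):
    """Sort on the key only; entries with equal keys keep insertion order."""
    return sorted(entries, key=lambda entry: entry[0])


def order_by_pair(entries):
    """Sort on (key, value); entries with equal keys are ordered by value."""
    return sorted(set(entries))


entries = [("A1", "obj_b"), ("A1", "obj_a"), ("A0", "obj_c")]
key_only = order_by_key_only(entries)
by_pair = order_by_pair(entries)
```

Only the second ordering is guaranteed to agree with the value order for
entries that share a key, which is the property list_by_hash relies on.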

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agomon: shut up about sessionless MPGStats messages
Sage Weil [Fri, 20 Jul 2012 01:00:25 +0000 (18:00 -0700)]
mon: shut up about sessionless MPGStats messages

If the mon gets a reset on the client connection, it clears the session
on the connection.  This is perfectly normal to see.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: clean up boot method names
Sage Weil [Fri, 20 Jul 2012 04:27:20 +0000 (21:27 -0700)]
osd: clean up boot method names

Prefix subsequent steps with _.  Better names.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoosd: defer boot if heartbeatmap indicates we are unhealthy
Sage Weil [Fri, 20 Jul 2012 04:27:37 +0000 (21:27 -0700)]
osd: defer boot if heartbeatmap indicates we are unhealthy

If the OSD is bogged down or unresponsive, we should not try to join
the cluster.  This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).

Fixes: #2502
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Fri, 20 Jul 2012 03:22:35 +0000 (20:22 -0700)]
Merge branch 'next'

Conflicts:
src/include/ceph_features.h

13 years agoosd/mon: subscribe (onetime) to pg creations on connect
Sage Weil [Thu, 19 Jul 2012 23:47:23 +0000 (16:47 -0700)]
osd/mon: subscribe (onetime) to pg creations on connect

Ask the monitor for pending pg creations each time we connect.

Normally, this is a freebie check.  If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it.  Specifically:

 - osd is hunting for a monitor, but isn't yet connected
 - new pgs are created
 - send_pg_creates() sends out create messages, but osd does not get them
 - osd finally connects to a mon

Fixes: #2151 (tho the bug description is bad)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agomon: track pg creations by osd
Sage Weil [Wed, 18 Jul 2012 21:54:11 +0000 (14:54 -0700)]
mon: track pg creations by osd

Track the pending pg creations by osd, and use a helper to send out those
messages.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoRevert "rbd: fix usage for snap commands"
Sage Weil [Thu, 19 Jul 2012 23:45:07 +0000 (16:45 -0700)]
Revert "rbd: fix usage for snap commands"

This reverts commit 42de6873f9ca33fc20e70176d9a422635a6f0152.

Actually, these are fine!  Dan made them all kinds of fancy.

13 years agorbd: fix usage for snap commands
Sage Weil [Thu, 19 Jul 2012 23:48:18 +0000 (16:48 -0700)]
rbd: fix usage for snap commands

Snap commands take '--snap <snapname> <imagename>'.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodoc: add missing dependencies to README
Mike Ryan [Thu, 19 Jul 2012 18:18:19 +0000 (11:18 -0700)]
doc: add missing dependencies to README

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
13 years agoadd CRUSH_TUNABLES feature bit
Sage Weil [Thu, 19 Jul 2012 02:49:58 +0000 (19:49 -0700)]
add CRUSH_TUNABLES feature bit

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOSD::handle_osd_map: don't lock pgs while advancing maps
Samuel Just [Wed, 18 Jul 2012 22:37:28 +0000 (15:37 -0700)]
OSD::handle_osd_map: don't lock pgs while advancing maps

We no longer do anything with the pgs here.  PG map
advancing is now handled in OSD::advance_pg asynchronously.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoosd: add osd_debug_drop_pg_create_{probability,duration} options
Sage Weil [Wed, 18 Jul 2012 19:20:24 +0000 (12:20 -0700)]
osd: add osd_debug_drop_pg_create_{probability,duration} options

This will let us exercise more of the pg creation code.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOSD: write_if_dirty during get_or_create_pg after handle_create
Samuel Just [Wed, 18 Jul 2012 19:48:09 +0000 (12:48 -0700)]
OSD: write_if_dirty during get_or_create_pg after handle_create

In the case that the pg is newly created, we will activate during
that call, so the info and log will be dirty.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoOSD: actually send queries during handle_pg_create
Samuel Just [Wed, 18 Jul 2012 18:31:09 +0000 (11:31 -0700)]
OSD: actually send queries during handle_pg_create

During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context.  However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.

Part of #2771.  Without the queries being sent,
can_create_pg will never become true.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge branch 'next'
Josh Durgin [Wed, 18 Jul 2012 19:58:47 +0000 (12:58 -0700)]
Merge branch 'next'

13 years agoobjecter: always resend linger registrations
Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]
objecter: always resend linger registrations

If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request.  The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch.  This in turn will break the watch (i.e., notifies won't
get delivered).

Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.

 * track the tid of the registration op for each LingerOp
 * mark registration ops as should_resend=false; cancel as needed
 * when we send a new registration op, cancel the old one to ensure we
   ignore the reply.  This is needed because we resend linger ops on any
   pg change, not just a primary change.
 * drop the first_send arg to send_linger(), as we can now infer that
   from register_tid == 0.

The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.

Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoOSD: publish_map in init to initialize OSDService map
Samuel Just [Wed, 18 Jul 2012 16:26:11 +0000 (09:26 -0700)]
OSD: publish_map in init to initialize OSDService map

Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called.  In particular, with handle_osd_ping,
not initializing the map member results in:

ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
 2: (()+0xfcb0) [0x7fcd8243dcb0]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
 4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
 5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
 6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
 8: (()+0x7e9a) [0x7fcd82435e9a]
 9: (clone()+0x6d) [0x7fcd809ea4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoqa/workunits/suites/pjd.sh: bash -x
Sage Weil [Wed, 18 Jul 2012 17:52:33 +0000 (10:52 -0700)]
qa/workunits/suites/pjd.sh: bash -x

This will let us see what test is failing, exactly, and what its inputs
were.  Hoping to help find #2187.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoObjectCacher: fix cache_bytes_hit accounting
Josh Durgin [Wed, 18 Jul 2012 17:24:58 +0000 (10:24 -0700)]
ObjectCacher: fix cache_bytes_hit accounting

Misses are not hits!

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agodoc: Fixed heading text.
John Wilkins [Wed, 18 Jul 2012 14:35:35 +0000 (07:35 -0700)]
doc: Fixed heading text.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: favicon.ico should be new Ceph icon.
John Wilkins [Wed, 18 Jul 2012 14:35:00 +0000 (07:35 -0700)]
doc: favicon.ico should be new Ceph icon.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: Overhauled Swift API documentation.
John Wilkins [Wed, 18 Jul 2012 04:28:59 +0000 (21:28 -0700)]
doc: Overhauled Swift API documentation.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Wed, 18 Jul 2012 02:20:06 +0000 (19:20 -0700)]
Merge branch 'next'

13 years agoclient: fix readdir locking
Sage Weil [Wed, 18 Jul 2012 02:19:39 +0000 (19:19 -0700)]
client: fix readdir locking

Several of the readdir-related methods were not taking client_lock.

Fixes: #1737
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoclient: fix leak of client_lock when not initialized
Sage Weil [Tue, 17 Jul 2012 19:38:50 +0000 (12:38 -0700)]
client: fix leak of client_lock when not initialized

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOSD: use service.get_osdmap() in heartbeat(), don't grab map_lock
Samuel Just [Tue, 17 Jul 2012 23:35:43 +0000 (16:35 -0700)]
OSD: use service.get_osdmap() in heartbeat(), don't grab map_lock

service.get_osdmap() gives us sufficiently consistent
access to the map state.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoOSD: handle_osd_ping: use service->get_osdmap()
Samuel Just [Tue, 17 Jul 2012 23:20:38 +0000 (16:20 -0700)]
OSD: handle_osd_ping: use service->get_osdmap()

This way, we avoid grabbing the map_lock.  Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.

This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agodoc/dev/osd_internals: add newlines before numbered lists
Samuel Just [Tue, 17 Jul 2012 23:09:10 +0000 (16:09 -0700)]
doc/dev/osd_internals: add newlines before numbered lists

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agolibrados: simplify locking slightly
Sage Weil [Tue, 17 Jul 2012 23:01:11 +0000 (16:01 -0700)]
librados: simplify locking slightly

No reason to hold mylock_all here.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: default 'osd_preserve_trimmed_log = false'
Sage Weil [Tue, 17 Jul 2012 19:38:40 +0000 (12:38 -0700)]
osd: default 'osd_preserve_trimmed_log = false'

This option makes the osd skip zeroing old trimmed regions of the log.  The
data is never read, since the xattrs indicate which part of the log is
valid.  We've never actually used this to debug a problem, and it consumes
space, so let's disable it.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodoc/dev: add osd_internals to toc
Samuel Just [Tue, 17 Jul 2012 16:42:43 +0000 (09:42 -0700)]
doc/dev: add osd_internals to toc

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agodoc/internals/osd_internals: fix indentation errors
Samuel Just [Tue, 17 Jul 2012 16:31:22 +0000 (09:31 -0700)]
doc/internals/osd_internals: fix indentation errors

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agodoc: discuss choice of pg_num
Sage Weil [Mon, 16 Jul 2012 23:44:05 +0000 (16:44 -0700)]
doc: discuss choice of pg_num

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: simplify log logic a bit
Sage Weil [Mon, 16 Jul 2012 23:18:51 +0000 (16:18 -0700)]
log: simplify log logic a bit

Whether an entry is eligible to log/dump is independent of the channel it
is sent to.  Some channels impose additional restrictions.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge branch 'next'
Josh Durgin [Tue, 17 Jul 2012 00:36:06 +0000 (17:36 -0700)]
Merge branch 'next'

13 years agoRobustify ceph-rbdnamer and adapt udev rules
Pascal de Bruijn | Unilogic Networks B.V [Wed, 11 Jul 2012 13:23:16 +0000 (15:23 +0200)]
Robustify ceph-rbdnamer and adapt udev rules

Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

  /dev/rbd/mypool/myrbd -> /dev/rbd1

While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

The old udev rule passes %n to the ceph-rbdnamer script.  The problem is
that %n yields 3 for rbd3 but 1 for rbd3p1, so it can't be depended upon
for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust and it
can now be called in various ways:

  /usr/bin/ceph-rbdnamer /dev/rbd3
  /usr/bin/ceph-rbdnamer /dev/rbd3p1
  /usr/bin/ceph-rbdnamer rbd3
  /usr/bin/ceph-rbdnamer rbd3p1
  /usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
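
As a rough illustration of the argument normalization only (the real
ceph-rbdnamer is a shell script; this Python sketch and the function name
rbd_device_number are hypothetical), every calling style listed above can be
reduced to the same device number like this:

```python
import re

# Optional "/dev/" prefix, optional "rbd" prefix, the device number,
# then an optional partition suffix such as "p1".
_RBD_ARG = re.compile(r"^(?:/dev/)?(?:rbd)?(\d+)(?:p\d+)?$")


def rbd_device_number(arg):
    """Return the rbd device number for any of the accepted argument forms."""
    match = _RBD_ARG.match(arg)
    if match is None:
        raise ValueError("not an rbd device: %r" % (arg,))
    return int(match.group(1))
```

Because the partition suffix is dropped, /dev/rbd3 and /dev/rbd3p1 map to the
same device, which is what lets both resolve to the same rbd name.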

With that fixed, we hit the second problem. We ended up with:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:

  /dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

  /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1

Please let me know any feedback you have on this patch or the approach
used.

Regards,
Pascal de Bruijn
Unilogic B.V.

Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agodoc/radosgw/config.rst: mended small typo
caleb miles [Mon, 16 Jul 2012 23:30:36 +0000 (16:30 -0700)]
doc/radosgw/config.rst: mended small typo

Signed-off-by: caleb miles <caleb.miles@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Mon, 16 Jul 2012 23:13:55 +0000 (16:13 -0700)]
Merge branch 'next'

13 years agoMerge branch 'wip-mon-mkfs'
Sage Weil [Mon, 16 Jul 2012 23:15:33 +0000 (16:15 -0700)]
Merge branch 'wip-mon-mkfs'

Reviewed-by: Tommi Virtanen <tv@inktank.com>
13 years agomkcephfs: nicer empty directory check
Sage Weil [Mon, 16 Jul 2012 23:10:57 +0000 (16:10 -0700)]
mkcephfs: nicer empty directory check

From TV.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomkcephfs: error out if mon data directory is not empty
Sage Weil [Tue, 10 Jul 2012 01:16:44 +0000 (18:16 -0700)]
mkcephfs: error out if mon data directory is not empty

The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.

So, ensure that the directory is empty at mkfs time.  This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agovstart.sh: blow away mon directory on creation/start
Sage Weil [Tue, 10 Jul 2012 01:17:54 +0000 (18:17 -0700)]
vstart.sh: blow away mon directory on creation/start

Now that ceph-mon doesn't blow away the mon data content, we need to.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: stop doing rm -rf on mon mkfs
Sage Weil [Tue, 10 Jul 2012 01:17:16 +0000 (18:17 -0700)]
mon: stop doing rm -rf on mon mkfs

Simply verify that the directory exists, or if it doesn't, create it.
Do nothing about its content.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: apply log_level to stderr/syslog logic
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic

In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold.  Otherwise
we get anything we gather on those channels, even when the log level is
low.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: dump logging levels in crash dump
Sage Weil [Mon, 16 Jul 2012 22:40:03 +0000 (15:40 -0700)]
log: dump logging levels in crash dump

So you know what you are/are not seeing.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Mon, 16 Jul 2012 22:53:54 +0000 (15:53 -0700)]
Merge branch 'next'

13 years agoPG: grab reference to pg in C_OSD_AppliedRecoveredObject
Samuel Just [Mon, 16 Jul 2012 22:43:47 +0000 (15:43 -0700)]
PG: grab reference to pg in C_OSD_AppliedRecoveredObject

Otherwise, accessing the pg via _applied_recovered_object
isn't safe.  Using intrusive_ptr clarifies the reference
ownership.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agolog: fix event gather condition
Sage Weil [Mon, 16 Jul 2012 22:40:53 +0000 (15:40 -0700)]
log: fix event gather condition

We should gather an event if it is below the log or gather threshold.

Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
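
A small sketch of the two conditions, assuming that lower numbers mean more
important messages and that "below" includes equality.  The function names
and the plain integer arguments are illustrative, not the actual log code.

```python
def should_gather_old(level, log_level, gather_level):
    """Old behaviour: gather only what would be printed anyway."""
    return level <= log_level


def should_gather(level, log_level, gather_level):
    """New behaviour: gather if either threshold admits the event."""
    return level <= log_level or level <= gather_level
```

With log_level 1 and gather_level 20, a level-10 event is now kept in memory
for a later dump even though it is never written to the log.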

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoPG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log
Samuel Just [Mon, 16 Jul 2012 20:14:43 +0000 (13:14 -0700)]
PG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log

We adjust the info and the log, so we must set dirty_info and
dirty_log to force writes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: use stats from primary after rewinding divergent entries
Samuel Just [Mon, 16 Jul 2012 20:07:56 +0000 (13:07 -0700)]
PG: use stats from primary after rewinding divergent entries

If the osd receiving the info has divergent entries, it will
also have a "divergent" stat structure.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.

This is another way for the bug addressed in
5924f8e4a8c29e6de326a9e8576c30109cdc0e07 to happen.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge remote-tracking branch 'upstream/next'
Samuel Just [Mon, 16 Jul 2012 21:18:04 +0000 (14:18 -0700)]
Merge remote-tracking branch 'upstream/next'

13 years agoPG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub
Samuel Just [Mon, 16 Jul 2012 20:11:24 +0000 (13:11 -0700)]
PG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub

We need to reset the last_pg_scrub data in the osd since we
are replacing the info.

Probably fixes #2453

In cases like #2453, we hit the following backtrace:

     0> 2012-05-19 17:24:09.113684 7fe66be3d700 -1 osd/OSD.h: In function 'void OSD::unreg_last_pg_scrub(pg_t, utime_t)' thread 7fe66be3d700 time 2012-05-19 17:24:09.095719
osd/OSD.h: 840: FAILED assert(last_scrub_pg.count(p))

 ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b)
 1: (OSD::unreg_last_pg_scrub(pg_t, utime_t)+0x149) [0x641f49]
 2: (PG::proc_primary_info(ObjectStore::Transaction&, pg_info_t const&)+0x5e) [0x63383e]
 3: (PG::RecoveryState::ReplicaActive::react(PG::RecoveryState::MInfoRec const&)+0x4a) [0x633eda]
 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::MQuery>, boost::statechart::custom_reaction<PG::RecoveryState::MInfoRec>, boost::statechart::custom_reaction<PG::RecoveryState::MLogRec> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x130) [0x6466a0]
 5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x646791]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x63dfcb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x63e0f1]
 8: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x177) [0x616987]
 9: (OSD::handle_pg_info(std::tr1::shared_ptr<OpRequest>)+0x665) [0x5d3d15]
 10: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x2a0) [0x5d7370]
 11: (OSD::_dispatch(Message*)+0x191) [0x5dd4a1]
 12: (OSD::ms_dispatch(Message*)+0x153) [0x5ddda3]
 13: (SimpleMessenger::dispatch_entry()+0x863) [0x77fbc3]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x746c5d]
 15: (()+0x7efc) [0x7fe679b1fefc]
 16: (clone()+0x6d) [0x7fe67815089d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state resulting in the
above assert failure.

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agodoc/dev/osd_internals: OSD overview, pg removal, map/message handling
Samuel Just [Wed, 11 Jul 2012 00:52:21 +0000 (17:52 -0700)]
doc/dev/osd_internals: OSD overview, pg removal, map/message handling

This is a start on some osd internals documentation for new
developers.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: Place info in biginfo object
Samuel Just [Fri, 22 Jun 2012 17:11:38 +0000 (10:11 -0700)]
PG: Place info in biginfo object

The purged_snaps set can grow without bound as snaps are
created and removed.  Because the filestore doesn't
provide unlimited size collection attributes, it's better
to place the full info on the biginfo object, since we
need to write it during write_info anyway.

Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: use write_info to set snap_collections in make_snap_collections
Samuel Just [Fri, 29 Jun 2012 20:39:49 +0000 (13:39 -0700)]
PG: use write_info to set snap_collections in make_snap_collections

At one point, snap_collections were written to a pg collection
attribute.  Subsequently, they were moved to the biginfo object
since the structure can grow too large for limited size xattrs.
make_snap_collection, however, was not updated.

Using write_info here should prevent this from happening in
the future.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoOSD: set superblock compat_features on boot and mkfs
Samuel Just [Fri, 13 Jul 2012 23:44:33 +0000 (16:44 -0700)]
OSD: set superblock compat_features on boot and mkfs

Previously, we did not actually persist the osd compatibility
mask.  Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoCompatSet: users pass bit indices rather than masks
Samuel Just [Fri, 13 Jul 2012 21:23:27 +0000 (14:23 -0700)]
CompatSet: users pass bit indices rather than masks

CompatSet users number the Feature objects rather than
providing masks.  Thus, we should do

mask |= (1 << f.id) rather than mask |= f.id.

In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.

This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.
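
The arithmetic can be checked with a short sketch (Python here, with plain
integer ids standing in for Feature objects; the function names are
illustrative):

```python
def broken_mask(feature_ids):
    """OR the ids in directly, as the old code effectively did."""
    mask = 0
    for fid in feature_ids:
        mask |= fid
    return mask


def fixed_mask(feature_ids):
    """Treat each id as a bit index and set that bit."""
    mask = 0
    for fid in feature_ids:
        mask |= 1 << fid
    return mask
```

With the broken form, features {1, 2} and {1, 2, 3} both give a mask of 3, so
feature 3 is invisible.  With the fixed form they give 6 and 14.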

fixes: #2748

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoosd: based misdirected op role calc on acting set
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: based misdirected op role calc on acting set

We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon/MonitorStore: always O_TRUNC when writing states
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states

It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.
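
The failure mode is easy to reproduce outside the monitor.  The sketch below
uses Python's os.open with and without O_TRUNC; the file name and the helper
names are made up for the example and are not MonitorStore code.

```python
import os
import tempfile


def write_state(path, data, truncate):
    """Write data at offset 0, optionally truncating any existing file."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def demo():
    """Return the file contents after a short write over a longer one."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for truncate in (False, True):
            path = os.path.join(tmp, "state.new")
            write_state(path, b"old-longer-value", truncate=True)
            write_state(path, b"new", truncate=truncate)
            with open(path, "rb") as f:
                results[truncate] = f.read()
    return results
```

Without O_TRUNC the second write leaves the tail of the stale, longer value
in place, so the file holds a mix of both values.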

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
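
The stale-tail problem can be reproduced with plain POSIX calls. This is a minimal sketch, not the MonitorStore code: `write_file` and `read_file` are hypothetical helpers, and the file path used in any caller is an assumption.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Write `data` to `path`, creating the file if needed. When `truncate` is
// false, any bytes of an older, longer file beyond data.size() survive.
bool write_file(const std::string& path, const std::string& data, bool truncate) {
  int flags = O_WRONLY | O_CREAT | (truncate ? O_TRUNC : 0);
  int fd = ::open(path.c_str(), flags, 0644);
  if (fd < 0)
    return false;
  ssize_t n = ::write(fd, data.data(), data.size());
  ::close(fd);
  return n == static_cast<ssize_t>(data.size());
}

// Read the whole file back; returns an empty string if it cannot be opened.
std::string read_file(const std::string& path) {
  std::string out;
  int fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0)
    return out;
  char buf[256];
  ssize_t n;
  while ((n = ::read(fd, buf, sizeof(buf))) > 0)
    out.append(buf, static_cast<size_t>(n));
  ::close(fd);
  return out;
}
```

Writing a short value over a leftover longer file without O_TRUNC leaves the old tail in place; with O_TRUNC the file contains only the new value.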
13 years agoMerge remote-tracking branch 'gh/bugfix-2022'
Sage Weil [Mon, 16 Jul 2012 17:48:25 +0000 (10:48 -0700)]
Merge remote-tracking branch 'gh/bugfix-2022'

Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge remote-tracking branch 'gh/bugfix-2779'
Sage Weil [Mon, 16 Jul 2012 16:12:09 +0000 (09:12 -0700)]
Merge remote-tracking branch 'gh/bugfix-2779'

Reviewed-by: Greg Farnum <greg@inktank.com>
13 years agomon: remove osds from [near]full sets when their stats are removed from pgmap
Sage Weil [Mon, 16 Jul 2012 05:03:31 +0000 (22:03 -0700)]
mon: remove osds from [near]full sets when their stats are removed from pgmap

Greg points out that we could have a situation like:

 - mon recovers..
 - goes through osdmaps, notes an osd was removed and removes from
   full/nearfull
 - goes through pgmaps, and re-adds it when it encounters some osd_stat_ts.

Fix this by removing the osd from the full/nearfull set when we remove
the osd_stat_t from the pgmap.  Any osd removal is always followed by
an osd_stat_rm[] record when the primary processes the new osdmap and
proposes the appropriate pgmap updates.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon/MonitorStore: always O_TRUNC when writing states
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states

It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agofilestore: dump open fds when we hit EMFILE
Sage Weil [Sun, 15 Jul 2012 22:21:57 +0000 (15:21 -0700)]
filestore: dump open fds when we hit EMFILE

Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore.
Ideally, we should trigger this in other appropriate places, but it is
not immediately clear that there is a sane way to do that.

Fixes: #2330
Signed-off-by: Sage Weil <sage@inktank.com>
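
A helper of this kind might look roughly like the sketch below. It is Linux-specific, and `list_open_fds` is a hypothetical name chosen for this note; the actual filestore helper may differ in both name and output format.

```cpp
#include <dirent.h>
#include <limits.h>
#include <unistd.h>
#include <string>
#include <vector>

// Hypothetical helper: return one "fd -> target" line per open descriptor
// by walking /proc/self/fd. Returns an empty list if /proc is unavailable.
std::vector<std::string> list_open_fds() {
  std::vector<std::string> out;
  DIR* d = ::opendir("/proc/self/fd");
  if (!d)
    return out;
  while (struct dirent* de = ::readdir(d)) {
    if (de->d_name[0] == '.')
      continue;  // skip "." and ".."
    std::string link = std::string("/proc/self/fd/") + de->d_name;
    char target[PATH_MAX];
    ssize_t n = ::readlink(link.c_str(), target, sizeof(target) - 1);
    std::string entry = de->d_name;
    entry += " -> ";
    entry += (n >= 0) ? std::string(target, static_cast<size_t>(n))
                      : std::string("?");
    out.push_back(entry);
  }
  ::closedir(d);
  return out;
}
```

A caller would check for `errno == EMFILE` after a failed `open()` and log each returned line. Note that the listing itself needs one free descriptor for `opendir()`, which may be part of why triggering it elsewhere is not straightforward.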
13 years agoosdmap: drop useless and unused get_pg_role() method
Sage Weil [Sat, 14 Jul 2012 21:32:28 +0000 (14:32 -0700)]
osdmap: drop useless and unused get_pg_role() method

Users probably want get_pg_acting_rank().  If they don't, they probably
have the mapping and can calculate the rank themselves.  Having this here
is asking for bugs like #2022.

Signed-off-by: Sage Weil <sage@inktank.com>
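
Calculating the rank from a mapping the caller already holds is a short loop. The sketch below assumes the acting set is a vector of OSD ids; `acting_rank` is a hypothetical free function for illustration, not the OSDMap method.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: position of `osd` in the acting set, or -1 if absent.
// Rank 0 is the primary under this assumed layout.
int acting_rank(const std::vector<int>& acting, int osd) {
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == osd)
      return static_cast<int>(i);
  }
  return -1;
}
```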
13 years agoosd: base misdirected op role calc on acting set
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: base misdirected op role calc on acting set

We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: simplify helper usage for misdirected ops
Sage Weil [Sat, 14 Jul 2012 21:29:29 +0000 (14:29 -0700)]
osd: simplify helper usage for misdirected ops

Make the helper exclusively for the PG != NULL cases, and open-code the
one PG == NULL caller.  This is simpler, and lets us include more useful
information in the log message.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agovstart: use absolute path for keyring
Noah Watkins [Sat, 14 Jul 2012 00:29:56 +0000 (17:29 -0700)]
vstart: use absolute path for keyring

Stores absolute path to the generated keyring so that tests running in
other directories (e.g. src/java/test) can simply reference the
generated ceph.conf.

Signed-off-by: Noah Watkins <jawhawk@cs.ucsc.edu>
13 years agoOSD: add config options to fake missed pings
Samuel Just [Fri, 13 Jul 2012 20:45:24 +0000 (13:45 -0700)]
OSD: add config options to fake missed pings

In order to test monitor and osd failure detection and false
positive correction, this patch adds the following options:

 1. osd_debug_drop_ping_probability: probability of dropping
    a string of pings from a client upon ping receipt.
 2. osd_debug_drop_ping_duration: number of pings to drop in
    a row.

This should help with replicating some wrongly-marked-down
thrashing cases.

Signed-off-by: Samuel Just <sam.just@inktank.com>
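
One way the two options could interact is sketched below. `PingDropper` is a hypothetical model written for this note, not the OSD's implementation; in particular, counting the triggering ping as the first of the run is an assumption.

```cpp
#include <random>

// Hypothetical model of osd_debug_drop_ping_probability and
// osd_debug_drop_ping_duration; not taken from the OSD source.
class PingDropper {
 public:
  PingDropper(double probability, int duration, unsigned seed = 0)
      : probability_(probability), duration_(duration), rng_(seed) {}

  // Returns true if the ping that was just received should be dropped.
  bool should_drop() {
    if (remaining_ > 0) {  // still inside a run of dropped pings
      --remaining_;
      return true;
    }
    if (duration_ > 0 && dist_(rng_) < probability_) {
      remaining_ = duration_ - 1;  // this ping is the first of the run
      return true;
    }
    return false;
  }

 private:
  double probability_;
  int duration_;
  int remaining_ = 0;
  std::mt19937 rng_;
  std::uniform_real_distribution<double> dist_{0.0, 1.0};
};
```

With a probability of 0 nothing is ever dropped, and with a probability of 1 every ping is dropped; values in between produce occasional runs of `duration` consecutive drops.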
13 years agocrushtool: allow information generated during testing to be dumped
caleb miles [Fri, 6 Jul 2012 00:30:01 +0000 (17:30 -0700)]
crushtool: allow information generated during testing to be dumped
to a set of CSV files for off-line analysis.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
13 years agodoc: remove last reference to ceph-cookbooks.
John Wilkins [Fri, 13 Jul 2012 21:16:08 +0000 (14:16 -0700)]
doc: remove last reference to ceph-cookbooks.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.'
John Wilkins [Fri, 13 Jul 2012 21:08:41 +0000 (14:08 -0700)]
doc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.'

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoqa: download tests from specified branch
Josh Durgin [Fri, 13 Jul 2012 16:42:20 +0000 (09:42 -0700)]
qa: download tests from specified branch

These python tests aren't installed, so they need to be downloaded.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoOSD: send_still_alive when we get a reply if we reported failure
Samuel Just [Fri, 13 Jul 2012 16:20:02 +0000 (09:20 -0700)]
OSD: send_still_alive when we get a reply if we reported failure

When we get a ping reply, remove the peer from the failure_queue
and send a still alive message if the peer is in the failure_pending
map.

Otherwise, the monitor could slowly accumulate sporadic failure reports
leading to an osd being incorrectly marked out.

This bug may have been contributing to the wrongly-marked-down
thrashing observed on some systems.

Signed-off-by: Samuel Just <sam.just@inktank.com>
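
The ping-reply handling described above can be modelled as follows. `FailureTracker` is a hypothetical structure: the container types, the `still_alive_sent` record, and the decision to drop the peer from `failure_pending` after sending are assumptions made for illustration.

```cpp
#include <set>
#include <vector>

// Hypothetical model of the OSD's failure bookkeeping; only the names
// failure_queue and failure_pending come from the commit message.
struct FailureTracker {
  std::set<int> failure_queue;        // peers we are about to report as failed
  std::set<int> failure_pending;      // peers already reported to the monitor
  std::vector<int> still_alive_sent;  // stand-in for messages sent to the monitor

  // Called when a ping reply arrives from `peer`.
  void handle_ping_reply(int peer) {
    failure_queue.erase(peer);  // no longer a candidate for a failure report
    auto it = failure_pending.find(peer);
    if (it != failure_pending.end()) {
      still_alive_sent.push_back(peer);  // retract the earlier failure report
      failure_pending.erase(it);         // assumed: the report is now cleared
    }
  }
};
```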
13 years agoPG: merge_log always use stats from authoritative replica
Samuel Just [Fri, 13 Jul 2012 00:19:43 +0000 (17:19 -0700)]
PG: merge_log always use stats from authoritative replica

If the osd receiving the log has divergent entries, it will
also have a "divergent" stat structure.  In general, it suffices
to simply trust the stat structure shipped with the authoritative
log and info since merge_log is only used to merge an authoritative
log.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.  It turned up
in a regression suite run as a scrub stat mismatch.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoqa: download tests from specified branch
Josh Durgin [Fri, 13 Jul 2012 16:42:20 +0000 (09:42 -0700)]
qa: download tests from specified branch

These python tests aren't installed, so they need to be downloaded.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agomon: use single helper for [near]full sets
Sage Weil [Fri, 13 Jul 2012 14:27:36 +0000 (07:27 -0700)]
mon: use single helper for [near]full sets

Use a single helper to add/remove osds from the [near]full sets.  This
keeps the logic in a single place, and simplifies the code somewhat.

Signed-off-by: Sage Weil <sage@inktank.com>
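
A single helper for both sets might take the shape sketched below. `FullSets`, `Fullness`, and `set_state` are hypothetical names; the point is only that every add or remove goes through one function, so an osd can never sit in both sets or linger after removal.

```cpp
#include <set>

// Hypothetical fullness states for one osd.
enum class Fullness { None, NearFull, Full };

// Hypothetical container for the volatile [near]full sets.
struct FullSets {
  std::set<int> full_osds;
  std::set<int> nearfull_osds;

  // The only place that touches either set: clear first, then re-add.
  void set_state(int osd, Fullness state) {
    full_osds.erase(osd);
    nearfull_osds.erase(osd);
    if (state == Fullness::Full)
      full_osds.insert(osd);
    else if (state == Fullness::NearFull)
      nearfull_osds.insert(osd);
  }
};
```

Purging a removed osd is then `set_state(osd, Fullness::None)`.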
13 years agomon: purge removed osds from [near]full sets
Sage Weil [Fri, 13 Jul 2012 13:10:07 +0000 (06:10 -0700)]
mon: purge removed osds from [near]full sets

The [near]full sets are volatile state.  Remove removed (or created)
osds from the set when we process a map.

Fixes: #2779
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoReplicatedPG: don't mark repop done until apply completes
Samuel Just [Thu, 12 Jul 2012 23:45:26 +0000 (16:45 -0700)]
ReplicatedPG: don't mark repop done until apply completes

Consider the following sequence:
1. issue, apply repop
2. replicas and primary commit
  Here, repop->waitfor_(ack|disk) are empty, so we mark
  repop->done and remove_repop.
3. interval change, repops still in queue are marked aborted
4. activate, last_update_applied = last_update
5. the repop from step 1 enters apply_repop, is not aborted,
   and finds that last_update_applied has passed it by.

Fixes #2749

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agotest_librbd: fix warnings
Sage Weil [Thu, 12 Jul 2012 23:14:33 +0000 (16:14 -0700)]
test_librbd: fix warnings

test/test_librbd.cc: In member function ‘virtual void LibRBD_TestClone_Test::TestBody()’:
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat]
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat]
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘int64_t {aka long long int}’ [-Wformat]

Signed-off-by: Sage Weil <sage@inktank.com>
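
Warnings of this kind are commonly avoided with the fixed-width format macros from `<cinttypes>`. The sketch below shows that general technique; it is not necessarily the change made in test_librbd.cc, and `format_sizes` and its parameter names are hypothetical.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical example: format two uint64_t values and one int64_t value
// portably, regardless of whether uint64_t is `long` or `long long`.
std::string format_sizes(uint64_t size, uint64_t overlap, int64_t delta) {
  char buf[96];
  std::snprintf(buf, sizeof(buf),
                "size=%" PRIu64 " overlap=%" PRIu64 " delta=%" PRId64,
                size, overlap, delta);
  return buf;
}
```

An alternative with the same effect is to cast each argument to `long long` or `unsigned long long` and use `%lld` / `%llu`.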
13 years agoReplicatedPG,PG: dump recovery/backfill state on pg query
Samuel Just [Fri, 6 Jul 2012 21:38:57 +0000 (14:38 -0700)]
ReplicatedPG,PG: dump recovery/backfill state on pg query

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge remote-tracking branch 'gh/wip-2101'
Sage Weil [Thu, 12 Jul 2012 20:11:33 +0000 (13:11 -0700)]
Merge remote-tracking branch 'gh/wip-2101'

13 years agorbd: enable layering when using the new format
Josh Durgin [Thu, 12 Jul 2012 18:13:47 +0000 (11:13 -0700)]
rbd: enable layering when using the new format

We'll add options for different features later.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agodoc: reverted file and role names.
John Wilkins [Thu, 12 Jul 2012 18:46:43 +0000 (11:46 -0700)]
doc: reverted file and role names.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoupstart: Make ceph-osd always set the crush location.
Tommi Virtanen [Thu, 12 Jul 2012 17:47:29 +0000 (10:47 -0700)]
upstart: Make ceph-osd always set the crush location.

This used to be conditional on config having osd_crush_location set,
but with that, minimal configuration left the OSD completely out of
the crush map, and prevented the OSD from starting properly.

Note: Ceph does not currently let this mechanism automatically move
hosts to another location in the CRUSH hierarchy. This means if you
let this run with defaults, setting osd_crush_location later will not
take effect. Set up your config file (or Chef environment) fully
before starting the OSDs the first time.

Signed-off-by: Tommi Virtanen <tv@inktank.com>
13 years agodoc: fix config metavariables discussion
Sage Weil [Thu, 12 Jul 2012 16:57:31 +0000 (09:57 -0700)]
doc: fix config metavariables discussion

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodoc: perf counters
Sage Weil [Thu, 12 Jul 2012 16:51:24 +0000 (09:51 -0700)]
doc: perf counters

Signed-off-by: Sage Weil <sage@inktank.com>