git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sun, 2 Jun 2013 21:48:04 +0000 (14:48 -0700)]
Merge remote-tracking branch 'gh/next'

12 years agomon: fix uninitialized fields in MMonHealth
Sage Weil [Sat, 1 Jun 2013 04:16:54 +0000 (21:16 -0700)]
mon: fix uninitialized fields in MMonHealth

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: start lease timer from peon_init()
Sage Weil [Sat, 1 Jun 2013 00:09:19 +0000 (17:09 -0700)]
mon: start lease timer from peon_init()

In the scenario:

 - leader wins, peons lose
 - leader sees it is too far behind on paxos and bootstraps
 - leader tries to sync with someone, waits for a quorum of the others
 - peons sit around forever waiting

The problem is that they never time out because paxos never issues a lease,
which is the normal timeout that lets them detect a leader failure.

Avoid this by starting the lease timeout as soon as we lose the election.
The timeout callback just does a bootstrap and does not rely on any other
state.

I see one possible danger here: there may be some "normal" cases where the
leader takes a long time to issue its first lease that we currently
tolerate, but won't with this new check in place.  I hope that raising
the lease interval/timeout or reducing the allowed paxos drift will make
that a non-issue.  If it is problematic, we will need a separate explicit
"i am alive" from the leader while it is getting ready to issue the lease
to prevent a live-lock.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
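
The timeout behaviour described in the commit above can be sketched roughly as follows. This is a minimal Python model, not the monitor's C++ code: the `Peon` class, its `tick()` and `handle_lease()` methods, and the numeric timeout are all hypothetical names chosen for illustration. Only `peon_init()` and the idea of bootstrapping on timeout come from the commit message.

```python
class Peon:
    """Toy model of a peon that arms a lease timeout when it loses an election."""

    def __init__(self, lease_timeout):
        self.lease_timeout = lease_timeout  # hypothetical value, in seconds
        self.deadline = None
        self.bootstrapped = False

    def peon_init(self, now):
        # Arm the timer as soon as we lose the election, even though
        # the leader has not issued any lease yet.
        self.deadline = now + self.lease_timeout

    def handle_lease(self, now):
        # A lease from the leader pushes the deadline out again.
        self.deadline = now + self.lease_timeout

    def tick(self, now):
        # The timeout callback only bootstraps; it relies on no other state.
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            self.bootstrapped = True
        return self.bootstrapped


stuck = Peon(lease_timeout=10)
stuck.peon_init(now=0)
print(stuck.tick(now=5))    # False: still within the timeout
print(stuck.tick(now=10))   # True: no lease ever arrived, so bootstrap
```

Without the call in `peon_init()`, `deadline` would stay unset in this model and `tick()` would never fire, which mirrors the "peons sit around forever" scenario.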
12 years agomon: discard messages from disconnected clients
Sage Weil [Fri, 31 May 2013 05:52:21 +0000 (22:52 -0700)]
mon: discard messages from disconnected clients

If the client is not connected, discard the message.  They will
reconnect and resend anyway, so there is no point in processing it
twice (now and later).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomon/Paxos: adjust trimming defaults up; rename options
Sage Weil [Thu, 30 May 2013 22:59:49 +0000 (15:59 -0700)]
mon/Paxos: adjust trimming defaults up; rename options

- trim more at a time (by an order of magnitude)
- rename fields to paxos_trim_{min,max}; only trim when there are min items
  that are trimmable, and trim at most max items at a time.
- adjust the paxos_service_trim_{min,max} values up by a factor of 2.

Since we are compacting every time we trim, adjusting these up means less
frequent compactions and less overall work for the monitor.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
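
The min/max rule in the commit above can be illustrated with a short sketch. It is written in Python for brevity and is not the monitor's implementation; `plan_trim` and its arguments are hypothetical names, and the numbers in the usage lines are made up rather than the actual option defaults.

```python
def plan_trim(first_committed, trim_to, trim_min, trim_max):
    """Return the half-open version range [start, end) to trim, or None.

    Trimming only happens when at least trim_min versions are trimmable,
    and at most trim_max versions are removed in one pass.
    """
    trimmable = trim_to - first_committed
    if trimmable < trim_min:
        return None
    end = first_committed + min(trimmable, trim_max)
    return (first_committed, end)


print(plan_trim(100, 150, trim_min=250, trim_max=500))  # None
print(plan_trim(100, 900, trim_min=250, trim_max=500))  # (100, 600)
```

Raising both thresholds means each pass removes more versions and passes happen less often, which is the effect the commit message is after.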
12 years agoOSD: *inodes_hard_limit must be less than the fd limit
Samuel Just [Fri, 31 May 2013 22:11:02 +0000 (15:11 -0700)]
OSD: *inodes_hard_limit must be less than the fd limit

Also add a comment explaining that.

Fixes: #5224
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoOSD: tell them they died if they don't exist as well
Samuel Just [Fri, 31 May 2013 21:59:27 +0000 (14:59 -0700)]
OSD: tell them they died if they don't exist as well

OSDMap::get_down_at() asserts that the osd exists.

Fixes: #5223
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-osd-leaks' into next
Sage Weil [Fri, 31 May 2013 21:48:51 +0000 (14:48 -0700)]
Merge branch 'wip-osd-leaks' into next

Reviewed-by: David Zafman <david.zafman@inktank.com>
12 years agoosd: fix msg leak on shutdown in ms_dispatch
Sage Weil [Fri, 31 May 2013 21:46:54 +0000 (14:46 -0700)]
osd: fix msg leak on shutdown in ms_dispatch

Reported-by: David Zafman <david.zafman@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: reset heartbeat peers during shutdown
Sage Weil [Fri, 31 May 2013 05:12:04 +0000 (22:12 -0700)]
osd: reset heartbeat peers during shutdown

This fixes a leak of the Connection's and related structures.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/MonClient: fix leak of MMonGetVersionReply
Sage Weil [Fri, 31 May 2013 05:04:48 +0000 (22:04 -0700)]
mon/MonClient: fix leak of MMonGetVersionReply

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix leak of MOSDMarkMeDown
Sage Weil [Fri, 31 May 2013 05:02:07 +0000 (22:02 -0700)]
osd: fix leak of MOSDMarkMeDown

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #338 from alram/next
Sage Weil [Fri, 31 May 2013 19:47:24 +0000 (12:47 -0700)]
Merge pull request #338 from alram/next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoupstart: handle upper case in cluster name and id 338/head
Alexandre Marangone [Fri, 31 May 2013 19:33:11 +0000 (12:33 -0700)]
upstart: handle upper case in cluster name and id

Signed-off-by: Alexandre Marangone <alexandre.marangone@inktank.com>
12 years agodoc: Added Java example for setting protocol to HTTP.
John Wilkins [Fri, 31 May 2013 18:15:20 +0000 (11:15 -0700)]
doc: Added Java example for setting protocol to HTTP.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Text of diagram for osd_throttles.
John Wilkins [Fri, 31 May 2013 18:14:29 +0000 (11:14 -0700)]
doc: Text of diagram for osd_throttles.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Omitted text diagram, and used literal include to text file.
John Wilkins [Fri, 31 May 2013 18:14:04 +0000 (11:14 -0700)]
doc: Omitted text diagram, and used literal include to text file.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoPGLog: only add entry to caller_ops in add() if reqid_is_indexed()
Samuel Just [Fri, 31 May 2013 18:08:47 +0000 (11:08 -0700)]
PGLog: only add entry to caller_ops in add() if reqid_is_indexed()

Fixes: #5216
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoPG: don't write out pg map epoch every handle_activate_map
Samuel Just [Mon, 15 Apr 2013 23:33:48 +0000 (16:33 -0700)]
PG: don't write out pg map epoch every handle_activate_map

We don't actually need to write out the pg map epoch on every
activate_map as long as:
a) the osd does not trim past the oldest pg map persisted
b) the pg does update the persisted map epoch from time
to time.

To that end, we now keep a reference to the last map persisted.
The OSD already does not trim past the oldest live OSDMapRef.
Second, handle_activate_map will trim if the difference between
the current map and the last_persisted_map is large enough.

Fixes: #4731
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 2c5a9f0e178843e7ed514708bab137def840ab89)

Conflicts:

src/common/config_opts.h
src/osd/PG.cc
- last_persisted_osdmap_ref gets set in the non-static
  PG::write_info

Conflicts:

src/osd/PG.cc
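
The idea in the commit above, persisting the map epoch only once it has drifted far enough from the last persisted one, might look like this in outline. This is a Python sketch under assumptions: `PersistedEpochTracker` and `max_lag` are hypothetical names, and the real code tracks an OSDMapRef rather than a bare integer.

```python
class PersistedEpochTracker:
    """Toy model: write the map epoch only when it lags by more than max_lag."""

    def __init__(self, max_lag):
        self.max_lag = max_lag        # hypothetical threshold, in epochs
        self.last_persisted = 0
        self.writes = 0

    def handle_activate_map(self, current_epoch):
        # Skip the write unless the persisted epoch has fallen far enough behind.
        if current_epoch - self.last_persisted > self.max_lag:
            self.last_persisted = current_epoch
            self.writes += 1
            return True
        return False


tracker = PersistedEpochTracker(max_lag=10)
for epoch in range(1, 31):
    tracker.handle_activate_map(epoch)
print(tracker.writes, tracker.last_persisted)  # 2 22
```

In this model thirty map activations cause two writes instead of thirty. Condition (a) from the commit message still has to hold separately: nothing older than `last_persisted` may be trimmed.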

12 years agorgw: only append prefetched data if reading from head
Yehuda Sadeh [Thu, 30 May 2013 19:58:11 +0000 (12:58 -0700)]
rgw: only append prefetched data if reading from head

Fixes: #5209
Backport: bobtail, cuttlefish
If the head object wrongfully contains data, but according to the
manifest we don't read from the head, we shouldn't copy the prefetched
data. Also fix the length calculation for that data.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agorgw: don't copy object idtag when copying object
Yehuda Sadeh [Thu, 30 May 2013 16:34:21 +0000 (09:34 -0700)]
rgw: don't copy object idtag when copying object

Fixes: #5204
When copying an object we ended up also copying the original
object idtag, which overrode the newly generated one. When
refcount put is called with the wrong idtag the count
doesn't go down.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge branch 'wip-5046'
Samuel Just [Fri, 31 May 2013 05:39:12 +0000 (22:39 -0700)]
Merge branch 'wip-5046'

Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agomon: destroy MonitorDBStore before g_ceph_context
Sage Weil [Fri, 31 May 2013 04:43:50 +0000 (21:43 -0700)]
mon: destroy MonitorDBStore before g_ceph_context

Put it on the heap so that we can destroy it before the g_ceph_context
cct that it references.  This fixes a crash like

*** Caught signal (Segmentation fault) **
in thread 4034a80
ceph version 0.63-204-gcf9aa7a (cf9aa7a0037e56eada8b3c1bb59d59d0bfe7bba5)
1: ceph-mon() [0x59932a]
2: (()+0xfcb0) [0x4e41cb0]
3: (Mutex::Lock(bool)+0x1b) [0x6235bb]
4: (PerfCountersCollection::remove(PerfCounters*)+0x27) [0x6a0877]
5: (LevelDBStore::~LevelDBStore()+0x1b) [0x582b2b]
6: (LevelDBStore::~LevelDBStore()+0x9) [0x582da9]
7: (main()+0x1386) [0x48db16]
8: (__libc_start_main()+0xed) [0x658076d]
9: ceph-mon() [0x4909ad]

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: Updated to reflect glossary usage.
John Wilkins [Fri, 31 May 2013 03:28:22 +0000 (20:28 -0700)]
doc: Updated to reflect glossary usage.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated title and syntax to reflect glossary usage.
John Wilkins [Fri, 31 May 2013 03:27:42 +0000 (20:27 -0700)]
doc: Updated title and syntax to reflect glossary usage.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated to reflect glossary usage.
John Wilkins [Fri, 31 May 2013 03:27:01 +0000 (20:27 -0700)]
doc: Updated to reflect glossary usage.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated title to reflect glossary usage.
John Wilkins [Fri, 31 May 2013 03:26:03 +0000 (20:26 -0700)]
doc: Updated title to reflect glossary usage.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated conf with ServerAlias for S3 subdomains.
John Wilkins [Fri, 31 May 2013 03:25:25 +0000 (20:25 -0700)]
doc: Updated conf with ServerAlias for S3 subdomains.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated object storage quick start for S3-style subdomains.
John Wilkins [Fri, 31 May 2013 03:24:55 +0000 (20:24 -0700)]
doc: Updated object storage quick start for S3-style subdomains.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated text with new glossary terms.
John Wilkins [Fri, 31 May 2013 03:22:58 +0000 (20:22 -0700)]
doc: Updated text with new glossary terms.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Removed FAQ from the index.
John Wilkins [Fri, 31 May 2013 03:21:48 +0000 (20:21 -0700)]
doc: Removed FAQ from the index.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Removed FAQ doc. It's now in the wiki.
John Wilkins [Fri, 31 May 2013 03:21:20 +0000 (20:21 -0700)]
doc: Removed FAQ doc. It's now in the wiki.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodebian: guard upstart {start,stop} with -x check
Sage Weil [Fri, 31 May 2013 00:23:36 +0000 (17:23 -0700)]
debian: guard upstart {start,stop} with -x check

Sigh.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-deb-removal' into next
Sage Weil [Fri, 31 May 2013 00:17:43 +0000 (17:17 -0700)]
Merge branch 'wip-deb-removal' into next

Tested by Tamil, Gary.

12 years agorbd/kernel.sh: quit looking for snapshot sysfs entries
Alex Elder [Thu, 30 May 2013 23:10:46 +0000 (18:10 -0500)]
rbd/kernel.sh: quit looking for snapshot sysfs entries

The sysfs entries for snapshots went away a while ago, and this
script used them to verify sizes matched what was expected.

Instead, look at the mapped size of the snapshot in the places
that used to look for the image's snapshot sysfs files.

Also, switch over to using "udevadm settle" rather than a delay to
wait for udev to do its thing.  Insert them at more appropriate
places--right after "rbd map" commands and before and after the
"rbd unmap" calls.

Stop doing the manual refresh calls as well.  The osd will trigger
refreshes whenever the image size or snapshot context changes.

Finally, the cleanup routine is called initially, when there really
isn't expected to be anything to clean up.  Change the rbd commands
to run there conditionally, only if the target of the command
already exists.

Signed-off-by: Alex Elder <elder@inktank.com>
12 years agoMerge pull request #334 from ceph/wip-mon
Sage Weil [Thu, 30 May 2013 23:27:02 +0000 (16:27 -0700)]
Merge pull request #334 from ceph/wip-mon

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agodebian: add radosgw.postinst
Sage Weil [Thu, 30 May 2013 23:22:54 +0000 (16:22 -0700)]
debian: add radosgw.postinst

Start radosgw-all job.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodebian: invoke-rc.d does not work with upstart jobs
Sage Weil [Thu, 30 May 2013 23:22:40 +0000 (16:22 -0700)]
debian: invoke-rc.d does not work with upstart jobs

Broken by 19c5ac37ef87aeb3d3c30aa35cd61b6f3a8414bf.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agofix test users of LevelDBStore
Sage Weil [Thu, 30 May 2013 22:53:35 +0000 (15:53 -0700)]
fix test users of LevelDBStore

Need to pass in cct.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomove log, ondisklog, missing from PG to PGLog
Loic Dachary [Wed, 22 May 2013 12:14:26 +0000 (14:14 +0200)]
move log, ondisklog, missing from PG to PGLog

PG::log, PG::ondisklog, PG::missing are moved from PG to a new PGLog
class and are made protected data members. It is a preliminary step
before writing unit tests to cover the methods that have side effects
on these data members and define a clean PGLog API. It improves
encapsulation and does not change any of the logic already in
place.

Possible issues :

* an additional reference (PG->PGLog->IndexedLog instead of
  PG->IndexedLog for instance) is introduced : is it optimized ?

* rewriting log.log into pg_log.get_log().log affects the readability
  but should be optimized and have no impact on performance

The guidelines followed for this patch are:

* const access to the data members is preserved, no attempt is made
  to define accessors

* all non const methods are in PGLog, no access to non const methods of
  PGLog::log, PGLog::logondisk and PGLog::missing is provided

* when methods are moved from PG to PGLog the change to their
  implementation is restricted to the minimum.

* the PG::OndiskLog and PG::IndexedLog sub classes are moved
  to PGLog sub classes unmodified and remain public

A const version of the pg_log_t::find_entry method was added.

A const accessor is provided for PGLog::get_log, PGLog::get_missing,
PGLog::get_ondisklog but no non-const accessor.

Arguments are added to most of the methods moved from PG to PGLog so
that they can get access to PG data members such as info or log_oid.

The PGLog methods are sorted according to the data member they modify.

//////////////////// missing ////////////////////

* The pg_missing_t::{got,have,need,add,rm} methods are wrapped as
  PGLog::missing_{got,have,need,add,rm}

//////////////////// log ////////////////////

* PGLog::get_tail, PGLog::get_head getters are created

* PGLog::set_tail, PGLog::set_head, PGLog::set_last_requested setters
  are created

* PGLog::index, PGLog::unindex, PGLog::add wrappers,
  PGLog::reset_recovery_pointers are created

* PGLog::clear_info_log replaces PG::clear_info_log

* PGLog::trim replaces PG::trim

//////////////////// log & missing ////////////////////

* PGLog::claim_log is created with code extracted from
  PG::RecoveryState::Stray::react.

* PGLog::split_into is created with code extracted from
  PG::split_into.

* PGLog::recover_got is created with code extracted from
  ReplicatedPG::recover_got.

* PGLog::activate_not_complete is created with code extracted
  from PG::active

* PGLog::proc_replica_log is created with code extracted from
  PG::proc_replica_log

* PGLog::write_log is created with code extracted from
  PG::write_log

* PGLog::merge_old_entry replaces PG::merge_old_entry
  The remove_snap argument is used to collect hobject_t

* PGLog::rewind_divergent_log replaces PG::rewind_divergent_log
  The remove_snap argument is used to collect hobject_t
  A new PG::rewind_divergent_log method is added to call
  remove_snap_mapped_object on each of the remove_snap
  elements

* PGLog::merge_log replaces PG::merge_log
  The remove_snap argument is used to collect hobject_t
  A new PG::merge_log method is added to call
  remove_snap_mapped_object on each of the remove_snap
  elements

* PGLog::write_log is created with code extracted from PG::write_log. A
  non-static version is created for convenience but is a simple
  wrapper.

* PGLog::read_log replaces PG::read_log. A non-static version is
  created for convenience but is a simple wrapper.

* PGLog::read_log_old replaces PG::read_log_old.

http://tracker.ceph.com/issues/5046 refs #5046

Signed-off-by: Loic Dachary <loic@dachary.org>
12 years agoos/WBThrottle: remove asserts in clear()
Samuel Just [Thu, 30 May 2013 22:27:27 +0000 (15:27 -0700)]
os/WBThrottle: remove asserts in clear()

cur_ios, etc may not be zero due to an in progress
flush.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge pull request #335 from ceph/wip-5176
Sage Weil [Thu, 30 May 2013 22:04:21 +0000 (15:04 -0700)]
Merge pull request #335 from ceph/wip-5176

Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoos/LevelDBStore: add perfcounters 335/head
Sage Weil [Thu, 30 May 2013 21:57:42 +0000 (14:57 -0700)]
os/LevelDBStore: add perfcounters

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: make compaction bounds overlap
Sage Weil [Thu, 30 May 2013 21:36:41 +0000 (14:36 -0700)]
mon: make compaction bounds overlap

When we trim items N to M, compact over range (N-1) to M so that the
items in the queue will share bounds and get merged.  There is no harm in
compacting over a larger range here when the lower bound is a key that
doesn't exist anyway.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoos/LevelDBStore: merge adjacent ranges in compactionqueue
Sage Weil [Thu, 30 May 2013 21:26:42 +0000 (14:26 -0700)]
os/LevelDBStore: merge adjacent ranges in compactionqueue

If we get behind and multiple adjacent ranges end up in the queue, merge
them so that we fire off compaction on larger ranges.

Signed-off-by: Sage Weil <sage@inktank.com>
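
The two compaction commits above work together: queued ranges that share a bound get merged, and trimming N to M compacts (N-1) to M so that consecutive trims do share a bound. A rough Python sketch of the merging step follows. `enqueue_range` and the list-based queue are hypothetical stand-ins, not the LevelDBStore code, and integer keys are used only to keep the example readable.

```python
def enqueue_range(queue, start, end):
    """Add (start, end) to queue, merging with any queued range sharing a bound."""
    merged = True
    while merged:
        merged = False
        for i, (s, e) in enumerate(queue):
            if e == start:        # queued range ends where the new one begins
                start = s
            elif s == end:        # queued range begins where the new one ends
                end = e
            else:
                continue
            del queue[i]          # absorb it and rescan with the wider range
            merged = True
            break
    queue.append((start, end))


queue = []
# Trim 10..20 compacts (9, 20); trim 21..30 compacts (20, 30).
enqueue_range(queue, 9, 20)
enqueue_range(queue, 20, 30)
print(queue)  # [(9, 30)]
```

Had the second range been queued as (21, 30), it would not touch (10, 20) and the two would stay separate, which is why the lower bound is widened by one.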
12 years agodoc: note openstack changes for Grizzly
Josh Durgin [Thu, 30 May 2013 21:17:35 +0000 (14:17 -0700)]
doc: note openstack changes for Grizzly

These are just for the cinder configuration, nothing else changed.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoAdded -r option to usage
Christophe Courtaut [Wed, 29 May 2013 09:07:24 +0000 (11:07 +0200)]
Added -r option to usage

Added the -r option, which starts the radosgw and apache2 to access it
to the usage message.

Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>
12 years agoMerge pull request #333 from ceph/wip-5203
Sage Weil [Thu, 30 May 2013 18:42:45 +0000 (11:42 -0700)]
Merge pull request #333 from ceph/wip-5203

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon: fix leak of health_monitor and config_key_service 334/head
Sage Weil [Thu, 30 May 2013 18:07:06 +0000 (11:07 -0700)]
mon: fix leak of health_monitor and config_key_service

Switch to using regular pointers here.  The lifecycle of these services is
very simple such that refcounting is overkill.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: return instead of exit(3) via preforker
Sage Weil [Thu, 30 May 2013 00:54:17 +0000 (17:54 -0700)]
mon: return instead of exit(3) via preforker

This lets us run all the locally-scoped dtors so that leak checking will
work.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: Monitor: backup monmap using all ceph features instead of quorum's 333/head
Joao Eduardo Luis [Thu, 30 May 2013 17:17:28 +0000 (18:17 +0100)]
mon: Monitor: backup monmap using all ceph features instead of quorum's

When a monitor is freshly created and for some reason its initial sync is
aborted, it will end up with an incorrect backup monmap.  This monmap is
incorrect in the sense that it will not contain the monitor's names as
it will expect on the next run.

This happens because we use the quorum features to encode the monmap
when backing it up, instead of CEPH_FEATURES_ALL.

Fixes: #5203
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agodebian: stop radosgw daemons on package removal
Sage Weil [Thu, 30 May 2013 15:53:22 +0000 (08:53 -0700)]
debian: stop radosgw daemons on package removal

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodebian: stop sysvinit ceph-mds daemons
Sage Weil [Thu, 30 May 2013 15:53:05 +0000 (08:53 -0700)]
debian: stop sysvinit ceph-mds daemons

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodebian: only stop daemons on removal; not upgrade
Sage Weil [Thu, 30 May 2013 15:51:16 +0000 (08:51 -0700)]
debian: only stop daemons on removal; not upgrade

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd/concurrent.sh: probe rbd module at start
Alex Elder [Thu, 30 May 2013 15:10:16 +0000 (10:10 -0500)]
rbd/concurrent.sh: probe rbd module at start

There's no guarantee the rbd module is loaded when this script is
run, so add a line that loads it if necessary.

Signed-off-by: Alex Elder <elder@inktank.com>
12 years agoMerge pull request #331 from ceph/wip-osd-interfacecheck
Sage Weil [Thu, 30 May 2013 05:45:37 +0000 (22:45 -0700)]
Merge pull request #331 from ceph/wip-osd-interfacecheck

Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge branch 'next'
Sage Weil [Thu, 30 May 2013 05:44:40 +0000 (22:44 -0700)]
Merge branch 'next'

12 years agoosd: wait for healthy pings from peers in waiting-for-healthy state 331/head
Sage Weil [Wed, 29 May 2013 20:26:45 +0000 (13:26 -0700)]
osd: wait for healthy pings from peers in waiting-for-healthy state

If we are (wrongly) marked down, we need to go into the waiting-for-healthy
state and verify that our network interfaces are working before trying to
rejoin the cluster.

 - make _is_healthy() check require positive proof of pings working
 - do heartbeat checks and updates in this state
 - reset the random peers every heartbeat_interval, in case we keep picking
   bad ones

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: distinguish between definitely healthy and definitely not unhealthy
Sage Weil [Wed, 29 May 2013 20:15:41 +0000 (13:15 -0700)]
osd: distinguish between definitely healthy and definitely not unhealthy

is_unhealthy() will assume they are healthy for some period after we
send our first ping attempt.  is_healthy() is now a strict check that we
know they are healthy.

Switch the failure report check to use is_unhealthy(); use is_healthy()
everywhere else, including the waiting-for-healthy pre-boot checks.

Signed-off-by: Sage Weil <sage@inktank.com>
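
The difference between the two checks in the commit above can be made concrete with a small sketch. It is a Python model under assumptions: the `HeartbeatPeer` fields `first_tx` and `last_rx`, and the `cutoff` argument (meaning "now minus the grace period"), are illustrative names and not necessarily those used in the OSD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatPeer:
    first_tx: Optional[float] = None  # when we first pinged this peer
    last_rx: Optional[float] = None   # when we last heard a reply

    def is_unhealthy(self, cutoff):
        """True only if the peer has had a fair chance to reply and has not."""
        if self.first_tx is None or self.first_tx >= cutoff:
            return False              # too early to judge; assume healthy
        return self.last_rx is None or self.last_rx < cutoff

    def is_healthy(self, cutoff):
        """Strict check: we have positive proof of a recent reply."""
        return self.last_rx is not None and self.last_rx >= cutoff


peer = HeartbeatPeer(first_tx=100.0)
print(peer.is_healthy(cutoff=95.0), peer.is_unhealthy(cutoff=95.0))    # False False
print(peer.is_healthy(cutoff=105.0), peer.is_unhealthy(cutoff=105.0))  # False True
```

The first line of output is the interesting case: a freshly pinged peer is neither definitely healthy nor definitely unhealthy, so failure reports (which use `is_unhealthy()`) stay quiet while the pre-boot checks (which use `is_healthy()`) keep waiting.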
12 years agoosd: remove down hb peers
Sage Weil [Wed, 29 May 2013 19:24:28 +0000 (12:24 -0700)]
osd: remove down hb peers

If a (say, random) peer goes down, filter it out.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: only add pg peers if active
Sage Weil [Wed, 29 May 2013 19:24:04 +0000 (12:24 -0700)]
osd: only add pg peers if active

We will soon be in this method for the waiting-for-healthy state.  As
a consequence, we need to remove any down peers.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: factor out _remove_heartbeat_peer
Sage Weil [Wed, 29 May 2013 19:16:28 +0000 (12:16 -0700)]
osd: factor out _remove_heartbeat_peer

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: augment osd heartbeat peers with neighbors and randoms, to up some min
Sage Weil [Wed, 29 May 2013 18:27:38 +0000 (11:27 -0700)]
osd: augment osd heartbeat peers with neighbors and randoms, to up some min

- always include our neighbors to ensure we have a fully-connected
  graph
- include some random neighbors to get at least some min number of peers.

Signed-off-by: Sage Weil <sage@inktank.com>
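
One way to picture the peer selection in the commit above is the sketch below. It rests on an assumption the commit message does not spell out: that "neighbors" means the next and previous up OSD by id, wrapping around. `choose_heartbeat_peers` and its parameters are hypothetical, and the code is Python rather than the OSD's C++.

```python
import random


def choose_heartbeat_peers(me, pg_peers, up_osds, min_peers, rng=random):
    """Pick heartbeat peers: PG peers, ring neighbors, then randoms up to min_peers."""
    ring = sorted(up_osds)
    peers = set(pg_peers) - {me}
    others = [o for o in ring if o != me]
    if me in ring and others:
        i = ring.index(me)
        # Always include both neighbors so the ping graph stays connected.
        peers.add(ring[(i + 1) % len(ring)])
        peers.add(ring[(i - 1) % len(ring)])
    # Top up with random up OSDs until we reach the minimum.
    candidates = [o for o in others if o not in peers]
    rng.shuffle(candidates)
    while len(peers) < min_peers and candidates:
        peers.add(candidates.pop())
    return peers


peers = choose_heartbeat_peers(3, {7}, range(10), min_peers=5,
                               rng=random.Random(0))
print(sorted(peers))
```

Because every OSD pings its two ring neighbors, the heartbeat graph contains a cycle through all up OSDs regardless of how PGs are mapped; the random extras only raise the peer count.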
12 years agoosd: initialize new_state field when we use it
Sage Weil [Wed, 29 May 2013 23:50:04 +0000 (16:50 -0700)]
osd: initialize new_state field when we use it

If we use operator[] on a new int field its value is undefined; avoid
reading it or using |= et al until we initialize it.

Fixes: #4967
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
12 years agoMerge branch 'wip_osd_throttle'
Samuel Just [Wed, 29 May 2013 22:06:18 +0000 (15:06 -0700)]
Merge branch 'wip_osd_throttle'

Fixes: #4782
Reviewed-by: Sage Weil
12 years agoWBThrottle: add some comments and some asserts
Samuel Just [Wed, 29 May 2013 22:05:51 +0000 (15:05 -0700)]
WBThrottle: add some comments and some asserts

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoWBThrottle: rename replica nocache
Samuel Just [Wed, 29 May 2013 22:05:34 +0000 (15:05 -0700)]
WBThrottle: rename replica nocache

We may want to influence the caching behavior for other
reasons.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: move health checks into a single helper
Sage Weil [Mon, 27 May 2013 22:27:59 +0000 (15:27 -0700)]
osd: move health checks into a single helper

For now we still only look at the internal heartbeats.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: avoid duplicate mon requests for a new osdmap
Sage Weil [Wed, 29 May 2013 20:16:24 +0000 (13:16 -0700)]
osd: avoid duplicate mon requests for a new osdmap

sub_want() returns true if this is a new sub; only renew then.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: tell peers that ping us if they are dead
Sage Weil [Wed, 29 May 2013 20:16:01 +0000 (13:16 -0700)]
osd: tell peers that ping us if they are dead

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: simplify is_healthy() check during boot
Sage Weil [Mon, 27 May 2013 22:24:56 +0000 (15:24 -0700)]
osd: simplify is_healthy() check during boot

This has a slight behavior change in that we ask the mon for the latest
osdmap if our internal heartbeat is failing.  That isn't useful yet, but
will be shortly.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: stay in SCAN state in file_eval
Sage Weil [Tue, 28 May 2013 17:51:11 +0000 (10:51 -0700)]
mds: stay in SCAN state in file_eval

If we are in the SCAN state, stay there until the recovery finishes.  Do
not jump to another state from file_eval().

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0071b8e75bd3f5a09cc46e2225a018f6d1ef0680)

12 years agomds: stay in SCAN state in file_eval
Sage Weil [Tue, 28 May 2013 17:51:11 +0000 (10:51 -0700)]
mds: stay in SCAN state in file_eval

If we are in the SCAN state, stay there until the recovery finishes.  Do
not jump to another state from file_eval().

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMakefile: include new message header files
Sage Weil [Tue, 28 May 2013 22:52:46 +0000 (15:52 -0700)]
Makefile: include new message header files

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'yan/wip-mds'
Sage Weil [Wed, 29 May 2013 17:26:56 +0000 (10:26 -0700)]
Merge remote-tracking branch 'yan/wip-mds'

Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
src/mds/MDCache.cc

12 years agoosd: do not assume head obc object exists when getting snapdir
Sage Weil [Wed, 29 May 2013 16:49:11 +0000 (09:49 -0700)]
osd: do not assume head obc object exists when getting snapdir

For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists.  This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffIterateStress
test when thrashing OSDs.

Fixes: #5183
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agomon: compact trimmed range, not entire prefix
Sage Weil [Wed, 29 May 2013 15:40:32 +0000 (08:40 -0700)]
mon: compact trimmed range, not entire prefix

This will reduce the work that leveldb is asked to do by only triggering
compaction of the keys that were just trimmed.

We may want to further reduce the work by compacting less frequently, but
this is at least a step in that direction.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/MonitorDBStore: allow compaction of ranges
Sage Weil [Wed, 29 May 2013 15:35:44 +0000 (08:35 -0700)]
mon/MonitorDBStore: allow compaction of ranges

Allow a transaction to describe the compaction of a range of keys.  Do this
in a backward-compatible way, such that older code will interpret the
compaction of a prefix + range as compaction of the entire prefix.  This
allows us to avoid introducing any new feature bits.

Signed-off-by: Sage Weil <sage@inktank.com>
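
The backward-compatibility trick in the commit above can be sketched as follows. This is a Python illustration with made-up structures: the dictionary-based op and the `apply_compact_old` / `apply_compact_new` functions are hypothetical and do not reflect the actual transaction encoding.

```python
def apply_compact_old(op):
    """Older code: only knows about the prefix and ignores any range fields."""
    return ("compact_prefix", op["prefix"])


def apply_compact_new(op):
    """Newer code: compacts just the range when one is given."""
    start, end = op.get("start", ""), op.get("end", "")
    if not start and not end:
        return ("compact_prefix", op["prefix"])
    return ("compact_range", op["prefix"], start, end)


op = {"prefix": "paxos", "start": "100", "end": "200"}
print(apply_compact_old(op))  # ('compact_prefix', 'paxos')
print(apply_compact_new(op))  # ('compact_range', 'paxos', '100', '200')
```

An older monitor does more work than necessary on such an op but still ends up with the requested range compacted, so no feature bit is needed to gate the new form.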
12 years agoos/LevelDBStore: allow compaction of key ranges
Sage Weil [Wed, 29 May 2013 15:34:13 +0000 (08:34 -0700)]
os/LevelDBStore: allow compaction of key ranges

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #329 from javacruft/wip-fuse-deps
Sage Weil [Wed, 29 May 2013 15:14:27 +0000 (08:14 -0700)]
Merge pull request #329 from javacruft/wip-fuse-deps

Use new fuse package instead of fuse-utils

12 years agoUse new fuse package instead of fuse-utils 329/head
James Page [Wed, 29 May 2013 09:57:17 +0000 (10:57 +0100)]
Use new fuse package instead of fuse-utils

The fuse-utils package was deprecated a while ago.

Switch the primary dependency for fuse tools to use
the preferred 'fuse' package.

Signed-off-by: James Page <james.page@ubuntu.com>
12 years agomon: disable tdump by default
Sage Weil [Wed, 29 May 2013 05:13:11 +0000 (22:13 -0700)]
mon: disable tdump by default

Grr.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/last'
Sage Weil [Wed, 29 May 2013 05:10:21 +0000 (22:10 -0700)]
Merge remote-tracking branch 'gh/last'

12 years agoMerge branch 'wip-5172'
Sage Weil [Wed, 29 May 2013 03:44:48 +0000 (20:44 -0700)]
Merge branch 'wip-5172'

Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoos/LevelDBStore: do compact_prefix() work asynchronously
Sage Weil [Tue, 28 May 2013 23:35:55 +0000 (16:35 -0700)]
os/LevelDBStore: do compact_prefix() work asynchronously

We generally do not want to block while compacting a range of leveldb.
Push the blocking+waiting off to a separate thread.  (leveldb will do what
it can to avoid blocking internally; no reason for us to wait explicitly.)

This addresses part of #5176.

Signed-off-by: Sage Weil <sage@inktank.com>
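
Pushing the blocking compaction work onto a separate thread, as the commit above describes, follows a common producer/consumer shape. The sketch below is Python and purely illustrative: `AsyncCompactor`, `compact_range_async` and `shutdown` are hypothetical names, and `compact_fn` stands in for whatever blocking call performs the compaction.

```python
import queue
import threading


class AsyncCompactor:
    """Run blocking compaction requests on a background thread."""

    def __init__(self, compact_fn):
        self._compact_fn = compact_fn
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def compact_range_async(self, start, end):
        # Returns immediately; the caller never waits on the compaction.
        self._queue.put((start, end))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:      # sentinel: stop the worker
                break
            self._compact_fn(*item)

    def shutdown(self):
        self._queue.put(None)
        self._thread.join()


done = []
compactor = AsyncCompactor(lambda start, end: done.append((start, end)))
compactor.compact_range_async("a", "m")
compactor.compact_range_async("m", "z")
compactor.shutdown()
print(done)  # [('a', 'm'), ('m', 'z')]
```

A single worker thread keeps requests in submission order, and the queue is also the natural place to merge adjacent ranges before they are handed to the store.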
12 years agoosd: fix note_down_osd
Sage Weil [Wed, 29 May 2013 03:38:43 +0000 (20:38 -0700)]
osd: fix note_down_osd

Fix bug introduced in 27381c0c6259ac89f5f9c592b4bfb585937a1cfc.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix hb con failure handler
Sage Weil [Wed, 29 May 2013 03:39:30 +0000 (20:39 -0700)]
osd: fix hb con failure handler

Fix a few bugs introduced by 27381c0c6259ac89f5f9c592b4bfb585937a1cfc:

- check against both front and back cons; either one may have failed.
- close *both* front and back before reopening either.  this is
  overkill, but slightly simpler code.
- fix leak of con when marking down
- handle race against osdmap update and note_down_osd

Fixes: #5172
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #319 from dalgaaf/wip-da-pylint-3
Sage Weil [Wed, 29 May 2013 02:52:41 +0000 (19:52 -0700)]
Merge pull request #319 from dalgaaf/wip-da-pylint-3

Fix some smaller Python issues

12 years agoMerge pull request #326 from dalgaaf/wip-da-CID-727978
Sage Weil [Tue, 28 May 2013 22:48:11 +0000 (15:48 -0700)]
Merge pull request #326 from dalgaaf/wip-da-CID-727978

kv_flat_btree_async.cc: fix AioCompletion resource leak

12 years agov0.63 v0.63
Gary Lowell [Tue, 28 May 2013 20:58:22 +0000 (13:58 -0700)]
v0.63

12 years agoHashIndex: sync top directory during start_split,merge,col_split
Samuel Just [Tue, 28 May 2013 18:10:05 +0000 (11:10 -0700)]
HashIndex: sync top directory during start_split,merge,col_split

Otherwise, the links might be ordered after the in progress
operation tag write.  We need the in progress operation tag to
correctly recover from an interrupted merge, split, or col_split.

Fixes: #5180
Backport: cuttlefish, bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agodoc/dev/osd_internals: add wbthrottle.rst 332/head
Samuel Just [Fri, 24 May 2013 20:35:14 +0000 (13:35 -0700)]
doc/dev/osd_internals: add wbthrottle.rst

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoWBThrottle: add perfcounters
Samuel Just [Tue, 28 May 2013 17:41:52 +0000 (10:41 -0700)]
WBThrottle: add perfcounters

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge pull request #325 from dalgaaf/wip-da-CID-727980
Sage Weil [Tue, 28 May 2013 17:27:56 +0000 (10:27 -0700)]
Merge pull request #325 from dalgaaf/wip-da-CID-727980

kv_flat_btree_async.cc: fix AioCompletion resource leak

12 years agoMerge pull request #324 from dalgaaf/wip-da-CID-727979
Sage Weil [Tue, 28 May 2013 17:27:25 +0000 (10:27 -0700)]
Merge pull request #324 from dalgaaf/wip-da-CID-727979

kv_flat_btree_async.cc: fix AioCompletion resource leak

12 years agoosd/OSDMap: fix Incremental dump
Sage Weil [Tue, 28 May 2013 16:16:17 +0000 (09:16 -0700)]
osd/OSDMap: fix Incremental dump

The front hb addr entry may not be present.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #322 from guilhem/patch-1
Sage Weil [Tue, 28 May 2013 15:43:10 +0000 (08:43 -0700)]
Merge pull request #322 from guilhem/patch-1

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agokv_flat_btree_async.cc: fix AioCompletion resource leak 326/head
Danny Al-Gaaf [Tue, 28 May 2013 10:43:12 +0000 (12:43 +0200)]
kv_flat_btree_async.cc: fix AioCompletion resource leak

Call AioCompletion::release() if the completion is no longer needed.

CID 727978 (#1-2 of 2): Resource leak (RESOURCE_LEAK)
  leaked_storage: Variable "obj_aioc" going out of scope leaks the
  storage it points to.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agokv_flat_btree_async.cc: fix AioCompletion resource leak 324/head
Danny Al-Gaaf [Tue, 28 May 2013 10:38:57 +0000 (12:38 +0200)]
kv_flat_btree_async.cc: fix AioCompletion resource leak

Call AioCompletion::release() if the completion is no longer needed.

CID 727979 (#1-2 of 2): Resource leak (RESOURCE_LEAK)
  leaked_storage: Variable "a" going out of scope leaks the storage
  it points to.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agokv_flat_btree_async.cc: fix AioCompletion resource leak 325/head
Danny Al-Gaaf [Tue, 28 May 2013 10:27:37 +0000 (12:27 +0200)]
kv_flat_btree_async.cc: fix AioCompletion resource leak

Call AioCompletion::release() if the completion is no longer
needed.

CID 727980 (#1-4 of 4): Resource leak (RESOURCE_LEAK)
  leaked_storage: Variable "aioc" going out of scope leaks
  the storage it points to.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>