git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
13 years agodoc: discuss choice of pg_num
Sage Weil [Mon, 16 Jul 2012 23:44:05 +0000 (16:44 -0700)]
doc: discuss choice of pg_num

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: simplify log logic a bit
Sage Weil [Mon, 16 Jul 2012 23:18:51 +0000 (16:18 -0700)]
log: simplify log logic a bit

Whether an entry is eligible to log/dump is independent of the channel it
is sent to.  Some channels impose additional restrictions.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge branch 'next'
Josh Durgin [Tue, 17 Jul 2012 00:36:06 +0000 (17:36 -0700)]
Merge branch 'next'

13 years agoRobustify ceph-rbdnamer and adapt udev rules
Pascal de Bruijn | Unilogic Networks B.V [Wed, 11 Jul 2012 13:23:16 +0000 (15:23 +0200)]
Robustify ceph-rbdnamer and adapt udev rules

Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

  /dev/rbd/mypool/myrbd -> /dev/rbd1

While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

The old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it yields 3 for rbd3 but 1 for rbd3p1, so it can't be
depended upon for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust and it
can now be called in various ways:

  /usr/bin/ceph-rbdnamer /dev/rbd3
  /usr/bin/ceph-rbdnamer /dev/rbd3p1
  /usr/bin/ceph-rbdnamer rbd3
  /usr/bin/ceph-rbdnamer rbd3p1
  /usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
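
The calling styles listed above can be illustrated with a small sketch.
The real ceph-rbdnamer is a shell script; the Python below is not that
script. The helper name parse_rbd_arg and the regular expression are
assumptions made only to show how all five forms can reduce to the same
device number.

```python
import re

# Hypothetical pattern: optional "/dev/", optional "rbd", the device
# number, and an optional "pN" partition suffix.
_RBD_ARG = re.compile(r"^(?:/dev/)?(?:rbd)?(\d+)(?:p(\d+))?$")


def parse_rbd_arg(arg):
    """Return (device_number, partition_number or None) for one argument."""
    match = _RBD_ARG.match(arg)
    if match is None:
        raise ValueError("not an rbd device name: %r" % arg)
    dev, part = match.groups()
    return int(dev), (int(part) if part is not None else None)
```

Every form in the list resolves to device 3, which is why passing the
kernel name (%k) is enough for the script to find the right rbd.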

With that fixed, we hit the second problem. We ended up with:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:

  /dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

  /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
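
As a rough sketch of the naming scheme only (the actual change lives in
the udev rules, not in Python), the disk and partition cases can be
kept apart like this. The function name rbd_symlink_path is
hypothetical.

```python
def rbd_symlink_path(pool, image, partition=None):
    """Build the symlink path for a whole rbd or for one of its partitions."""
    base = "/dev/rbd/%s/%s" % (pool, image)
    if partition is None:
        # Whole disk: the plain image name, as before.
        return base
    # Partition: gets its own "-partN" link, so it never overwrites the disk link.
    return "%s-part%d" % (base, partition)
```

Because the two cases produce different paths, discovering rbd3p1 after
rbd3 no longer replaces the link for the disk.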

Please let me know any feedback you have on this patch or the approach
used.

Regards,
Pascal de Bruijn
Unilogic B.V.

Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agodoc/radosgw/config.rst: mended small typo
caleb miles [Mon, 16 Jul 2012 23:30:36 +0000 (16:30 -0700)]
doc/radosgw/config.rst: mended small typo

Signed-off-by: caleb miles <caleb.miles@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Mon, 16 Jul 2012 23:13:55 +0000 (16:13 -0700)]
Merge branch 'next'

13 years agoMerge branch 'wip-mon-mkfs'
Sage Weil [Mon, 16 Jul 2012 23:15:33 +0000 (16:15 -0700)]
Merge branch 'wip-mon-mkfs'

Reviewed-by: Tommi Virtanen <tv@inktank.com>
13 years agomkcephfs: nicer empty directory check
Sage Weil [Mon, 16 Jul 2012 23:10:57 +0000 (16:10 -0700)]
mkcephfs: nicer empty directory check

From TV.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomkcephfs: error out if mon data directory is not empty
Sage Weil [Tue, 10 Jul 2012 01:16:44 +0000 (18:16 -0700)]
mkcephfs: error out if mon data directory is not empty

The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.

So, ensure that the directory is empty at mkfs time.  This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
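
A minimal sketch of the "refuse instead of rm -r" idea, in Python
rather than the shell that mkcephfs is written in. The function names
are hypothetical, and treating a missing directory as an error is an
assumption of this sketch, not something the commit states.

```python
import os
import tempfile


def mon_dir_is_empty(path):
    """True only if path is an existing directory with no entries."""
    return os.path.isdir(path) and not os.listdir(path)


def check_before_mkfs(path):
    # Refuse rather than delete: nothing is ever removed here.
    if not mon_dir_is_empty(path):
        raise RuntimeError("mon data directory %s is missing or not empty" % path)


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    check_before_mkfs(scratch)  # empty directory: accepted
    open(os.path.join(scratch, "stale"), "w").close()
    leftover_rejected = not mon_dir_is_empty(scratch)
```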

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agovstart.sh: blow away mon directory on creation/start
Sage Weil [Tue, 10 Jul 2012 01:17:54 +0000 (18:17 -0700)]
vstart.sh: blow away mon directory on creation/start

Now that ceph-mon doesn't blow away the mon data content, we need to.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: stop doing rm -rf on mon mkfs
Sage Weil [Tue, 10 Jul 2012 01:17:16 +0000 (18:17 -0700)]
mon: stop doing rm -rf on mon mkfs

Simply verify that the directory exists, or if it doesn't, create it.
Do nothing about its content.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: apply log_level to stderr/syslog logic
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic

In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold.  Otherwise
we get anything we gather on those channels, even when the log level is
low.
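
The non-crash condition can be sketched as follows. This is Python for
illustration, not the C++ in the log code; the function name is
hypothetical, and it assumes the usual convention that a numerically
lower priority is more important, so "below the threshold" means
prio <= level.

```python
def should_send_to_stderr(prio, log_level, stderr_level):
    """Non-crash case: the entry must pass both thresholds."""
    # Passing only the stderr/syslog threshold is not enough; the entry
    # must also be eligible under the normal log level.
    return prio <= stderr_level and prio <= log_level
```

With a low log level, an entry that was merely gathered no longer leaks
out to stderr or syslog.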

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: dump logging levels in crash dump
Sage Weil [Mon, 16 Jul 2012 22:40:03 +0000 (15:40 -0700)]
log: dump logging levels in crash dump

So you know what you are/are not seeing.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Mon, 16 Jul 2012 22:53:54 +0000 (15:53 -0700)]
Merge branch 'next'

13 years agoPG: grab reference to pg in C_OSD_AppliedRecoveredObject
Samuel Just [Mon, 16 Jul 2012 22:43:47 +0000 (15:43 -0700)]
PG: grab reference to pg in C_OSD_AppliedRecoveredObject

Otherwise, accessing the pg via _applied_recovered_object
isn't safe.  Using intrusive_ptr clarifies the reference
ownership.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agolog: fix event gather condition
Sage Weil [Mon, 16 Jul 2012 22:40:53 +0000 (15:40 -0700)]
log: fix event gather condition

We should gather an event if it is below the log or gather threshold.

Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
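
A sketch of the corrected condition, under the same assumption that a
lower number means higher priority. The name should_gather is
hypothetical and this is not the actual log code.

```python
def should_gather(prio, log_level, gather_level):
    """Keep the entry in memory if either threshold admits it."""
    # Gathering is deliberately broader than printing, so a crash dump
    # can contain entries that were never written to the log.
    return prio <= log_level or prio <= gather_level
```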

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoPG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log
Samuel Just [Mon, 16 Jul 2012 20:14:43 +0000 (13:14 -0700)]
PG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log

We adjust the info and the log, so we must set dirty_info and
dirty_log to force writes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: use stats from primary after rewinding divergent entries
Samuel Just [Mon, 16 Jul 2012 20:07:56 +0000 (13:07 -0700)]
PG: use stats from primary after rewinding divergent entries

If the osd receiving the info has divergent entries, it will
also have a "divergent" stat structure.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.

This is another way for the bug addressed in
5924f8e4a8c29e6de326a9e8576c30109cdc0e07 to happen.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge remote-tracking branch 'upstream/next'
Samuel Just [Mon, 16 Jul 2012 21:18:04 +0000 (14:18 -0700)]
Merge remote-tracking branch 'upstream/next'

13 years agoPG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub
Samuel Just [Mon, 16 Jul 2012 20:11:24 +0000 (13:11 -0700)]
PG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub

We need to reset the last_pg_scrub data in the osd since we
are replacing the info.

Probably fixes #2453

In cases like #2453, we hit the following backtrace:

     0> 2012-05-19 17:24:09.113684 7fe66be3d700 -1 osd/OSD.h: In function 'void OSD::unreg_last_pg_scrub(pg_t, utime_t)' thread 7fe66be3d700 time 2012-05-19 17:24:09.095719
osd/OSD.h: 840: FAILED assert(last_scrub_pg.count(p))

 ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b)
 1: (OSD::unreg_last_pg_scrub(pg_t, utime_t)+0x149) [0x641f49]
 2: (PG::proc_primary_info(ObjectStore::Transaction&, pg_info_t const&)+0x5e) [0x63383e]
 3: (PG::RecoveryState::ReplicaActive::react(PG::RecoveryState::MInfoRec const&)+0x4a) [0x633eda]
 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::MQuery>, boost::statechart::custom_reaction<PG::RecoveryState::MInfoRec>, boost::statechart::custom_reaction<PG::RecoveryState::MLogRec> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x130) [0x6466a0]
 5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x646791]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x63dfcb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x63e0f1]
 8: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x177) [0x616987]
 9: (OSD::handle_pg_info(std::tr1::shared_ptr<OpRequest>)+0x665) [0x5d3d15]
 10: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x2a0) [0x5d7370]
 11: (OSD::_dispatch(Message*)+0x191) [0x5dd4a1]
 12: (OSD::ms_dispatch(Message*)+0x153) [0x5ddda3]
 13: (SimpleMessenger::dispatch_entry()+0x863) [0x77fbc3]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x746c5d]
 15: (()+0x7efc) [0x7fe679b1fefc]
 16: (clone()+0x6d) [0x7fe67815089d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state, resulting in the
above assert failure.

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agodoc/dev/osd_internals: OSD overview, pg removal, map/message handling
Samuel Just [Wed, 11 Jul 2012 00:52:21 +0000 (17:52 -0700)]
doc/dev/osd_internals: OSD overview, pg removal, map/message handling

This is a start on some osd internals documentation for new
developers.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: Place info in biginfo object
Samuel Just [Fri, 22 Jun 2012 17:11:38 +0000 (10:11 -0700)]
PG: Place info in biginfo object

The purged_snaps set can grow without bound as snaps are
created and removed.  Because the filestore doesn't
provide unlimited size collection attributes, it's better
to place the full info on the biginfo object, since we
need to write it during write_info anyway.

Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: use write_info to set snap_collections in make_snap_collections
Samuel Just [Fri, 29 Jun 2012 20:39:49 +0000 (13:39 -0700)]
PG: use write_info to set snap_collections in make_snap_collections

At one point, snap_collections were written to a pg collection
attribute.  Subsequently, they were moved to the biginfo object
since the structure can grow too large for limited size xattrs.
make_snap_collection, however, was not updated.

Using write_info here should prevent this from happening in
the future.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoOSD: set superblock compat_features on boot and mkfs
Samuel Just [Fri, 13 Jul 2012 23:44:33 +0000 (16:44 -0700)]
OSD: set superblock compat_features on boot and mkfs

Previously, we did not actually persist the osd compatibility
mask.  Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoCompatSet: users pass bit indices rather than masks
Samuel Just [Fri, 13 Jul 2012 21:23:27 +0000 (14:23 -0700)]
CompatSet: users pass bit indices rather than masks

CompatSet users number the Feature objects rather than
providing masks.  Thus, we should do

mask |= (1 << f.id) rather than mask |= f.id.

In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.

This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.
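
The arithmetic behind that last remark can be checked with a short
sketch. This is Python rather than the C++ of CompatSet, and the two
function names are made up for the illustration.

```python
def buggy_mask(feature_ids):
    """Old behaviour: OR the ids themselves into the mask."""
    mask = 0
    for fid in feature_ids:
        mask |= fid
    return mask


def correct_mask(feature_ids):
    """Fixed behaviour: treat each id as a bit index."""
    mask = 0
    for fid in feature_ids:
        mask |= 1 << fid
    return mask
```

With the old form, features {1, 2} and {1, 2, 3} both give a mask of 3,
so feature 3 is invisible. With the shifted form they give 6 and 14.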

fixes: #2748

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoosd: based misdirected op role calc on acting set
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: based misdirected op role calc on acting set

We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon/MonitorStore: always O_TRUNC when writing states
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states

It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
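
The failure mode is easy to reproduce outside the monitor. The sketch
below is Python using os.open, not the MonitorStore code; write_state,
read_all and demo are hypothetical names, and "state.new" simply stands
in for a leftover .new file.

```python
import os
import tempfile


def write_state(path, data, truncate):
    """Write data at offset 0, optionally truncating an existing file."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def read_all(path):
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Return (contents without O_TRUNC, contents with O_TRUNC)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.new")
        # A longer value is left behind, e.g. by a crash before the rename.
        write_state(path, b"old-longer-value", truncate=True)
        write_state(path, b"new", truncate=False)
        stale = read_all(path)
        write_state(path, b"old-longer-value", truncate=True)
        write_state(path, b"new", truncate=True)
        clean = read_all(path)
    return stale, clean
```

Without O_TRUNC the shorter value is followed by the tail of the older
one; with O_TRUNC only the new value remains.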

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge remote-tracking branch 'gh/bugfix-2022'
Sage Weil [Mon, 16 Jul 2012 17:48:25 +0000 (10:48 -0700)]
Merge remote-tracking branch 'gh/bugfix-2022'

Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge remote-tracking branch 'gh/bugfix-2779'
Sage Weil [Mon, 16 Jul 2012 16:12:09 +0000 (09:12 -0700)]
Merge remote-tracking branch 'gh/bugfix-2779'

Reviewed-by: Greg Farnum <greg@inktank.com>
13 years agomon: remove osds from [near]full sets when their stats are removed from pgmap
Sage Weil [Mon, 16 Jul 2012 05:03:31 +0000 (22:03 -0700)]
mon: remove osds from [near]full sets when their stats are removed from pgmap

Greg points out that we could have a situation like:

 - mon recovers..
 - goes through osdmaps, notes an osd was removed and removes from
   full/nearfull
 - goes through pgmaps, and re-adds it when it encounters some osd_stat_ts.

Fix this by removing the osd from the full/nearfull set when we remove
the osd_stat_t from the pgmap.  Any osd removal is always followed by
an osd_stat_rm[] record when the primary processes the new osdmap and
proposes the appropriate pgmap updates.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon/MonitorStore: always O_TRUNC when writing states
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states

It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agofilestore: dump open fds when we hit EMFILE
Sage Weil [Sun, 15 Jul 2012 22:21:57 +0000 (15:21 -0700)]
filestore: dump open fds when we hit EMFILE

Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore.
Ideally, we should trigger this in other appropriate places, but it is
not immediately clear that there is a sane way to do that.
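
For illustration, a helper of this kind can be sketched in Python. The
filestore helper itself is C++ and is not shown in this log, so the
name list_open_fds and its return shape are assumptions. It relies on
the Linux /proc/self/fd directory and returns an empty result where
that does not exist.

```python
import os


def list_open_fds(fd_dir="/proc/self/fd"):
    """Map each open fd number to the target of its /proc symlink."""
    fds = {}
    try:
        names = os.listdir(fd_dir)
    except OSError:
        # No procfs (or no permission): nothing to report.
        return fds
    for name in names:
        try:
            fds[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except (OSError, ValueError):
            # The fd may have been closed between listdir and readlink.
            continue
    return fds
```

Logging this mapping at the moment EMFILE is hit shows which files or
sockets are consuming the descriptor limit.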

Fixes: #2330
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosdmap: drop useless and unused get_pg_role() method
Sage Weil [Sat, 14 Jul 2012 21:32:28 +0000 (14:32 -0700)]
osdmap: drop useless and unused get_pg_role() method

Users probably want get_pg_acting_rank().  If they don't, they probably
have the mapping and can calculate the rank themselves.  Having this here
is asking for bugs like #2022.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: based misdirected op role calc on acting set
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: based misdirected op role calc on acting set

We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: simplify helper usage for misdirected ops
Sage Weil [Sat, 14 Jul 2012 21:29:29 +0000 (14:29 -0700)]
osd: simplify helper usage for misdirected ops

Make the helper exclusively for the PG != NULL cases, and open-code the
one PG == NULL caller.  This is simpler, and lets us include more useful
information in the log message.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agovstart: use absolute path for keyring
Noah Watkins [Sat, 14 Jul 2012 00:29:56 +0000 (17:29 -0700)]
vstart: use absolute path for keyring

Stores absolute path to the generated keyring so that tests running in
other directories (e.g. src/java/test) can simply reference the
generated ceph.conf.

Signed-off-by: Noah Watkins <jawhawk@cs.ucsc.edu>
13 years agoOSD: add config options to fake missed pings
Samuel Just [Fri, 13 Jul 2012 20:45:24 +0000 (13:45 -0700)]
OSD: add config options to fake missed pings

In order to test monitor and osd failure detection and false
positive correction, this patch adds the following options:

 1. osd_debug_drop_ping_probability: probability of dropping
    a string of pings from a client upon ping receipt.
 2. osd_debug_drop_ping_duration: number of pings to drop in
    a row.

This should help with replicating some wrongly-marked-down
thrashing cases.
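
How the two options interact can be sketched as below. This is Python,
not the OSD code; the class PingDropper and the injectable rng are
hypothetical, and counting the triggering ping as the first of the
dropped string is an assumption of this sketch.

```python
import random


class PingDropper:
    """Decide, per received ping, whether to pretend it was missed."""

    def __init__(self, probability, duration, rng=random.random):
        self.probability = probability  # cf. osd_debug_drop_ping_probability
        self.duration = duration        # cf. osd_debug_drop_ping_duration
        self.rng = rng                  # injectable for deterministic tests
        self.remaining = 0              # pings still to drop in the current string

    def should_drop(self):
        if self.remaining > 0:
            # Already inside a string of dropped pings.
            self.remaining -= 1
            return True
        if self.duration > 0 and self.rng() < self.probability:
            # Start a new string; this ping counts as its first member.
            self.remaining = self.duration - 1
            return True
        return False
```

Dropping several pings in a row, rather than isolated ones, is what
lets a peer's heartbeat grace period actually expire in a test.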

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agocrushtool: allow information generated during testing to be dumped
caleb miles [Fri, 6 Jul 2012 00:30:01 +0000 (17:30 -0700)]
crushtool: allow information generated during testing to be dumped
to a set of CSV files for off-line analysis.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
13 years agodoc: remove last reference to ceph-cookbooks.
John Wilkins [Fri, 13 Jul 2012 21:16:08 +0000 (14:16 -0700)]
doc: remove last reference to ceph-cookbooks.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.'
John Wilkins [Fri, 13 Jul 2012 21:08:41 +0000 (14:08 -0700)]
doc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.'

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoqa: download tests from specified branch
Josh Durgin [Fri, 13 Jul 2012 16:42:20 +0000 (09:42 -0700)]
qa: download tests from specified branch

These python tests aren't installed, so they need to be downloaded

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoOSD: send_still_alive when we get a reply if we reported failure
Samuel Just [Fri, 13 Jul 2012 16:20:02 +0000 (09:20 -0700)]
OSD: send_still_alive when we get a reply if we reported failure

When we get a ping reply, remove the peer from the failure_queue
and send a still alive message if the peer is in the failure_pending
map.

Otherwise, the monitor could slowly accumulate sporadic failure reports
leading to an osd being incorrectly marked out.

This bug may have been contributing to the wrongly-marked-down
thrashing observed on some systems.
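
The reply handling described above can be sketched as follows. This is
Python for illustration only; the FailureTracker class and the
send_still_alive callback are hypothetical, and failure_pending is
modelled as a set here although the commit calls it a map.

```python
class FailureTracker:
    """Track peers we are about to report, or have reported, as failed."""

    def __init__(self, send_still_alive):
        self.failure_queue = set()     # failures not yet sent to the monitor
        self.failure_pending = set()   # failures already reported
        self.send_still_alive = send_still_alive

    def handle_ping_reply(self, peer):
        # The peer answered, so any queued report is obsolete.
        self.failure_queue.discard(peer)
        if peer in self.failure_pending:
            # We already told the monitor it failed: retract that report.
            self.send_still_alive(peer)
            self.failure_pending.discard(peer)
```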

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoPG: merge_log always use stats from authoritative replica
Samuel Just [Fri, 13 Jul 2012 00:19:43 +0000 (17:19 -0700)]
PG: merge_log always use stats from authoritative replica

If the osd receiving the log has divergent entries, it will
also have a "divergent" stat structure.  In general, it suffices
to simply trust the stat structure shipped with the authoritative
log and info since merge_log is only used to merge an authoritative
log.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.  It turned up
in a regression suite run as a scrub stat mismatch.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoqa: download tests from specified branch
Josh Durgin [Fri, 13 Jul 2012 16:42:20 +0000 (09:42 -0700)]
qa: download tests from specified branch

These python tests aren't installed, so they need to be downloaded

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agomon: use single helper for [near]full sets
Sage Weil [Fri, 13 Jul 2012 14:27:36 +0000 (07:27 -0700)]
mon: use single helper for [near]full sets

Use a single helper to add/remove osds from the [near]full sets.  This
keeps the logic in a single place, and simplifies the code somewhat.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: purge removed osds from [near]full sets
Sage Weil [Fri, 13 Jul 2012 13:10:07 +0000 (06:10 -0700)]
mon: purge removed osds from [near]full sets

The [near]full sets are volatile state.  Remove removed (or created)
osds from the set when we process a map.

Fixes: #2779
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoReplicatedPG: don't mark repop done until apply completes
Samuel Just [Thu, 12 Jul 2012 23:45:26 +0000 (16:45 -0700)]
ReplicatedPG: don't mark repop done until apply completes

Consider the following sequence:
1. issue, apply repop
2. replicas and primary commit
  Here, repop->waitfor_(ack|disk) are empty, so we mark
  repop->done and remove_repop.
3. interval change, repops still in queue are marked aborted
4. activate, last_update_applied = last_update
5. the repop from step 1 enters apply_repop, is not aborted,
   and finds that last_update_applied has passed it by.

Fixes #2749

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agotest_librbd: fix warnings
Sage Weil [Thu, 12 Jul 2012 23:14:33 +0000 (16:14 -0700)]
test_librbd: fix warnings

test/test_librbd.cc: In member function ‘virtual void LibRBD_TestClone_Test::TestBody()’:
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat]
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat]
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘int64_t {aka long long int}’ [-Wformat]

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoReplicatedPG,PG: dump recovery/backfill state on pg query
Samuel Just [Fri, 6 Jul 2012 21:38:57 +0000 (14:38 -0700)]
ReplicatedPG,PG: dump recovery/backfill state on pg query

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMerge remote-tracking branch 'gh/wip-2101'
Sage Weil [Thu, 12 Jul 2012 20:11:33 +0000 (13:11 -0700)]
Merge remote-tracking branch 'gh/wip-2101'

13 years agorbd: enable layering when using the new format
Josh Durgin [Thu, 12 Jul 2012 18:13:47 +0000 (11:13 -0700)]
rbd: enable layering when using the new format

We'll add options for different features later.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agodoc: reverted file and role names.
John Wilkins [Thu, 12 Jul 2012 18:46:43 +0000 (11:46 -0700)]
doc: reverted file and role names.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agoupstart: Make ceph-osd always set the crush location.
Tommi Virtanen [Thu, 12 Jul 2012 17:47:29 +0000 (10:47 -0700)]
upstart: Make ceph-osd always set the crush location.

This used to be conditional on config having osd_crush_location set,
but with that, minimal configuration left the OSD completely out of
the crush map, and prevented the OSD from starting properly.

Note: Ceph does not currently let this mechanism automatically move
hosts to another location in the CRUSH hierarchy. This means if you
let this run with defaults, setting osd_crush_location later will not
take effect. Set up your config file (or Chef environment) fully
before starting the OSDs the first time.

Signed-off-by: Tommi Virtanen <tv@inktank.com>
13 years agodoc: fix config metavariables discussion
Sage Weil [Thu, 12 Jul 2012 16:57:31 +0000 (09:57 -0700)]
doc: fix config metavariables discussion

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodoc: perf counters
Sage Weil [Thu, 12 Jul 2012 16:51:24 +0000 (09:51 -0700)]
doc: perf counters

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agorgw: don't override subuser perm mask if perm not specified
Yehuda Sadeh [Mon, 25 Jun 2012 16:47:37 +0000 (09:47 -0700)]
rgw: don't override subuser perm mask if perm not specified

Bug #2650. We were overriding subuser perm mask whenever subuser
was modified, even if perm mask was not passed.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agodoc: added :: to code example.
John Wilkins [Thu, 12 Jul 2012 16:00:19 +0000 (09:00 -0700)]
doc: added :: to code example.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: minor edits.
John Wilkins [Thu, 12 Jul 2012 15:55:15 +0000 (08:55 -0700)]
doc: minor edits.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: cookbook name change broke some things in doc. Fixed.
John Wilkins [Thu, 12 Jul 2012 15:47:47 +0000 (08:47 -0700)]
doc: cookbook name change broke some things in doc. Fixed.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodebian: fix ceph-fs-common-dbg depends
James Page [Wed, 11 Jul 2012 18:34:21 +0000 (11:34 -0700)]
debian: fix ceph-fs-common-dbg depends

Signed-off-by: James Page <james.page@ubuntu.com>
13 years agorados tool: bulk objects removal
Yehuda Sadeh [Wed, 11 Jul 2012 18:34:21 +0000 (11:34 -0700)]
rados tool: bulk objects removal

Issue #2776. Allow the removal of multiple objects in a single
rados tool command:

  # rados -p pool rm obj1 [obj2 [...]]

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
13 years agoMerge remote-tracking branch 'gh/wip-cct'
Sage Weil [Thu, 12 Jul 2012 02:59:32 +0000 (19:59 -0700)]
Merge remote-tracking branch 'gh/wip-cct'

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
13 years agoMerge branch 'next'
Sage Weil [Thu, 12 Jul 2012 01:56:00 +0000 (18:56 -0700)]
Merge branch 'next'

Conflicts:
src/rados.cc

13 years agorados: more usage cleanup
Sage Weil [Thu, 12 Jul 2012 01:54:30 +0000 (18:54 -0700)]
rados: more usage cleanup

Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago rados: usage message
Dan Mick [Wed, 11 Jul 2012 22:26:30 +0000 (15:26 -0700)]
rados: usage message
    Bad linebreaks, wrapping, stringification, missing doc for bench args

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agodoc: changed role file names as part of update to roles.
John Wilkins [Thu, 12 Jul 2012 00:35:38 +0000 (17:35 -0700)]
doc: changed role file names as part of update to roles.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agodoc: added DHO config.
John Wilkins [Thu, 12 Jul 2012 00:35:01 +0000 (17:35 -0700)]
doc: added DHO config.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years agorados tool: remove -t param option for target pool
Yehuda Sadeh [Wed, 11 Jul 2012 18:52:24 +0000 (11:52 -0700)]
rados tool: remove -t param option for target pool

Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.

Backport: argonaut

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agocrush: sum and check quantized weights for bucket
Sage Weil [Wed, 11 Jul 2012 23:36:47 +0000 (16:36 -0700)]
crush: sum and check quantized weights for bucket

Sum the quantized weights for each bucket, and check that for overflow.

This could change the results of a compile marginally if the map is using
non-divisible weight values that quantize funny.  The old code might
calculate a bucket sum that is not the actual sum of the quantized weights.
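
Why the two sums can differ is shown in the sketch below. It is Python,
not the crush compiler, and it rests on assumptions: weights are taken
to be quantized to 16.16 fixed point (0x10000 == 1.0) by truncation,
and the overflow check is taken to be against an unsigned 32-bit limit.
The names quantize and bucket_weight are hypothetical.

```python
FIXED_ONE = 0x10000     # assumed fixed-point scale: 0x10000 represents 1.0
MAX_U32 = 0xFFFFFFFF    # assumed limit for the summed bucket weight


def quantize(weight):
    """Convert a floating-point weight to fixed point by truncation."""
    return int(weight * FIXED_ONE)


def bucket_weight(item_weights):
    """Sum the already-quantized item weights and check for overflow."""
    total = sum(quantize(w) for w in item_weights)
    if total > MAX_U32:
        raise OverflowError("bucket weight %d does not fit in 32 bits" % total)
    return total
```

For three items of weight 0.1, summing after quantizing does not give
the same value as quantizing the floating-point sum, which is the kind
of marginal difference the commit message warns about.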

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agocrush: Set maximum device/bucket weights.
caleb miles [Wed, 11 Jul 2012 23:03:44 +0000 (16:03 -0700)]
crush: Set maximum device/bucket weights.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
13 years agocrush: prevent integer overflow on reweight
caleb miles [Wed, 11 Jul 2012 23:03:13 +0000 (16:03 -0700)]
crush: prevent integer overflow on reweight

Disallow setting OSD weights to a value over 10,000 and cap bucket weight
at 10,000,000 in a CRUSH map. Addresses issue #2101.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
13 years ago rados: usage message
Dan Mick [Wed, 11 Jul 2012 22:26:30 +0000 (15:26 -0700)]
rados: usage message
    Bad linebreaks, wrapping, stringification, missing doc for bench args

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
13 years agoMakefile: don't install crush headers
Sage Weil [Wed, 11 Jul 2012 16:19:00 +0000 (09:19 -0700)]
Makefile: don't install crush headers

This is leftover from when we built a libcrush.so.  We can re-add when we
start doing that again.

Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolibrados: simplify cct refcounting
Sage Weil [Wed, 11 Jul 2012 16:04:50 +0000 (09:04 -0700)]
librados: simplify cct refcounting

get() in ctor, put() in dtor.

Signed-off-by: Sage Weil <sage@inktank.com>
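
The "get() in ctor, put() in dtor" pattern from the commit above ties one reference to the lifetime of the owning object. A rough Python analogue is sketched below; `Context` and `Client` are hypothetical stand-ins rather than the librados classes, and an explicit `close()` plays the role of the destructor.

```python
class Context:
    """Hypothetical stand-in for a reference-counted context (cct)."""

    def __init__(self):
        self.refs = 1              # the creator's own reference

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1


class Client:
    """Holds exactly one reference for as long as it is open."""

    def __init__(self, cct):
        self.cct = cct.get()       # take the reference on construction

    def close(self):
        if self.cct is not None:
            self.cct.put()         # drop it on teardown, exactly once
            self.cct = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


cct = Context()
with Client(cct) as client:
    refs_inside = cct.refs         # creator + client
refs_after = cct.refs              # back to the creator alone
```

Keeping the get and put in one place each means no other code path has to remember to balance the count.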
13 years ago lockdep: stop lockdep when its cct goes away
Sage Weil [Wed, 11 Jul 2012 15:58:22 +0000 (08:58 -0700)]
lockdep: stop lockdep when its cct goes away

When a cct is destroyed, tell lockdep so that it can shut down if it was
using that cct.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago mon: simplify logmonitor check_subs; less noise
Sage Weil [Tue, 10 Jul 2012 00:24:19 +0000 (17:24 -0700)]
mon: simplify logmonitor check_subs; less noise

 * simple helper to translate name to id
 * verify sub type is valid in caller
 * assert sub type is valid in method
 * simplify iterator usage

Among other things, this gets rid of this noise in the logs:

2012-07-10 20:51:42.617152 7facb23f1700  1 mon.a@1(peon).log v310 check_sub sub monmap not log type

Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago Merge branch 'stable' into next
Sage Weil [Wed, 11 Jul 2012 01:21:29 +0000 (18:21 -0700)]
Merge branch 'stable' into next

13 years ago osd: guard class call decoding
Sage Weil [Mon, 9 Jul 2012 20:22:42 +0000 (13:22 -0700)]
osd: guard class call decoding

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago test_stress_watch: just one librados instance
Sage Weil [Tue, 10 Jul 2012 03:54:19 +0000 (20:54 -0700)]
test_stress_watch: just one librados instance

This was creating a new cluster connection/session per iteration, and
along with it a few service threads and sockets and so forth.

Unfortunately, librados leaks like a sieve, starting with CephContext
and ceph::crypto::init().  See #845 and #2067.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago ReplicatedPG: don't warn if backfill peer stats don't match
Samuel Just [Tue, 10 Jul 2012 00:57:03 +0000 (17:57 -0700)]
ReplicatedPG: don't warn if backfill peer stats don't match

pinfo.stats might be wrong if we did log-based recovery on the
backfilled portion in addition to continuing backfill.

bug #2750

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years ago librados: take lock when signaling notify cond
Sage Weil [Fri, 6 Jul 2012 01:08:58 +0000 (18:08 -0700)]
librados: take lock when signaling notify cond

When we are signaling the cond to indicate that a notify is complete,
take the appropriate lock.  This removes the possibility of a race
that loses our signal.  (That would be very difficult given that there
are network round trips involved, but this makes the lock/cond usage
"correct.")

Signed-off-by: Sage Weil <sage@inktank.com>
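
The lock/cond discipline described in the commit above (change the state and signal while holding the lock, so a waiter cannot miss the wakeup between its check and its wait) can be sketched in Python. The `Notify` class is a hypothetical illustration, not the librados code; Python's `threading.Condition` enforces the rule by raising `RuntimeError` if `notify` is called without the lock held.

```python
import threading


class Notify:
    """Hypothetical completion flag guarded by a condition variable."""

    def __init__(self):
        self.cond = threading.Condition()   # owns its own lock
        self.done = False

    def wait(self, timeout=None):
        with self.cond:
            # wait_for re-checks the predicate under the lock, so a signal
            # sent before we start waiting is not lost.
            return self.cond.wait_for(lambda: self.done, timeout)

    def complete(self):
        with self.cond:                     # hold the lock while signaling
            self.done = True
            self.cond.notify_all()


notify = Notify()
results = []
waiter = threading.Thread(target=lambda: results.append(notify.wait(5)))
waiter.start()
notify.complete()
waiter.join(timeout=5)
```

Because the flag is only read and written under the lock, the waiter either sees `done` already set or is already waiting when the signal arrives.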
13 years ago client: fix locking for SafeCond users
Sage Weil [Wed, 4 Jul 2012 22:11:21 +0000 (15:11 -0700)]
client: fix locking for SafeCond users

Need to wait on flock, not client_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago doc: No ssh -t -t, forcing a pty allocation there makes it hang.
Tommi Virtanen [Tue, 10 Jul 2012 23:11:33 +0000 (16:11 -0700)]
doc: No ssh -t -t, forcing a pty allocation there makes it hang.

Earlier, this was a single -t, and that is overridden by the fact that
stdin is not a tty, so that did nothing.

Signed-off-by: Tommi Virtanen <tv@inktank.com>
13 years ago doc: removed the ceph directory per tommi's update to the chef-cookbooks.
John Wilkins [Tue, 10 Jul 2012 23:03:05 +0000 (16:03 -0700)]
doc: removed the ceph directory per tommi's update to the chef-cookbooks.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years ago doc: Adding apt update message. VM users didn't get the package otherwise.
John Wilkins [Tue, 10 Jul 2012 22:23:56 +0000 (15:23 -0700)]
doc: Adding apt update message. VM users didn't get the package otherwise.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years ago Merge branch 'wip-rbd-clone-dmick' into master
Dan Mick [Tue, 10 Jul 2012 21:04:59 +0000 (14:04 -0700)]
Merge branch 'wip-rbd-clone-dmick' into master

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years ago osd: guard class call decoding
Sage Weil [Mon, 9 Jul 2012 20:22:42 +0000 (13:22 -0700)]
osd: guard class call decoding

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago rbd: update manpage for clone command
Dan Mick [Tue, 10 Jul 2012 20:09:14 +0000 (13:09 -0700)]
rbd: update manpage for clone command

Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago rbd: update cli test reference files
Dan Mick [Tue, 10 Jul 2012 19:51:26 +0000 (12:51 -0700)]
rbd: update cli test reference files

Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago librados: pool_get_name handles "not found" wrong
Dan Mick [Tue, 10 Jul 2012 03:11:21 +0000 (20:11 -0700)]
librados: pool_get_name handles "not found" wrong

Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago rbd, librbd: add tests for cloning
Dan Mick [Mon, 9 Jul 2012 22:43:36 +0000 (15:43 -0700)]
rbd, librbd: add tests for cloning

Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago librbd, rbd, rbd.py: Add parent info reporting
Dan Mick [Mon, 9 Jul 2012 22:05:38 +0000 (15:05 -0700)]
librbd, rbd, rbd.py: Add parent info reporting

Split out new parent info into separate retrieval methods;
structure packing on rbd_image_info_t was becoming a problem.
Deprecate old parent fields in favor of new ones.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago rbd, librbd, rbd.py: cloning (copy-on-write child image of snapshot)
Dan Mick [Mon, 9 Jul 2012 21:55:35 +0000 (14:55 -0700)]
rbd, librbd, rbd.py: cloning (copy-on-write child image of snapshot)

Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago librbd: open_image snapshot handling
Dan Mick [Mon, 9 Jul 2012 21:42:57 +0000 (14:42 -0700)]
librbd: open_image snapshot handling

Allow opening with no snap, but check for error for nonexistent snap

Backport: argonaut
Signed-off-by: Dan Mick <dan.mick@inktank.com>
13 years ago librados: Add mapping from pool id to pool name and ioctx to rados client
Josh Durgin [Tue, 26 Jun 2012 15:58:15 +0000 (08:58 -0700)]
librados: Add mapping from pool id to pool name and ioctx to rados client

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years ago librados: add new constructor to form a Rados object from IoCtx
Dan Mick [Mon, 9 Jul 2012 21:11:23 +0000 (14:11 -0700)]
librados: add new constructor to form a Rados object from IoCtx

This creates a separate reference to an existing connection, for
use when a client holding an IoCtx needs to consult another (say,
for rbd cloning).

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years ago test_stress_watch: just one librados instance
Sage Weil [Tue, 10 Jul 2012 03:54:19 +0000 (20:54 -0700)]
test_stress_watch: just one librados instance

This was creating a new cluster connection/session per iteration, and
along with it a few service threads and sockets and so forth.

Unfortunately, librados leaks like a sieve, starting with CephContext
and ceph::crypto::init().  See #845 and #2067.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years ago doc: added cookbook path instruction.
John Wilkins [Tue, 10 Jul 2012 18:04:31 +0000 (11:04 -0700)]
doc: added cookbook path instruction.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years ago doc: Added some pre-clarification for gdisk. Added DHO OSD hardware config.
John Wilkins [Tue, 10 Jul 2012 15:14:42 +0000 (08:14 -0700)]
doc: Added some pre-clarification for gdisk. Added DHO OSD hardware config.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
13 years ago CephContext: don't leak admin socket
Sage Weil [Tue, 10 Jul 2012 04:36:25 +0000 (21:36 -0700)]
CephContext: don't leak admin socket

Signed-off-by: Sage Weil <sage@inktank.com>