git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
13 years agoosd: peering: make Incomplete a Peering substate
Sage Weil [Fri, 27 Jul 2012 23:03:26 +0000 (16:03 -0700)]
osd: peering: make Incomplete a Peering substate

This allows us to still catch changes in the prior set that would affect
our conclusions (that we are incomplete) and, when they happen, restart
peering.

Consider:
 - calc prior set, osd A is down
 - query everyone else, no good info
 - set down, go to Incomplete (previously WaitActingChange) state.
 - osd A comes back up (we do nothing)
 - osd A sends notify message with good info (we ignore)

By making this a Peering substate, we catch the Peering AdvMap reaction,
which will notice a prior set down osd is now up and move to Reset.

Fixes: #2860
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: peering: move to Incomplete when.. incomplete
Sage Weil [Fri, 27 Jul 2012 22:39:40 +0000 (15:39 -0700)]
osd: peering: move to Incomplete when.. incomplete

PG::choose_acting() may return false and *not* request an acting set change
if it can't find any suitable peers with enough info to recover.  In that
case, we should move to Incomplete, not WaitActingChange, just like we do
a bit lower in GetLog() if we have non-contiguous logs.  The state name is
more accurate, and this is also needed to fix bug #2860.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge remote-tracking branch 'gh/stable' into stable-next
Sage Weil [Fri, 27 Jul 2012 21:00:52 +0000 (14:00 -0700)]
Merge remote-tracking branch 'gh/stable' into stable-next

13 years agoosd: fixing sharing of past_intervals on backfill restart
Sage Weil [Thu, 26 Jul 2012 23:35:00 +0000 (16:35 -0700)]
osd: fixing sharing of past_intervals on backfill restart

We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne().  However, the backfill code was updating the
peer info (history) in the block preceding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.

Fix by checking dne() prior to the backfill block.  We still need to fill
in the message later because it isn't yet instantiated.

Fixes: #2849
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoMerge remote-tracking branch 'gh/wip-rbd-bid' into stable-next
Sage Weil [Thu, 26 Jul 2012 22:04:12 +0000 (15:04 -0700)]
Merge remote-tracking branch 'gh/wip-rbd-bid' into stable-next

13 years agomon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS
Sage Weil [Mon, 23 Jul 2012 17:47:10 +0000 (10:47 -0700)]
mon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS

This ensures that when a new osd reclaims that id it behaves as if it were
really new.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agotest_stress_watch: just one librados instance
Sage Weil [Tue, 10 Jul 2012 03:54:19 +0000 (20:54 -0700)]
test_stress_watch: just one librados instance

This was creating a new cluster connection/session per iteration, and
along with it a few service threads and sockets and so forth.

Unfortunately, librados leaks like a sieve, starting with CephContext
and ceph::crypto::init().  See #845 and #2067.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoMerge commit '35b13266923f8095650f45562d66372e618c8824' into stable-next
Sage Weil [Thu, 26 Jul 2012 22:03:50 +0000 (15:03 -0700)]
Merge commit '35b13266923f8095650f45562d66372e618c8824' into stable-next

First batch of msgr fixes.

13 years agoReplicatedPG: fix replay op ordering
Samuel Just [Mon, 9 Jul 2012 22:53:31 +0000 (15:53 -0700)]
ReplicatedPG: fix replay op ordering

After a client reconnect, the client replays outstanding ops.  The
OSD then immediately responds with success if the op has already
committed (version < ReplicatedPG::get_first_in_progress).
Otherwise, we stick it in waiting_for_ondisk to be replied to when
eval_repop concludes that waitfor_disk is empty.

Fixes #2508

Signed-off-by: Samuel Just <sam.just@inktank.com>
Conflicts:

src/osd/ReplicatedPG.cc

13 years agoobjecter: always resend linger registrations
Sage Weil [Wed, 18 Jul 2012 19:55:35 +0000 (12:55 -0700)]
objecter: always resend linger registrations

If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request.  The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch.  This in turn will break the watch (i.e., notifies won't
get delivered).

Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.

 * track the tid of the registration op for each LingerOp
 * mark registration ops as should_resend=false; cancel as needed
 * when we send a new registration op, cancel the old one to ensure we
   ignore the reply.  This is needed because we resend linger ops on any
   pg change, not just a primary change.
 * drop the first_send arg to send_linger(), as we can now infer that
   from register_tid == 0.

The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.

Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoosd: guard class call decoding
Sage Weil [Mon, 9 Jul 2012 20:22:42 +0000 (13:22 -0700)]
osd: guard class call decoding

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolibrados: take lock when signaling notify cond
Sage Weil [Fri, 6 Jul 2012 01:08:58 +0000 (18:08 -0700)]
librados: take lock when signaling notify cond

When we are signaling the cond to indicate that a notify is complete,
take the appropriate lock.  This removes the possibility of a race
that loses our signal.  (That would be very difficult given that there
are network round trips involved, but this makes the lock/cond usage
"correct.")

Signed-off-by: Sage Weil <sage@inktank.com>
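
The lock/cond discipline described in the librados commit above can be sketched as follows. This is a minimal Python illustration, not the librados code; the class and attribute names (NotifyWaiter, done) are hypothetical. The flag is set and the condition signalled while the lock is held, so a waiter cannot test the flag and then miss the wakeup.

```python
import threading


class NotifyWaiter:
    """Hypothetical stand-in for a caller waiting on a notify completion."""

    def __init__(self):
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.done = False

    def complete(self):
        # Set the flag and signal while holding the lock, so the waiter
        # cannot check `done` and then miss the wakeup.
        with self.lock:
            self.done = True
            self.cond.notify_all()

    def wait(self, timeout=None):
        with self.lock:
            # wait_for() re-checks the predicate under the lock each time
            # it wakes up, and returns the predicate's final value.
            return self.cond.wait_for(lambda: self.done, timeout)
```

A completion thread would call complete() and the caller would block in wait(); because both sides touch `done` only under the lock, the ordering of the two calls does not matter.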
13 years agoworkqueue: kick -> wake or _wake, depending on locking
Sage Weil [Sun, 22 Jul 2012 14:46:11 +0000 (07:46 -0700)]
workqueue: kick -> wake or _wake, depending on locking

Break kick() into wake() and _wake() methods, depending on whether the
lock is already held.  (The rename ensures that we audit/fix all
callers.)

Signed-off-by: Sage Weil <sage@inktank.com>
Conflicts:

src/common/WorkQueue.h
src/osd/OSD.cc

13 years agoclient: fix locking for SafeCond users
Sage Weil [Wed, 4 Jul 2012 22:11:21 +0000 (15:11 -0700)]
client: fix locking for SafeCond users

Need to wait on flock, not client_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agofilestore: check for EIO in read path
Sage Weil [Thu, 26 Jul 2012 22:01:05 +0000 (15:01 -0700)]
filestore: check for EIO in read path

Check for EIO in read methods and helpers.  Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.

The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agofilestore: add 'filestore fail eio' option, default true
Sage Weil [Thu, 26 Jul 2012 16:07:46 +0000 (09:07 -0700)]
filestore: add 'filestore fail eio' option, default true

By default we will assert/fail/crash on EIO from the underlying fs.  We
already do this in the write path, but not the read path, or in various
internal infrastructure.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoconfig: fix 'config set' admin socket command
Sage Weil [Tue, 24 Jul 2012 20:53:03 +0000 (13:53 -0700)]
config: fix 'config set' admin socket command

Fixes: #2832
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: break potentially large transaction into pieces
Sage Weil [Wed, 25 Jul 2012 23:35:09 +0000 (16:35 -0700)]
osd: break potentially large transaction into pieces

We do a similar trick elsewhere.  Control this via a tunable.  Eventually
we'll control the others (in a non-stable branch).

Signed-off-by: Sage Weil <sage@inktank.com>
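
As a rough illustration of breaking one large transaction into pieces, the sketch below splits a list of operations into bounded batches. The function name and the max_ops_per_txn parameter are hypothetical; the commit message does not name the tunable or say how the OSD performs the split.

```python
def split_into_batches(ops, max_ops_per_txn):
    """Split `ops` into consecutive batches of at most `max_ops_per_txn` items.

    `max_ops_per_txn` stands in for the tunable mentioned in the commit
    message; its real name and default are not given there.
    """
    if max_ops_per_txn < 1:
        raise ValueError("max_ops_per_txn must be at least 1")
    return [ops[i:i + max_ops_per_txn]
            for i in range(0, len(ops), max_ops_per_txn)]
```

Each batch could then be applied as its own transaction, so no single transaction grows without bound.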
13 years agoosd: only commit past intervals at end of parallel build
Sage Weil [Wed, 25 Jul 2012 21:53:34 +0000 (14:53 -0700)]
osd: only commit past intervals at end of parallel build

We don't check for gaps in the past intervals, so we should only commit
this when we are completely done.  Otherwise a partial run and restart will
leave the gap in place, which may confuse the peering code that relies on
this information.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosd: generate past intervals in parallel on boot
Sage Weil [Wed, 25 Jul 2012 17:57:35 +0000 (10:57 -0700)]
osd: generate past intervals in parallel on boot

Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while.  This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.

On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range.  The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
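
The "calculate the union, then iterate over that map range" idea from the commit above can be sketched like this. It is an assumption-laden Python outline, not the OSD code: PG ids are plain strings, each PG's needed range is a half-open (start, end) epoch pair, and the visit callback stands in for the per-PG state kept in the pistate struct.

```python
def union_epoch_range(needed):
    """Return the smallest [start, end) range covering every PG's range."""
    ranges = [r for r in needed.values() if r[0] < r[1]]
    if not ranges:
        return None
    return min(s for s, _ in ranges), max(e for _, e in ranges)


def walk_maps_once(needed, visit):
    """Call visit(pg, epoch) for each epoch a PG needs, in one pass over the maps."""
    span = union_epoch_range(needed)
    if span is None:
        return
    for epoch in range(*span):
        # A real implementation would load the osdmap for `epoch` once here
        # and reuse it for every PG below.
        for pg, (start, end) in needed.items():
            if start <= epoch < end:
                visit(pg, epoch)
```

The point of the single outer loop is that each map epoch is touched once, however many PGs need it.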
13 years agoosd: move calculation of past_interval range into helper
Sage Weil [Wed, 25 Jul 2012 17:58:07 +0000 (10:58 -0700)]
osd: move calculation of past_interval range into helper

PG::generate_past_intervals() first calculates the range over which it
needs to generate past intervals.  Do this in a helper function.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoosd: fix map epoch boot condition
Sage Weil [Wed, 25 Jul 2012 17:58:28 +0000 (10:58 -0700)]
osd: fix map epoch boot condition

We only want to join the cluster if we can catch up to the latest
osdmap with a small number of maps, in this case a single map message.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agomon: ignore pgtemp messages from down osds
Sage Weil [Wed, 25 Jul 2012 03:18:01 +0000 (20:18 -0700)]
mon: ignore pgtemp messages from down osds

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: ignore osd_alive messages from down osds
Sage Weil [Wed, 25 Jul 2012 03:16:04 +0000 (20:16 -0700)]
mon: ignore osd_alive messages from down osds

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolibrbd: replace assign_bid with client id and random number
Josh Durgin [Mon, 23 Jul 2012 21:05:53 +0000 (14:05 -0700)]
librbd: replace assign_bid with client id and random number

The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.

This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.

Keep the server side assign_bid around in case there are old clients
still using it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
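
A client-side bid built from a global id and a random number might look like the sketch below. The hex concatenation and the 32-bit random range are assumptions made for illustration; the commit message only says the bid is based on the client's global id plus a random number and ends up in an opaque string.

```python
import random


def make_block_name_id(global_id, rng=random):
    """Build an opaque bid string from a client global id and a random number.

    The encoding (two hex fields concatenated) is hypothetical.
    """
    extra = rng.randrange(0, 2**32)
    return f"{global_id:x}{extra:x}"
```

Because the client chooses the value itself, creating an image no longer depends on a write operation that must also return data, which is what made replay unsafe.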
13 years agolibrados: add new constructor to form a Rados object from IoCtx
Dan Mick [Mon, 9 Jul 2012 21:11:23 +0000 (14:11 -0700)]
librados: add new constructor to form a Rados object from IoCtx

This creates a separate reference to an existing connection, for
use when a client holding IoCtx needs to consult another (say,
for rbd cloning)

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoadd CRUSH_TUNABLES feature bit
Sage Weil [Thu, 19 Jul 2012 02:49:58 +0000 (19:49 -0700)]
add CRUSH_TUNABLES feature bit

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoObjectCacher: fix cache_bytes_hit accounting
Josh Durgin [Wed, 18 Jul 2012 17:24:58 +0000 (10:24 -0700)]
ObjectCacher: fix cache_bytes_hit accounting

Misses are not hits!

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agoRobustify ceph-rbdnamer and adapt udev rules
Pascal de Bruijn | Unilogic Networks B.V [Wed, 11 Jul 2012 13:23:16 +0000 (15:23 +0200)]
Robustify ceph-rbdnamer and adapt udev rules

Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

  /dev/rbd/mypool/myrbd -> /dev/rbd1

While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

The old udev rule passes %n to the ceph-rbdnamer script.  The problem is
that %n yields 3 for rbd3 but 1 for rbd3p1, so it can't be depended upon
for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust and it
can now be called in various ways:

  /usr/bin/ceph-rbdnamer /dev/rbd3
  /usr/bin/ceph-rbdnamer /dev/rbd3p1
  /usr/bin/ceph-rbdnamer rbd3
  /usr/bin/ceph-rbdnamer rbd3p1
  /usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.

With that fixed, we hit the second problem. We ended up with:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:

  /dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

  /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1

Please let me know any feedback you have on this patch or the approach
used.

Regards,
Pascal de Bruijn
Unilogic B.V.

Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
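
The argument handling and symlink naming described in the ceph-rbdnamer commit above can be sketched as follows. This is a Python illustration only (the real script is a shell script and its implementation is not shown here); the function names and the regular expression are hypothetical.

```python
import re

# Accepts /dev/rbd3, /dev/rbd3p1, rbd3, rbd3p1, or a bare device number.
_DEV_RE = re.compile(r"^(?:/dev/)?(?:rbd)?(\d+)(?:p(\d+))?$")


def parse_rbd_device(arg):
    """Return (device_number, partition_number or None) for any accepted form."""
    m = _DEV_RE.match(arg)
    if m is None:
        raise ValueError(f"not an rbd device: {arg!r}")
    part = int(m.group(2)) if m.group(2) else None
    return int(m.group(1)), part


def symlink_name(pool, image, part):
    """Disks get /dev/rbd/<pool>/<image>; partitions get a -partN suffix."""
    base = f"/dev/rbd/{pool}/{image}"
    return base if part is None else f"{base}-part{part}"
```

Every calling style resolves to device 3, and giving partitions their own -partN link keeps them from overwriting the disk's symlink.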
13 years agolog: apply log_level to stderr/syslog logic
Sage Weil [Mon, 16 Jul 2012 23:02:14 +0000 (16:02 -0700)]
log: apply log_level to stderr/syslog logic

In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold.  Otherwise
we get anything we gather on those channels, even when the log level is
low.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolog: fix event gather condition
Sage Weil [Mon, 16 Jul 2012 22:40:53 +0000 (15:40 -0700)]
log: fix event gather condition

We should gather an event if it is below the log or gather threshold.

Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.

Signed-off-by: Sage Weil <sage@inktank.com>
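
The two log conditions (this commit and the log_level commit listed just before it) can be written as small predicates. This sketch assumes that a lower number means a more important message and that "below the threshold" is an inclusive comparison; the function and parameter names are hypothetical.

```python
def should_gather(level, log_level, gather_level):
    """Gather an event if it is within the log threshold or the gather threshold."""
    return level <= max(log_level, gather_level)


def should_write_stderr(level, log_level, stderr_level):
    """In non-crash situations, require both thresholds to be satisfied."""
    return level <= log_level and level <= stderr_level
```

With log_level=1 and gather_level=5, a level-3 event is gathered for a later dump but not printed, which is what makes the dump more useful than the regular log.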
13 years agoPG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub
Samuel Just [Mon, 16 Jul 2012 20:11:24 +0000 (13:11 -0700)]
PG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub

We need to reset the last_pg_scrub data in the osd since we
are replacing the info.

Probably fixes #2453

In cases like 2453, we hit the following backtrace:

     0> 2012-05-19 17:24:09.113684 7fe66be3d700 -1 osd/OSD.h: In function 'void OSD::unreg_last_pg_scrub(pg_t, utime_t)' thread 7fe66be3d700 time 2012-05-19 17:24:09.095719
osd/OSD.h: 840: FAILED assert(last_scrub_pg.count(p))

 ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b)
 1: (OSD::unreg_last_pg_scrub(pg_t, utime_t)+0x149) [0x641f49]
 2: (PG::proc_primary_info(ObjectStore::Transaction&, pg_info_t const&)+0x5e) [0x63383e]
 3: (PG::RecoveryState::ReplicaActive::react(PG::RecoveryState::MInfoRec const&)+0x4a) [0x633eda]
 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::MQuery>, boost::statechart::custom_reaction<PG::RecoveryState::MInfoRec>, boost::statechart::custom_reaction<PG::RecoveryState::MLogRec> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x130) [0x6466a0]
 5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x646791]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x63dfcb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x63e0f1]
 8: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x177) [0x616987]
 9: (OSD::handle_pg_info(std::tr1::shared_ptr<OpRequest>)+0x665) [0x5d3d15]
 10: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x2a0) [0x5d7370]
 11: (OSD::_dispatch(Message*)+0x191) [0x5dd4a1]
 12: (OSD::ms_dispatch(Message*)+0x153) [0x5ddda3]
 13: (SimpleMessenger::dispatch_entry()+0x863) [0x77fbc3]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x746c5d]
 15: (()+0x7efc) [0x7fe679b1fefc]
 16: (clone()+0x6d) [0x7fe67815089d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state resulting in the
above assert failure.

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoReplicatedPG: don't warn if backfill peer stats don't match
Samuel Just [Tue, 10 Jul 2012 00:57:03 +0000 (17:57 -0700)]
ReplicatedPG: don't warn if backfill peer stats don't match

pinfo.stats might be wrong if we did log-based recovery on the
backfilled portion in addition to continuing backfill.

bug #2750

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agomon/MonitorStore: always O_TRUNC when writing states
Sage Weil [Mon, 16 Jul 2012 03:30:34 +0000 (20:30 -0700)]
mon/MonitorStore: always O_TRUNC when writing states

It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
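
The stale-tail problem described above is easy to reproduce. The following Python sketch is not the MonitorStore code; put_value and its truncate flag are hypothetical, and it only mirrors the write-to-.new-then-rename pattern the commit message mentions.

```python
import os


def put_value(path, data, truncate=True):
    """Write `data` to `path` via a `.new` temporary file, then rename into place.

    With truncate=False, a leftover, longer `.new` file keeps its old tail
    after the new bytes, which is the bug the commit describes.
    """
    tmp = path + ".new"
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(tmp, flags, 0o644)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
```

If a crashed run left a 31-byte `.new` file behind, writing 3 bytes without O_TRUNC would produce a 31-byte value whose last 28 bytes are junk from the earlier proposal.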
13 years agoosd: based misdirected op role calc on acting set
Sage Weil [Sat, 14 Jul 2012 21:31:34 +0000 (14:31 -0700)]
osd: based misdirected op role calc on acting set

We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoqa: download tests from specified branch
Josh Durgin [Fri, 13 Jul 2012 16:42:20 +0000 (09:42 -0700)]
qa: download tests from specified branch

These python tests aren't installed, so they need to be downloaded

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
13 years agorgw: don't override subuser perm mask if perm not specified
Yehuda Sadeh [Mon, 25 Jun 2012 16:47:37 +0000 (09:47 -0700)]
rgw: don't override subuser perm mask if perm not specified

Bug #2650. We were overriding subuser perm mask whenever subuser
was modified, even if perm mask was not passed.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agodebian: fix ceph-fs-common-dbg depends
James Page [Wed, 11 Jul 2012 18:34:21 +0000 (11:34 -0700)]
debian: fix ceph-fs-common-dbg depends

Signed-off-by: James Page <james.page@ubuntu.com>
13 years agorados tool: remove -t param option for target pool
Yehuda Sadeh [Wed, 11 Jul 2012 18:52:24 +0000 (11:52 -0700)]
rados tool: remove -t param option for target pool

Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.

Backport: argonaut

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoMakefile: don't install crush headers
Sage Weil [Wed, 11 Jul 2012 16:19:00 +0000 (09:19 -0700)]
Makefile: don't install crush headers

This is leftover from when we built a libcrush.so.  We can re-add when we
start doing that again.

Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomsgr: take over existing Connection on Pipe replacement
Sage Weil [Tue, 10 Jul 2012 20:18:27 +0000 (13:18 -0700)]
msgr: take over existing Connection on Pipe replacement

If a new pipe/socket is taking over an existing session, it should also
take over the Connection* associated with the existing session.  Because
we cannot clear existing->connection_state, we just take another reference.

Clean up the comments a bit while we're here.

This affects MDS<->client sessions when reconnecting after a socket fault.
It probably also affects intra-cluster (osd/osd, mds/mds, mon/mon)
sessions as well, but I did not confirm that.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodebian: include librados-config in librados-dev
Sage Weil [Mon, 9 Jul 2012 03:33:12 +0000 (20:33 -0700)]
debian: include librados-config in librados-dev

Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agolockdep: increase max locks
Sage Weil [Tue, 3 Jul 2012 20:04:28 +0000 (13:04 -0700)]
lockdep: increase max locks

Hit this limit with the rados api tests.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoconfig: add unlocked version of get_my_sections; use it internally
Sage Weil [Tue, 3 Jul 2012 19:07:28 +0000 (12:07 -0700)]
config: add unlocked version of get_my_sections; use it internally

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoconfig: fix lock recursion in get_val_from_conf_file()
Sage Weil [Tue, 3 Jul 2012 15:20:06 +0000 (08:20 -0700)]
config: fix lock recursion in get_val_from_conf_file()

Introduce a private, already-locked version.

Signed-off-by: Sage Weil <sage@inktank.com>
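
The "private, already-locked version" pattern used in these config commits can be sketched in Python as below. The class and method names are hypothetical and only loosely echo the commit titles; the lock is deliberately non-recursive so that retaking it would deadlock.

```python
import threading


class Config:
    def __init__(self, values):
        self._lock = threading.Lock()  # non-recursive, like a plain mutex
        self._values = dict(values)

    def get_val(self, key):
        """Public entry point: takes the lock, then delegates."""
        with self._lock:
            return self._get_val(key)

    def _get_val(self, key):
        """Already-locked variant: the caller must hold self._lock."""
        return self._values.get(key)

    def get_vals(self, keys):
        """Another public method that needs several lookups under one lock."""
        with self._lock:
            # Calling self.get_val() here would try to retake the lock
            # and deadlock; the private variant avoids that.
            return [self._get_val(k) for k in keys]
```

Internal callers that already hold the lock use the underscore-prefixed method, and external callers use the public one.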
13 years agoconfig: fix recursive lock in parse_config_files()
Sage Weil [Tue, 3 Jul 2012 15:15:08 +0000 (08:15 -0700)]
config: fix recursive lock in parse_config_files()

The _impl() helper is only called from parse_config_files(); don't retake
the lock.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agorgw: initialize fields of RGWObjEnt
Sage Weil [Wed, 4 Jul 2012 01:51:02 +0000 (18:51 -0700)]
rgw: initialize fields of RGWObjEnt

This fixes various valgrind warnings triggered by the s3test
test_object_create_unreadable.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agorgw: handle response-* params
Yehuda Sadeh [Fri, 6 Jul 2012 20:14:53 +0000 (13:14 -0700)]
rgw: handle response-* params

Handle response-* params that set response header field values.
Fixes #2734, #2735.
Backport: argonaut

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoosd: add missing formatter close_section() to scrub status
Sage Weil [Wed, 4 Jul 2012 20:59:04 +0000 (13:59 -0700)]
osd: add missing formatter close_section() to scrub status

Also add braces to make the open/close matchups easier to see.  Broken
by f36617392710f9b3538bfd59d45fd72265993d57.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agopg: report scrub status
Mike Ryan [Wed, 27 Jun 2012 21:14:30 +0000 (14:14 -0700)]
pg: report scrub status

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
13 years agopg: track who we are waiting for maps from
Mike Ryan [Wed, 27 Jun 2012 20:30:45 +0000 (13:30 -0700)]
pg: track who we are waiting for maps from

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
13 years agopg: reduce scrub write lock window
Mike Ryan [Tue, 26 Jun 2012 23:25:27 +0000 (16:25 -0700)]
pg: reduce scrub write lock window

Wait for all replicas to construct the base scrub map before finalizing
the scrub and locking out writes.

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
13 years agorgw: don't store bucket info indexed by bucket_id
Yehuda Sadeh [Thu, 5 Jul 2012 22:52:51 +0000 (15:52 -0700)]
rgw: don't store bucket info indexed by bucket_id

Issue #2701. This info wasn't really used anywhere and we weren't
removing it. It was also sharing the same pool namespace as the
info indexed by bucket name, which is bad.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agotest_rados_tool.sh: test copy pool
Yehuda Sadeh [Thu, 5 Jul 2012 21:59:22 +0000 (14:59 -0700)]
test_rados_tool.sh: test copy pool

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agorados tool: copy object in chunks
Yehuda Sadeh [Thu, 5 Jul 2012 20:42:23 +0000 (13:42 -0700)]
rados tool: copy object in chunks

Instead of reading the entire object and then writing it,
we read it in chunks.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
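
Copying in chunks rather than reading the whole object first can be illustrated with ordinary file-like objects. This sketch does not use librados; the function name and the 4 MiB default chunk size are assumptions.

```python
def copy_in_chunks(src, dst, chunk_size=4 * 1024 * 1024):
    """Copy from `src` to `dst` one chunk at a time; return the bytes copied.

    `src` and `dst` are binary file-like objects standing in for the
    source and destination objects.
    """
    total = 0
    while True:
        buf = src.read(chunk_size)
        if not buf:
            break
        dst.write(buf)
        total += len(buf)
    return total
```

Memory use is bounded by chunk_size instead of by the size of the object being copied.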
13 years agorados tool: copy entire pool
Yehuda Sadeh [Fri, 29 Jun 2012 21:43:00 +0000 (14:43 -0700)]
rados tool: copy entire pool

A new rados tool command that copies an entire pool
into another existing pool.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agorados tool: copy object
Yehuda Sadeh [Fri, 29 Jun 2012 21:09:08 +0000 (14:09 -0700)]
rados tool: copy object

New rados command: rados cp <src-obj> [dest-obj]

Requires specifying source pool. Target pool and locator can be specified.
The new command preserves object xattrs and omap data.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoceph.spec.in: add ceph-disk-{activate,prepare}
Sage Weil [Fri, 6 Jul 2012 15:47:44 +0000 (08:47 -0700)]
ceph.spec.in: add ceph-disk-{activate,prepare}

Reported-by: Jimmy Tang <jtang@tchpc.tcd.ie>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoAllow URL-safe base64 cephx keys to be decoded.
Wido den Hollander [Thu, 5 Jul 2012 13:29:54 +0000 (15:29 +0200)]
Allow URL-safe base64 cephx keys to be decoded.

In these cases + and / are replaced by - and _ to prevent problems when using
the base64 strings in URLs.

Signed-off-by: Wido den Hollander <wido@widodh.nl>
Signed-off-by: Sage Weil <sage@inktank.com>
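
The substitution described above (- for + and _ for /) can be undone before decoding. This is a minimal Python sketch with a hypothetical function name, not the cephx decoder itself.

```python
import base64


def decode_key(text):
    """Decode a base64 key given in either the standard or URL-safe alphabet."""
    standard = text.replace("-", "+").replace("_", "/")
    return base64.b64decode(standard)
```

Both spellings of the same key decode to identical bytes, so a key copied out of a URL is accepted unchanged.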
13 years agolibrados: Bump the version to 0.48
Wido den Hollander [Wed, 4 Jul 2012 13:46:04 +0000 (15:46 +0200)]
librados: Bump the version to 0.48

Signed-off-by: Wido den Hollander <wido@widodh.nl>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agorgw-admin: use correct modifier with strptime
Yehuda Sadeh [Wed, 27 Jun 2012 00:28:51 +0000 (17:28 -0700)]
rgw-admin: use correct modifier with strptime

Bug #2658: used %I (12h) instead of %H (24h)

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
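
The difference between the two modifiers can be shown with Python's datetime.strptime, which uses the same directive names as the C function. This is an illustration only; the format string is an assumption and the fixed code is C, so edge-case behaviour may differ.

```python
from datetime import datetime


def parse_24h(text):
    """Parse with %H, which accepts hours 00 through 23."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def parse_12h(text):
    """Parse with %I, which only accepts hours 01 through 12."""
    return datetime.strptime(text, "%Y-%m-%d %I:%M:%S")
```

An afternoon timestamp such as "2012-06-27 17:28:51" parses with %H but is rejected by %I.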
13 years agorgw: send both swift x-storage-token and x-auth-token
Yehuda Sadeh [Thu, 21 Jun 2012 22:40:27 +0000 (15:40 -0700)]
rgw: send both swift x-storage-token and x-auth-token

older clients need x-storage-token, newer x-auth-token

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agorgw: radosgw-admin date params now also accept time
Yehuda Sadeh [Thu, 21 Jun 2012 22:17:19 +0000 (15:17 -0700)]
rgw: radosgw-admin date params now also accept time

The date format now is "YYYY-MM-DD[ hh:mm:ss]". Got rid of
the --time param for the old ops log stuff.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Conflicts:

src/test/cli/radosgw-admin/help.t
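
Accepting "YYYY-MM-DD[ hh:mm:ss]" amounts to trying the long form first and falling back to the date-only form. The sketch below shows that in Python; the function name is hypothetical and radosgw-admin's own parsing code is not reproduced here.

```python
from datetime import datetime


def parse_date_param(text):
    """Parse 'YYYY-MM-DD' with an optional ' hh:mm:ss' suffix."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"expected YYYY-MM-DD[ hh:mm:ss], got {text!r}")
```

A date without a time resolves to midnight at the start of that day.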

13 years agorgw-admin: fix usage help
Yehuda Sadeh [Thu, 21 Jun 2012 20:14:47 +0000 (13:14 -0700)]
rgw-admin: fix usage help

s/show/trim

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoradosgw-admin: fix clit test
Sage Weil [Tue, 3 Jul 2012 21:07:16 +0000 (14:07 -0700)]
radosgw-admin: fix clit test

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoceph: fix cli help test
Sage Weil [Tue, 3 Jul 2012 18:32:57 +0000 (11:32 -0700)]
ceph: fix cli help test

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoReplicatedPG: remove faulty scrub assert in sub_op_modify_applied
Samuel Just [Tue, 3 Jul 2012 18:23:16 +0000 (11:23 -0700)]
ReplicatedPG: remove faulty scrub assert in sub_op_modify_applied

This assert assumed that all ops submitted before MOSDRepScrub was
submitted were processed by the time that MOSDRepScrub was
processed.  In fact, MOSDRepScrub's scrub_to may refer to a
last_update yet to be seen by the replica.

Bug #2693

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoceph: better usage
Kyle Bader [Tue, 3 Jul 2012 18:20:38 +0000 (11:20 -0700)]
ceph: better usage

Signed-off-by: Kyle Bader <kyle.bader@dreamhost.com>
13 years agodebian: strip new ceph-mds package
Sage Weil [Tue, 3 Jul 2012 16:20:35 +0000 (09:20 -0700)]
debian: strip new ceph-mds package

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoconfig: remove bad argparse_flag argument in parse_option()
Sage Weil [Tue, 3 Jul 2012 13:46:10 +0000 (06:46 -0700)]
config: remove bad argparse_flag argument in parse_option()

This is wrong, and thankfully valgrind picks it up.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomsgr: restart_queue when replacing existing pipe and taking over the queue
Sage Weil [Mon, 2 Jul 2012 00:23:28 +0000 (17:23 -0700)]
msgr: restart_queue when replacing existing pipe and taking over the queue

The queue may have been previously stopped (by discard_queue()), and needs
to be restarted.

Fixes consistent failures from the mon_recovery.py integration tests.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomsgr: choose incoming connection if ours is STANDBY
Sage Weil [Sun, 1 Jul 2012 22:37:31 +0000 (15:37 -0700)]
msgr: choose incoming connection if ours is STANDBY

If the connect_seq matches, but our existing connection is in STANDBY, take
the incoming one.  Otherwise, the other end will wait indefinitely for us
to connect but we won't.

Alternatively, we could "win" the race and trigger a connection by sending
a keepalive (or similar), but that is more work; we may as well accept the
incoming connection we have now.

This removes STANDBY from the acceptable WAIT case states.  It also keeps
responsibility squarely on the shoulders of the peer with something to
deliver.

Without this patch, a 3-osd vstart cluster with
'ms inject socket failures = 100' and rados bench write -b 4096 would start
generating slow request warnings after a few minutes due to the osds
failing to connect to each other.  With the patch, I complete a 10 minute
run without problems.

Signed-off-by: Sage Weil <sage@inktank.com>
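
The rule described in the commit above can be sketched as a small decision function. Only the STANDBY state and the connect_seq comparison come from the commit message; the function name, the string return values, and the would_wait flag (standing in for the messenger's other tie-breaking checks) are assumptions made for illustration.

```python
def resolve_connect_race(existing_state, connect_seq_matches, would_wait):
    """Decide what to do with an incoming connection that races an existing one.

    would_wait is a hypothetical stand-in for the remaining tie-breaking
    conditions, which the commit message does not spell out.
    """
    if connect_seq_matches and existing_state == "STANDBY":
        # Our side is idle and will not connect on its own, so take the
        # peer's attempt instead of making it wait indefinitely.
        return "accept_incoming"
    if connect_seq_matches and would_wait:
        return "wait"
    return "defer_to_other_checks"
```

The point of the sketch is the ordering: the STANDBY check comes before the WAIT case, so a STANDBY connection never leaves the peer waiting.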
13 years agomsgr: preserve incoming message queue when replacing pipes
Sage Weil [Fri, 29 Jun 2012 00:50:47 +0000 (17:50 -0700)]
msgr: preserve incoming message queue when replacing pipes

If we replace an existing pipe with a new one, move the incoming queue
of messages that have not yet been dispatched over to the new Pipe so that
they are not lost.

Alternatively, we could set in_seq = existing->in_seq - existing->in_qlen,
but that would make the other end resend those messages, which is a waste
of bandwidth.

Very easy to reproduce the original bug with 'ms inject socket failures'.

Signed-off-by: Sage Weil <sage@inktank.com>
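
The two options weighed in the commit above can be sketched as follows. This is a toy model, not the messenger code: the Pipe class, its fields, and the function names are simplified assumptions, with in_seq counting received messages and in_queue holding those not yet dispatched.

```python
from collections import deque


class Pipe:
    """Toy stand-in for a messenger pipe (hypothetical, heavily simplified)."""

    def __init__(self):
        self.in_seq = 0          # sequence number of the last message received
        self.in_queue = deque()  # received but not yet dispatched

    def receive(self, msg):
        self.in_seq += 1
        self.in_queue.append(msg)


def replace_pipe(existing, new):
    """Chosen approach: hand undispatched messages over to the new pipe."""
    new.in_queue.extend(existing.in_queue)
    existing.in_queue.clear()
    new.in_seq = existing.in_seq


def resend_point(existing):
    """Rejected alternative: rewind in_seq so the peer resends the queued messages."""
    return existing.in_seq - len(existing.in_queue)
```

Moving the queue keeps in_seq unchanged, so the peer has nothing to resend; rewinding would be correct as well but costs the bandwidth of every queued message.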
13 years agomsgr: move dispatch_entry into DispatchQueue class
Sage Weil [Fri, 29 Jun 2012 00:45:24 +0000 (17:45 -0700)]
msgr: move dispatch_entry into DispatchQueue class

A bit cleaner.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomsgr: move incoming queue to separate class
Sage Weil [Fri, 29 Jun 2012 00:38:34 +0000 (17:38 -0700)]
msgr: move incoming queue to separate class

This extricates the incoming queue and its funky relationship with
DispatchQueue from Pipe and moves it into IncomingQueue.  There is now a
single IncomingQueue attached to each Pipe.  DispatchQueue is now no
longer tied to Pipe.

This modularizes the code a bit better (though that is still a work in
progress) and (more importantly) will make it possible to move the
incoming messages from one pipe to another in accept().

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomsgr: make D_CONNECT constant non-zero, fix ms_handle_connect() callback
Sage Weil [Thu, 28 Jun 2012 00:06:40 +0000 (17:06 -0700)]
msgr: make D_CONNECT constant non-zero, fix ms_handle_connect() callback

A while ago we inadvertently broke ms_handle_connect() callbacks because
of a check for m being non-zero in the dispatch_entry() thread.  Adjust the
enums so that they get delivered again.

This fixes hangs when, for example, the ceph tool sends a command, gets a
connection reset, and doesn't get the connect callback to resend after
reconnecting to a new monitor.

Signed-off-by: Sage Weil <sage@inktank.com>
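
The failure mode described in the commit above, an event code of zero being swallowed by a non-zero check, can be reproduced in a few lines. This is a sketch under assumptions: the queue layout, the dispatch function, and the callback names as strings are illustrative only and do not mirror the real dispatch_entry() thread.

```python
def dispatch(queue, d_connect):
    """Return the callbacks fired for a queue of messages and integer event codes.

    d_connect is the numeric value chosen for the connect event (hypothetical model).
    """
    fired = []
    for m in queue:
        if not m:
            # Guard meant to skip empty entries; it also skips any event code of 0.
            continue
        if m == d_connect:
            fired.append("ms_handle_connect")
        else:
            fired.append("ms_dispatch")
    return fired
```

With d_connect equal to 0 the connect event never reaches its callback; starting the constants at 1 lets the same guard stay in place.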
13 years agomsgr: fix pipe replacement assert
Sage Weil [Wed, 27 Jun 2012 00:10:40 +0000 (17:10 -0700)]
msgr: fix pipe replacement assert

We may replace an existing pipe in the STANDBY state if the previous
attempt failed during accept() (see previous patches).

This might fix #1378.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomsgr: do not try to reconnect con with CLOSED pipe
Sage Weil [Wed, 27 Jun 2012 00:07:31 +0000 (17:07 -0700)]
msgr: do not try to reconnect con with CLOSED pipe

If we have a con with a closed pipe, drop the message.  For lossless
sessions, the state will be STANDBY if we should reconnect.  For lossy
sessions, we will end up with CLOSED and we *should* drop the message.

Signed-off-by: Sage Weil <sage@inktank.com>
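
The distinction drawn in the commit above between lossy and lossless sessions can be sketched as two small functions. The state names CLOSED and STANDBY come from the commit message; the function names and the "OPEN" and "queue" values are assumptions for illustration.

```python
def state_after_failure(lossy):
    """State a failed pipe ends up in: lossy sessions close, lossless ones stand by."""
    return "CLOSED" if lossy else "STANDBY"


def submit_message(pipe_state):
    """Decide what to do with an outgoing message given the pipe state (hypothetical)."""
    if pipe_state == "CLOSED":
        return "drop"       # lossy session is gone; do not reconnect
    if pipe_state == "STANDBY":
        return "reconnect"  # lossless session kept its state and should resume
    return "queue"          # any other state: hand the message to the live pipe
```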
13 years agomsgr: move to STANDBY if we replace during accept and then fail
Sage Weil [Wed, 27 Jun 2012 00:06:41 +0000 (17:06 -0700)]
msgr: move to STANDBY if we replace during accept and then fail

If we replace an existing pipe during accept() and then fail, move to
STANDBY so that our connection state (connect_seq, etc.) is preserved.
Otherwise, we will throw out that information and falsely trigger a
RESETSESSION on the next connection attempt.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agov0.48argonaut v0.48argonaut
Sage Weil [Sat, 30 Jun 2012 21:50:20 +0000 (14:50 -0700)]
v0.48argonaut

13 years agoceph.spec.in: Change license of base package to GPL and use SPDX format
Holger Macht [Mon, 2 Jul 2012 20:54:48 +0000 (13:54 -0700)]
ceph.spec.in: Change license of base package to GPL and use SPDX format

LGPLv2 in spec file is not correct, because some of the included
packages/binaries are GPLv2. For example:

 src/mount/mtab.c     -> package ceph, binary mount.ceph
 src/common/fiemap.cc -> package ceph, binary rbd

Also use SPDX format (http://www.spdx.org/licenses) for the sub-package
licenses.

Signed-off-by: Holger Macht <hmacht@suse.de>
13 years agomon: initialize quorum_features
Sage Weil [Mon, 2 Jul 2012 23:05:16 +0000 (16:05 -0700)]
mon: initialize quorum_features

This could cause us to incorrectly encode new features into the monstore
that an old mon won't understand.

This is overly conservative; we probably need to persist the set of quorum
features that are supported and use those.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoOSD::do_command: unlock pg only if we had it
Samuel Just [Mon, 2 Jul 2012 16:51:37 +0000 (09:51 -0700)]
OSD::do_command: unlock pg only if we had it

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agoMOSDSubOp: set hobject_incorrect_pool in decode_payload
Samuel Just [Mon, 2 Jul 2012 16:49:52 +0000 (09:49 -0700)]
MOSDSubOp: set hobject_incorrect_pool in decode_payload

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agofilestore: initialize m_filestore_do_dump
Sage Weil [Mon, 2 Jul 2012 14:10:33 +0000 (07:10 -0700)]
filestore: initialize m_filestore_do_dump

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agoosdmap: check new pool name on rename
Sage Weil [Sat, 30 Jun 2012 02:56:07 +0000 (19:56 -0700)]
osdmap: check new pool name on rename

Ensure the new pool name doesn't already exist, both in the current and
projected map.

Signed-off-by: Sage Weil <sage@inktank.com>
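
The check described in the commit above can be sketched as a single predicate. The function and parameter names are hypothetical; pending_names stands for the pool names in the map that is being prepared but not yet committed.

```python
def can_rename_pool(new_name, current_names, pending_names):
    """Allow a rename only if the new name is unused in both maps (hypothetical helper)."""
    return new_name not in current_names and new_name not in pending_names
```

Checking the not-yet-committed names as well prevents two renames or creations in flight from ending up with the same name.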
13 years agoosd: handle pool name changes properly
Sage Weil [Sat, 30 Jun 2012 02:54:35 +0000 (19:54 -0700)]
osd: handle pool name changes properly

 * Remove the old name from the name->id map.

Fixes: #2676
Signed-off-by: Sage Weil <sage@inktank.com>
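
The fix in the commit above, removing the old name from the name->id map, can be sketched with two dictionaries. The function name and the dictionary layout are assumptions for illustration, not the OSDMap data structures.

```python
def apply_pool_rename(name_to_id, id_to_name, pool_id, new_name):
    """Rename a pool and keep the reverse lookup map consistent (hypothetical model)."""
    old_name = id_to_name[pool_id]
    # Without this removal the old name would keep resolving to the pool.
    name_to_id.pop(old_name, None)
    id_to_name[pool_id] = new_name
    name_to_id[new_name] = pool_id
```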
13 years agomon: 'osd pool rename <oldname> <newname>'
Sage Weil [Fri, 29 Jun 2012 21:51:32 +0000 (14:51 -0700)]
mon: 'osd pool rename <oldname> <newname>'

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agorest-bench: mark request as complete later
Yehuda Sadeh [Wed, 27 Jun 2012 00:16:11 +0000 (17:16 -0700)]
rest-bench: mark request as complete later

We marked a request as complete in the callback, however
it might be that we're still inside S3_runall_request_context()
which means that request is not really complete yet.
Possibly fixes bug #2652.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
13 years agoDBObjectMap: clones must inherit spos from parent
Samuel Just [Thu, 28 Jun 2012 01:09:37 +0000 (18:09 -0700)]
DBObjectMap: clones must inherit spos from parent

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agofilestore: sync object_map object in lfn_remove when nlink > 1
Samuel Just [Wed, 27 Jun 2012 22:16:42 +0000 (15:16 -0700)]
filestore: sync object_map object in lfn_remove when nlink > 1

In the following sequence:

1) create (a, 1)
2) setattr (a, 1)
3) link (a, 1), (b, 1)
4) remove (a, 1)

If we play 1-4 and then replay 1-4 again, we will end up removing
(b, 1)'s attributes since nlink for (a, 1) the second time through
is 1.  We fix this by marking spos on the object_map header for
(a, 1) when we remove (a, 1) but not the attributes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
13 years agodebian: move metadata server into ceph-mds
Sage Weil [Mon, 18 Jun 2012 16:29:48 +0000 (09:29 -0700)]
debian: move metadata server into ceph-mds

Also adjust the recommends and depends, so that libcephfs1 and ceph-fuse
hang off of ceph-mds instead of ceph.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodebian: move mount.ceph and cephfs into ceph-fs-common
Sage Weil [Mon, 18 Jun 2012 16:20:40 +0000 (09:20 -0700)]
debian: move mount.ceph and cephfs into ceph-fs-common

Based on patches from Laszlo Boszormenyi (GCS) <gcs@debian.hu>.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodebian: arch linux-any
Sage Weil [Mon, 18 Jun 2012 16:15:56 +0000 (09:15 -0700)]
debian: arch linux-any

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agodebian: build with libnss instead of crypto++
Laszlo Boszormenyi (GCS) [Sat, 16 Jun 2012 20:39:56 +0000 (13:39 -0700)]
debian: build with libnss instead of crypto++

Signed-off-by: Laszlo Boszormenyi (GCS) <gcs@debian.hu>
13 years agodoc/config-cluster/authentication: keyring default locations, simplify key management
Sage Weil [Tue, 12 Jun 2012 19:47:57 +0000 (12:47 -0700)]
doc/config-cluster/authentication: keyring default locations, simplify key management

- keyrings have new default locations that everyone should use.
- the user key setup is vastly simplified if you use the
  'ceph auth get-or-create' command.

Signed-off-by: Sage Weil <sage@inktank.com>
13 years agomon: MonmapMonitor: Use default port when the port specified on 'add' is zero
Joao Eduardo Luis [Wed, 27 Jun 2012 23:29:24 +0000 (00:29 +0100)]
mon: MonmapMonitor: Use default port when the port specified on 'add' is zero

Fixes a bug triggered by using the ceph tool to 'mon add' with a port set
to zero. We now default to the monitor's default port (6789) instead, and
we will fail if that port is already assigned to some other monitor.

Fixes: bug #2661
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
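
The behaviour described in the commit above can be sketched as follows. The default port 6789 is taken from the commit message; the mon_add function, the dictionary used as a monmap, and the choice to compare whole (ip, port) addresses are assumptions for illustration.

```python
DEFAULT_MON_PORT = 6789  # monitor default port named in the commit message


def mon_add(monmap, name, ip, port):
    """Add a monitor, substituting the default port for 0 (hypothetical model)."""
    if port == 0:
        port = DEFAULT_MON_PORT
    addr = (ip, port)
    if addr in monmap.values():
        # The resulting address is already taken by another monitor.
        raise ValueError("address %s:%d already in monmap" % addr)
    monmap[name] = addr
    return addr
```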
13 years agoOSD: disconnect_session_watches: handle race with watch disconnect
Samuel Just [Tue, 26 Jun 2012 17:38:20 +0000 (10:38 -0700)]
OSD: disconnect_session_watches: handle race with watch disconnect

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
13 years agomon: don't tick the PaxosServices if we are currently slurping.
Greg Farnum [Mon, 25 Jun 2012 20:04:15 +0000 (13:04 -0700)]
mon: don't tick the PaxosServices if we are currently slurping.

They aren't prepared to deal with the on-disk state being inconsistent.

Signed-off-by: Greg Farnum <greg@inktank.com>
13 years agoobjecter: do not feed session to op_submit()
Sage Weil [Wed, 20 Jun 2012 18:07:29 +0000 (11:07 -0700)]
objecter: do not feed session to op_submit()

The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly.  The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor.  It
didn't affect the test results because we weren't testing whether the
watch was working in that case.

Instead, we'll just recalculate and get the same value the parent linger
op did.  Which is fine, and goes through all the usual code paths so
nothing is missed.

Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>