Samuel Just [Tue, 23 Sep 2014 22:52:08 +0000 (15:52 -0700)]
ReplicatedPG: don't move on to the next snap immediately
If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time. Instead, requeue.
Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659)
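A minimal sketch of the requeue idea, with invented names (this is not the actual ReplicatedPG snap trimmer): rather than advancing through a run of object-less snaps in a tight loop under the PG lock, process one snap and put the work item back on the queue.
```cpp
#include <deque>
#include <set>

struct SnapTrimSketch {
  std::set<unsigned> trimmed_snaps;  // snaps queued for trimming
  std::deque<unsigned> work_queue;   // scheduler's work queue

  bool has_objects_for(unsigned) const { return false; }  // stub

  void trim_next() {
    if (trimmed_snaps.empty())
      return;
    unsigned snap = *trimmed_snaps.begin();
    if (!has_objects_for(snap)) {
      trimmed_snaps.erase(snap);
      // Requeue instead of moving on to the next snap immediately, so a
      // long run of object-less snaps cannot spin while holding the lock.
      work_queue.push_back(snap);
      return;
    }
    // ... trim the objects belonging to `snap` ...
  }
};
```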
Sage Weil [Tue, 23 Sep 2014 23:21:33 +0000 (16:21 -0700)]
osd: initialize purged_snap on backfill start; restart backfill if change
If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps. As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held. This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.
Resolve this by initializing purged_snaps when we finish backfill. The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps. Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).
If we happen to interrupt backfill, go clean with other OSDs,
purge snaps, and then let this OSD rejoin, we will either restart
backfill (non-contiguous log) or the log will include the result of
the snap trim (the events that remove the trimmed snap).
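A hedged sketch of the initialization, using invented simplified structures (not the real Ceph PG info types): when backfill completes, the target's object set already matches the primary, so the primary's purged set is authoritative for the target too.
```cpp
#include <set>

// Invented, simplified stand-in for PG info; not the real Ceph type.
struct PGInfoSketch {
  std::set<unsigned> purged_snaps;
};

// On backfill completion, copy the primary's purged_snaps so the snap
// trimmer does not have to replay every deleted snap for all time.
void on_backfill_complete(PGInfoSketch &target, const PGInfoSketch &primary) {
  target.purged_snaps = primary.purged_snaps;
}
```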
Jason Dillaman [Tue, 18 Nov 2014 02:49:26 +0000 (21:49 -0500)]
librbd: protect list_children from invalid child pool IoCtxs
While listing child images, don't ignore error codes returned
from librados when creating an IoCtx. This will prevent seg
faults from occurring when an invalid IoCtx is used.
Fixes: #10123
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 0d350b6817d7905908a4e432cd359ca1d36bab50)
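A minimal sketch of the pattern the fix enforces, shown against the public librados C++ API rather than the librbd internals: check the return value of ioctx_create() before using the IoCtx.
```cpp
#include <rados/librados.hpp>
#include <iostream>
#include <string>

// Returns 0 on success or the negative errno from librados.
int open_child_pool(librados::Rados &cluster, const std::string &pool,
                    librados::IoCtx &ioctx) {
  int r = cluster.ioctx_create(pool.c_str(), ioctx);
  if (r < 0) {
    // Propagate the error instead of using the invalid IoCtx, which is
    // what could previously lead to a segfault.
    std::cerr << "failed to open pool " << pool << ": " << r << std::endl;
    return r;
  }
  return 0;
}
```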
Loic Dachary [Thu, 9 Oct 2014 16:52:17 +0000 (18:52 +0200)]
ceph-disk: run partprobe after zap
Not running partprobe after zapping a device can lead to the following:
* ceph-disk prepare /dev/loop2
* links are created in /dev/disk/by-partuuid
* ceph-disk zap /dev/loop2
* links are not removed from /dev/disk/by-partuuid
* ceph-disk prepare /dev/loop2
* some links are not created in /dev/disk/by-partuuid
This assumes there is a bug in the way udev events are handled by
the operating system.
Loic Dachary [Fri, 10 Oct 2014 08:23:34 +0000 (10:23 +0200)]
ceph-disk: encapsulate partprobe / partx calls
Add the update_partition function to reduce code duplication.
The action is made an argument: although it is always -a for now, it will
be -d when deleting a partition.
Use the update_partition function in prepare_journal_dev
Sage Weil [Mon, 20 Oct 2014 20:55:33 +0000 (13:55 -0700)]
osd: discard rank > 0 ops on erasure pools
Erasure pools do not support read from replica, so we should drop
any rank > 0 requests.
This fixes a bug where an erasure pool maps to [1,2,3], temporarily maps
to [-1,2,3], sends a request to osd.2, and then remaps back to [1,2,3].
Because the 0 shard never appears on osd.2, the request sits in the
waiting_for_pg map indefinitely and causes slow request warnings.
This problem does not come up on replicated pools because all instances of
the PG are created equal.
Fix by only considering role == 0 for erasure pools as a correct mapping.
Fixes: #9835
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Thu, 13 Nov 2014 01:04:35 +0000 (17:04 -0800)]
osd/OSDMap: add osd_is_valid_op_target()
Helper to check whether an osd is a given op target for a pg. This
assumes that for EC we always send ops to the primary, while for
replicated we may target any replica.
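A hedged sketch of what such a helper checks, with simplified invented types (not the actual OSDMap code): erasure-coded pools only accept ops on the primary, while replicated pools accept them on any acting OSD.
```cpp
#include <algorithm>
#include <vector>

// Invented, simplified mapping type; not the real OSDMap structures.
struct PgMappingSketch {
  std::vector<int> acting;  // acting[0] is the primary (role 0)
  bool erasure_coded;
};

bool is_valid_op_target(const PgMappingSketch &pg, int osd) {
  if (pg.acting.empty())
    return false;
  if (pg.erasure_coded)
    return pg.acting[0] == osd;  // EC: only the primary is a valid target
  // Replicated: reads may be served by any OSD in the acting set.
  return std::find(pg.acting.begin(), pg.acting.end(), osd) != pg.acting.end();
}
```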
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds
The local filesystem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.
Sage Weil [Sun, 25 May 2014 15:38:38 +0000 (08:38 -0700)]
osd: fix map advance limit to handle map gaps
The recent change in cf25bdf6b0090379903981fe8cee5ea75efd7ba0 would stop
advancing after some number of epochs, but did not take into consideration
the possibility that there are missing maps. In that case, it is impossible
to advance past the gap.
Fix this by increasing the max epoch as we go so that we can always get
beyond the gap.
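A hedged sketch of the gap handling, with invented names (not the actual OSD map-advance code): when an epoch's map is missing, extend the per-call limit so the advance can always clear the gap instead of stalling below it.
```cpp
#include <set>

// Step `cur` toward `target`, normally at most `max_step` epochs per
// call, but stretch the limit across epochs whose maps are missing.
int advance_map_sketch(const std::set<int> &stored_maps,
                       int cur, int target, int max_step) {
  int limit = cur + max_step;
  while (cur < target && cur < limit) {
    ++cur;
    if (!stored_maps.count(cur))
      ++limit;  // missing map: raise the limit rather than stall in the gap
  }
  return cur;
}
```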
John Spray [Fri, 7 Nov 2014 11:34:43 +0000 (11:34 +0000)]
tools: fix MDS journal import
Previously it only worked on fresh filesystems which
hadn't been trimmed yet, and resulted in an invalid
trimmed_pos when expire_pos wasn't on an object
boundary.
Sage Weil [Mon, 15 Sep 2014 22:29:08 +0000 (15:29 -0700)]
ceph-disk: mount xfs with inode64 by default
We did this forever ago with mkcephfs, but ceph-disk didn't. Note that for
modern XFS this option is obsolete, but for older kernels it was not the
default.
Yehuda Sadeh [Thu, 9 Oct 2014 17:20:27 +0000 (10:20 -0700)]
rgw: set length for keystone token validation request
Fixes: #7796
Backport: giant, firefly
We need to set the content length on this request, as the server might not
handle a chunked request (even though we don't send anything).
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 3dd4ccad7fe97fc16a3ee4130549b48600bc485c)
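A hedged sketch of the idea shown with plain libcurl (rgw's own HTTP client wrapper differs; the function name here is invented): giving libcurl an explicit body size makes it emit a Content-Length header instead of Transfer-Encoding: chunked.
```cpp
#include <curl/curl.h>

// Even for an empty body, an explicit size makes libcurl send
// Content-Length rather than a chunked request.
void prepare_validation_request(CURL *curl, const char *body, long body_len) {
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, body_len);
}
```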
Yehuda Sadeh [Tue, 19 Aug 2014 20:15:46 +0000 (13:15 -0700)]
rgw: subuser creation fixes
Fixes: #8587
There were a couple of issues. First, when trying to identify whether the
swift user exists, we weren't using the correct swift id. Second, we
relied on the gen_access flag in the swift case, where it doesn't really
need to apply.
Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid
m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.
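A minimal sketch of the comparison fix, with simplified types (not the actual Client code): downcast the wide value so both sides compare at 16 bits.
```cpp
#include <cstdint>

// After the narrow counter wraps, promoting it to 64 bits would never
// match; downcasting the message tid keeps the comparison consistent.
bool cap_flush_tid_matches(uint64_t client_tid, uint16_t flushing_cap_tid) {
  return static_cast<uint16_t>(client_tid) == flushing_cap_tid;
}
```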
Samuel Just [Mon, 29 Sep 2014 22:01:25 +0000 (15:01 -0700)]
PG: release backfill reservations if a backfill peer rejects
Also, the full peer will wait until the rejection from the primary
to do a state transition.
Fixes: #9626
Backport: giant, firefly, dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 624aaf2a4ea9950153a89ff921e2adce683a6f51)
Samuel Just [Mon, 20 Oct 2014 21:10:58 +0000 (14:10 -0700)]
PG: reset_interval_flush in set_last_peering_reset
If we have a change in the prior set, but not in the up/acting set, we go back
through Reset in order to reset peering state. Previously, we would reset
last_peering_reset in the Reset constructor. This did not, however, reset the
flush_interval, which caused the eventual flush event to be ignored and the
peering messages to not be sent.
Instead, we will always reset_interval_flush if we are actually changing the
last_peering_reset value.
Xiaoxi Chen [Wed, 20 Aug 2014 07:35:44 +0000 (15:35 +0800)]
CrushWrapper: pick a ruleset same as rule_id
Originally, in the add_simple_ruleset function, the ruleset_id is not
reused but the rule_id is reused. So after some adds and removes of
rules, a newly created rule is likely to have ruleset != rule_id.
We don't want this to happen because we are trying to hold the constraint
that ruleset == rule_id.
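A minimal sketch of the invariant the fix restores, using simplified invented data structures (not the real CrushWrapper): pick the first free rule id and reuse the same value for the ruleset.
```cpp
#include <map>
#include <string>

struct RuleSketch {
  int ruleset;
  std::string name;
};

// Reusing the chosen rule id as the ruleset id keeps ruleset == rule_id
// even after earlier rules have been removed.
int add_simple_rule(std::map<int, RuleSketch> &rules, const std::string &name) {
  int rule_id = 0;
  while (rules.count(rule_id))  // first unused rule id
    ++rule_id;
  rules[rule_id] = RuleSketch{rule_id, name};
  return rule_id;
}
```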
Ma Jianpeng [Thu, 21 Aug 2014 07:10:46 +0000 (15:10 +0800)]
os/FileJournal: For journal-aio-mode, don't use aio when closing journal.
For journal-aio-mode, when closing the journal, the write_finish_thread_entry
may exit before write_thread_entry. This leaves no one to wait for the last
aios to complete. On some platforms the journal header then ends up corrupted.
To avoid this, don't use aio when closing the journal.
Fixes: #9073
Reported-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
(cherry picked from commit e870fd09ce846e5642db268c33bbe8e2e17ffef2)
Ma Jianpeng [Wed, 23 Jul 2014 17:10:38 +0000 (10:10 -0700)]
os/FileJournal: Update the journal header when closing journal
When closing the journal, check must_write_header and update the
journal header if must_write_header is already set.
This avoids needless journal replay after restarting the osd.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 5bf472aefb7360a1fe17601b42e551df120badfb)
Sage Weil [Sun, 21 Sep 2014 22:56:18 +0000 (15:56 -0700)]
osd/ReplicatedPG: do not clone or preserve snapdir on cache_evict
If we cache_evict a head in a cache pool, we need to prevent
make_writeable() from cloning the head and finish_ctx() from
preserving the snapdir object.
Otherwise statfs may fail if mkfs hasn't been run yet or if the monitor
data directory does not exist. There are checks to account for the mon
data dir not existing and we should wait for them to clear before we go
ahead and check the fs stats.
Sage Weil [Thu, 18 Sep 2014 21:23:36 +0000 (14:23 -0700)]
mon: re-bootstrap if we get probed by a mon that is way ahead
During bootstrap we verify that our paxos commits overlap with the other
mons we will form a quorum with. If they do not, we do a sync.
However, it is possible we pass those checks, then fail to join a quorum
before the quorum moves ahead in time such that we no longer overlap.
Currently nothing kicks us back into a probing state to discover we need
to sync... we will just keep trying to call or join an election instead.
Fix this by jumping back to bootstrap if we get a probe that is ahead of
us. Only do this from states other than probing and syncing, where such
probes are common; it is only the active and electing states that matter
(and probably just electing!).
Sage Weil [Wed, 13 Aug 2014 23:17:02 +0000 (16:17 -0700)]
mon/Paxos: share state and verify contiguity early in collect phase
We verify peons are contiguous and share new paxos states to catch peons
up at the end of the round. Do this each time we (potentially) get new
states via a collect message. This will allow peons to be pulled forward
and remain contiguous when they otherwise would not have been able to.
For example, suppose mon.1 has 15..25, mon.2 has 28..40, and our own state
runs through 30. If we got mon.1's collect first and then mon.2's second,
we would store the new txns and then boot mon.1 out at the end because
15..25 is not contiguous with 28..40. However, with this change, we share
26..30 with mon.1 when we get its collect, and then 31..40 when we get
mon.2's collect, pulling them both into the final quorum.
It also breaks the 'catch-up' work into smaller pieces, which ought to
smooth out latency a bit.
Sage Weil [Thu, 14 Aug 2014 23:55:58 +0000 (16:55 -0700)]
mon/Paxos: verify all new peons are still contiguous at end of round
During the collect phase we verify that each peon has versions overlapping
or contiguous with ours (and can therefore be caught up with some
series of transactions). However, we *also* assimilate any new states we
get from those peers, and that may move our own first_committed forward
in time. This means that an early responder might have originally been
contiguous, but a later one moved us forward, and when the round finished
they were not contiguous any more. This leads to a crash on the peon
when they get our first begin message.
For example:
- we have 10..20
- first peon has 5..15
- ok!
- second peon has 18..30
- we apply this state
- we are now 18..30
- we finish the round
- send commit to first peon (empty.. we aren't contiguous)
- send no commit to second peon (we match)
- we send a begin for state 31
- first peon crashes (its lc is still 15)
Prevent this by checking at the end of the round if we are still
contiguous. If not, bootstrap. This is similar to the check we do above,
but in reverse: make sure *we* aren't too far ahead of *them*.
The writer thread may submit a new aio to update the header in its
final moments before shutting down. Do not stop the aio thread until after
that has happened or else we may not wait for those aio completions.
Loic Dachary [Tue, 3 Jun 2014 11:05:19 +0000 (13:05 +0200)]
documentation: update osd pool create erasure
The properties are replaced with erasure code profiles. Remove the
reference to properties and the documentation of each erasure-code
related property.
David Zafman [Wed, 24 Sep 2014 23:02:21 +0000 (16:02 -0700)]
osd: Return EOPNOTSUPP if a set-alloc-hint occurs with OSDs that don't support it
Add CEPH_FEATURE_OSD_SET_ALLOC_HINT feature bit
Collect the intersection of all peer feature bits during peering
When handling CEPH_OSD_OP_SETALLOCHINT check that all OSDs support it
by checking for CEPH_FEATURE_OSD_SET_ALLOC_HINT feature bit.
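A hedged sketch of gating an op on a feature bit (the constant and names here are illustrative placeholders, not the real Ceph values): collect the intersection of all peer feature bits, then require the bit before applying the hint.
```cpp
#include <cstdint>

// Placeholder bit; the real CEPH_FEATURE_OSD_SET_ALLOC_HINT value differs.
constexpr uint64_t FEATURE_SET_ALLOC_HINT = 1ULL << 0;

struct PeerFeaturesSketch {
  uint64_t intersection = ~0ULL;  // start with all bits set
  void add_peer(uint64_t features) { intersection &= features; }  // AND across peers
};

// The op is only safe if every peer advertised the feature bit;
// otherwise the caller should fail the op with EOPNOTSUPP.
bool all_osds_support_alloc_hint(const PeerFeaturesSketch &pf) {
  return (pf.intersection & FEATURE_SET_ALLOC_HINT) != 0;
}
```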
Sage Weil [Thu, 8 May 2014 21:19:22 +0000 (14:19 -0700)]
osd/ReplicatedPG: carry CopyOpRef in copy_from completion
There is a race with copy_from cancellation. The internal Objecter
completion decodes a bunch of data and copies it into pointers provided
when the op is queued. When we cancel, we need to ensure that we can cope
until control passes back to our provided completion.
Once we *do* get into the (ReplicatedPG) callbacks, we will bail out
because the tid in the CopyOp or FlushOp no longer matches.
Fix this by carrying a ref to keep the copy-from targets alive, and
clearing out the tids that we cancel.
Note that previously, the trigger for this was that the tid changes when
we handle a redirect, which made the op_cancel() call fail. With the
coming Objecter changes, this will no longer be the case. However, there
are also locking and threading changes that will make cancellation racy,
so we will not be able to rely on it always preventing the callback.
Either way, this will avoid the problem.
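A hedged sketch of the lifetime fix, with invented simplified types (not ReplicatedPG itself): capture a shared_ptr to the in-flight copy op in its completion, so the destination buffers stay alive until control returns to us, and bail out if the tid was cleared by cancellation.
```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <string>

// Invented stand-in for the copy-from state; not the real CopyOp.
struct CopyOpSketch {
  uint64_t tid = 0;         // cleared/changed when the op is cancelled
  std::string result_data;  // buffers the objecter completion fills in
};

std::function<void()> make_copy_completion(std::shared_ptr<CopyOpSketch> cop,
                                           uint64_t expected_tid) {
  // The captured shared_ptr keeps the target alive until control returns
  // here, even if the op was cancelled in the meantime.
  return [cop, expected_tid]() {
    if (cop->tid != expected_tid)
      return;  // cancelled or redirected: bail out without touching state
    // ... normal completion handling ...
  };
}
```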
Greg Farnum [Thu, 9 Oct 2014 22:12:19 +0000 (15:12 -0700)]
mds: Locker: fix a NULL deref in _update_cap_fields
The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.
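A minimal sketch of the NULL-deref fix, with simplified stand-in types (not the actual Locker code): only dereference the message pointer inside the branch that guards it.
```cpp
#include <cstdint>

struct ClientCapsSketch { uint64_t size; };  // stand-in for MClientCaps
struct InodeSketch      { uint64_t size; };

// m may be NULL; deref it only inside the guarded branch, rather than
// reading m->size ahead of time.
void update_cap_fields(InodeSketch *in, int dirty, ClientCapsSketch *m) {
  if (dirty && m) {
    in->size = m->size;
  }
}
```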
Johnu George [Mon, 29 Sep 2014 17:07:44 +0000 (10:07 -0700)]
Crush: Ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per the replica size configured
by the user. When there are more final osds (to be selected as per the
rule) than replicas, the buffer overlaps and causes a crash. Now, at most
num-rep osds are selected even if more osds are allowed by the indep
rule. The fix for firstn rules was already merged as part of bug #9492.
Required test files are added.
Johnu George [Wed, 24 Sep 2014 16:32:50 +0000 (09:32 -0700)]
Crush: Ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per the replica size configured
by the user. When there are more final osds (to be selected as per the
rule) than replicas, the buffer overlaps and causes a crash. Now, at most
num-rep osds are selected even if more osds are allowed by the rule.
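A hedged sketch of the bound the fix enforces, with invented names (not the CRUSH mapper itself): clamp the number of selected OSDs to the replica count so result buffers sized for num-rep cannot be overrun.
```cpp
#include <vector>

// Take at most num_rep OSDs from the candidates a rule produced, so
// buffers sized for the replica count cannot overflow.
std::vector<int> clamp_to_num_rep(const std::vector<int> &candidates,
                                  unsigned num_rep) {
  std::vector<int> out;
  for (int osd : candidates) {
    if (out.size() >= num_rep)
      break;  // never return more OSDs than the buffer was sized for
    out.push_back(osd);
  }
  return out;
}
```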