Sage Weil [Wed, 3 Dec 2014 00:33:11 +0000 (16:33 -0800)]
crush: fix crush_calc_straw() scalers when there are duplicate weights
The straw bucket was originally tested with uniform weights and with a
few more complicated patterns, like a stair step (1,2,3,4,5,6,7,8,9). And
it worked!
However, it does not behave with a pattern like
1, 2, 2, 3, 3, 4, 4
Strangely, it does behave with
1, 1, 2, 2, 3, 3, 4, 4
and more usefully it does behave with
1, 2, 2.001, 3, 3.001, 4, 4.001
That is, the logic that explicitly copes with weights that are duplicates
is broken.
The fix is to simply remove the special handling for duplicate weights --
it isn't necessary and doesn't work correctly anyway.
Add a test that compares the mapping result of [1, 2, 2, 3, 3, ...] with
[1, 2, 2.001, 3, 3.001, ...] and verifies that the difference is small.
With the fix, we get .00012, whereas the original implementation gets
.015.
Note that this changes the *precalculated* straw bucket scaler values that
are encoded with the map, and only when the admin opts into the new behavior.
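The mapping step itself is untouched by this commit; only the precalculated
scalers change. As a rough illustration of the new test's methodology, here
is a toy Python model of straw selection. Ceph's real implementation is
fixed-point C using the Jenkins hash; this sketch substitutes SHA-256 and
floats, and every name in it is illustrative:

    import hashlib

    def draw(x, item):
        # deterministic pseudo-random value in [0, 1) for an (input, item) pair
        h = hashlib.sha256(('%s:%s' % (x, item)).encode()).digest()
        return int.from_bytes(h[:8], 'big') / 2.0 ** 64

    def straw_select(x, straws):
        # each item draws a straw scaled by its precalculated length; longest wins
        return max(straws, key=lambda item: draw(x, item) * straws[item])

    def fraction_changed(a, b, n=10000):
        # the test's idea: what fraction of inputs map differently
        # under two straw tables?
        return sum(straw_select(x, a) != straw_select(x, b) for x in range(n)) / n

Feeding fraction_changed() two tables that differ only by 0.001 perturbations,
as above, should report a tiny fraction, which is what the new test asserts
against the fixed crush_calc_straw().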
Sage Weil [Tue, 2 Dec 2014 22:50:21 +0000 (14:50 -0800)]
crush: fix distortion of straw scalers by 0-weight items
The presence of a 0-weight item in a straw bucket should have no effect
on the placement of other items. Add a test validating that, and fix
crush_calc_straw() to remove the distortion.
Note that this affects the *precalculation* of the straw bucket inputs and
does not affect the actual mapping process given a compiled or encoded
CRUSH map, and only when straw_calc_version == 1 (i.e., the admin opted in
to the new behavior).
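Reusing the toy straw_select()/fraction_changed() sketch from the entry
above, the invariant this test checks can be stated in a few lines (again
purely illustrative, not the actual test code):

    straws = {'a': 1.0, 'b': 2.0, 'c': 3.0}
    with_zero = dict(straws, d=0.0)   # a 0-weight item draws a 0-length straw
    assert fraction_changed(straws, with_zero) == 0.0   # others unaffected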
Rongze Zhu [Fri, 10 Oct 2014 11:18:00 +0000 (19:18 +0800)]
crush: fix incorrect use of adjust_item_weight method
The adjust_item_weight method adjusts the item's weight in every bucket
that contains it. If osd.0 is in both host=fake01 and host=fake02 and we
execute "ceph osd crush osd.0 10 host=fake01", it will adjust not only
fake01's weight but also fake02's.
This patch adds an adjust_item_weightf_in_loc method and fixes the
remove_item, _remove_item_under, update_item, insert_item, and
detach_bucket methods.
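A toy model of the behavioral difference, with buckets reduced to plain
dicts (hypothetical names, not the CrushWrapper API):

    buckets = {'fake01': {'osd.0': 1}, 'fake02': {'osd.0': 1}}

    def adjust_item_weight(buckets, item, weight):
        # old behavior: touch every bucket that contains the item
        for b in buckets.values():
            if item in b:
                b[item] = weight

    def adjust_item_weight_in_loc(buckets, item, weight, loc):
        # fixed behavior: touch the item only under the named location
        if item in buckets[loc]:
            buckets[loc][item] = weight

    adjust_item_weight_in_loc(buckets, 'osd.0', 10, 'fake01')
    assert buckets['fake02']['osd.0'] == 1   # fake02 is left alone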
Sage Weil [Thu, 13 Nov 2014 18:59:22 +0000 (10:59 -0800)]
crush/CrushWrapper: fix detach_bucket
In commit 9850227d2f0ca2f692a154de2c14a0a08e751f08 we replaced the call that
adjusted the weight of all instances of an item with one that explicitly
changes it in the parent bucket, but parent_id may not be valid at the
call site. Move the call into the conditional block to fix this.
Sage Weil [Tue, 2 Dec 2014 18:08:18 +0000 (10:08 -0800)]
crush: recalculate straw scalers during a reweight
The crushtool --reweight function triggers a fresh calculation of bucket
weights so that they are always the sum of the item weights. In the
straw bucket case, the weights were updated but the corresponding straw
scalers were not recalculated. As a result, the reweight had no effect on
placement in the adjusted buckets until the next time a bucket item's
weight was adjusted.
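Schematically, the fix makes the reweight path refresh both pieces of
state. In this sketch, calc_straws stands in for the real scaler
computation done by crush_calc_straw():

    def reweight_bucket(items, calc_straws):
        # a bucket's weight is the sum of its item weights...
        weight = sum(items.values())
        # ...but straw placement is driven by the scalers, which must be
        # recomputed too; skipping this step was the bug described above
        straws = calc_straws(items)
        return weight, straws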
Rongze Zhu [Mon, 10 Nov 2014 16:13:42 +0000 (00:13 +0800)]
crush: fix tree bucket functions
The node weights were incorrect when constructing a tree bucket, and the
tree bucket did not store item ids in its items array, so tree buckets
did not work correctly. This patch fixes both bugs and adds a simple test
for the tree bucket.
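For intuition, a tree bucket keeps per-node weights in which each internal
node carries the sum of its children. A minimal sketch of that invariant,
assuming a power-of-two leaf count and a heap-style array (not CRUSH's
actual node layout):

    def tree_node_weights(leaf_weights):
        # heap layout: root at index 1, leaves occupy the second half
        n = len(leaf_weights)           # assumed to be a power of two
        nodes = [0] * n + list(leaf_weights)
        for i in range(n - 1, 0, -1):
            nodes[i] = nodes[2 * i] + nodes[2 * i + 1]
        return nodes

    assert tree_node_weights([1, 2, 3, 4])[1] == 10   # root = total weight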
Sage Weil [Thu, 8 Jan 2015 19:17:03 +0000 (11:17 -0800)]
osd: requeue PG when we skip handling a peering event
If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.
The queue_null call is not necessary (and is wasted effort) because the
event is still in pg->peering_queue and the PG is queued.
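A toy model of the queueing logic, with hypothetical names (can_handle,
handle, requeue_pg) standing in for the real peering state machine:

    from collections import deque

    def process_one(peering_queue, requeue_pg, can_handle, handle):
        evt = peering_queue[0]     # peek; the event stays in pg->peering_queue
        if not can_handle(evt):
            requeue_pg()           # put the PG back so evt isn't stranded
            return
        peering_queue.popleft()
        handle(evt)

    process_one(deque(['an event']), requeue_pg=lambda: None,
                can_handle=lambda e: False, handle=print)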
Loic Dachary [Fri, 9 Jan 2015 00:32:17 +0000 (01:32 +0100)]
Merge pull request #3127 from ktdreyer/firefly-no-epoch
Revert "ceph.spec.: add epoch"
Reviewed-by: Ken Dreyer <kdreyer@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
Loic Dachary [Fri, 9 Jan 2015 00:28:11 +0000 (01:28 +0100)]
Merge pull request #3220 from ceph/wip-mon-backports.firefly
mon: backports for #9987 against firefly
Reviewed-by: Joao Eduardo Luis <joao@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
Sage Weil [Tue, 23 Dec 2014 23:49:26 +0000 (15:49 -0800)]
osdc/Objecter: handle reply race with pool deletion
We need to handle this scenario:
- send request in epoch X
- osd replies
- pool is deleted in epoch X+1
- client gets map X+1, sends a map check
- client handles reply
-> asserts that no map checks are in flight
This isn't the best solution. We could infer that a map check isn't needed
since the pool existed earlier and doesn't now. But this is firefly and
the fix is no more expensive than the old assert.
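In outline, the handler tolerates the in-flight map check instead of
asserting. A schematic sketch; the field and helper names here are made
up, not the Objecter's:

    class Op:
        map_check_in_flight = False

    def handle_reply(op, cancel_map_check, complete):
        # the pool may have been deleted after the reply was sent, so a
        # map check can legitimately be outstanding here; cancel it
        # rather than asserting that none exists
        if op.map_check_in_flight:
            cancel_map_check(op)
            op.map_check_in_flight = False
        complete(op)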
Fixes: #10372
Signed-off-by: Sage Weil <sage@redhat.com>
Danny Al-Gaaf [Tue, 24 Jun 2014 17:54:17 +0000 (19:54 +0200)]
test/ceph-disk.sh: fix for SUSE
On SUSE, 'which' always returns the full path of (shell) commands, not
e.g. './ceph-conf' as on Debian. Also check for the full path returned
by which.
Loic Dachary [Fri, 13 Jun 2014 12:41:39 +0000 (14:41 +0200)]
tests: prevent kill race condition
When trying to kill a daemon, keep its pid in a variable instead of
retrieving it from the pidfile multiple times. It prevents the following
race condition:
* try to kill ceph-mon
* ceph-mon is in the process of dying and removed its pidfile
* try to kill ceph-mon fails because the pidfile is not found
* another ceph-mon is spawned and fails to bind the port
because the previous ceph-mon is still holding it
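The shape of the fix, sketched in Python (the test itself is shell; the
pidfile handling is the same idea):

    import os, signal

    def kill_daemon(pidfile):
        with open(pidfile) as f:
            pid = int(f.read().strip())   # read the pid exactly once
        os.kill(pid, signal.SIGTERM)      # never re-read the pidfile: the
                                          # dying daemon removes it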
Danny Al-Gaaf [Tue, 24 Jun 2014 22:31:48 +0000 (00:31 +0200)]
osd/OSD.cc: parse lsb release data via lsb_release
Use the lsb_release tool to be portable, since the format of
/etc/lsb-release differs between distributions. The old code failed to
parse LSB information on SUSE products, for example.
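For illustration, a hypothetical helper that shells out to lsb_release
rather than parsing files. The -idrc flags and the tab-separated output
format are assumptions about the tool's common behavior:

    import subprocess

    def lsb_info():
        out = subprocess.check_output(['lsb_release', '-idrc'],
                                      universal_newlines=True)
        # typical output: 'Distributor ID:\tUbuntu' etc., one field per line
        return dict(line.split(':\t', 1) for line in out.strip().splitlines())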
Sage Weil [Mon, 24 Nov 2014 02:50:51 +0000 (18:50 -0800)]
crush/CrushWrapper: fix create_or_move_item when name exists but item does not
We were using item_exists(), which simply checks if we have a name defined
for the item. Instead, use _search_item_exists(), which looks for an
instance of the item somewhere in the hierarchy. This matches what
get_item_weightf() is doing, which ensures we get a non-negative weight
that converts properly to floating point.
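The distinction, reduced to a toy, with plain dicts standing in for the
crush map's name table and bucket hierarchy:

    def item_exists(name_map, item_id):
        # old check: do we merely have a name defined for this id?
        return item_id in name_map

    def search_item_exists(buckets, item_id):
        # new check: is the id actually placed somewhere in the hierarchy?
        return any(item_id in b for b in buckets.values())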
Backport: giant, firefly
Fixes: #9998
Reported-by: Pawel Sadowski <ceph@sadziu.pl>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 9902383c690dca9ed5ba667800413daa8332157e)
Sage Weil [Sat, 22 Nov 2014 01:47:56 +0000 (17:47 -0800)]
crush/builder: prevent bucket weight underflow on item removal
It is possible to set a bucket weight that is not the sum of the item
weights if you manually modify/build the CRUSH map. Protect against any
underflow on the bucket weight when removing items.
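The guard amounts to a saturating subtraction, since CRUSH bucket weights
are unsigned fixed-point values. A sketch of the idea, not the builder.c
code:

    def remove_item_weight(bucket_weight, item_weight):
        # clamp at zero: a hand-built map may declare a bucket weight
        # smaller than the sum of its items, and unsigned math would wrap
        return bucket_weight - item_weight if bucket_weight >= item_weight else 0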
Dan Mick [Wed, 10 Dec 2014 21:19:16 +0000 (13:19 -0800)]
rados.py: remove Rados.__del__(); it just causes problems
Recent versions of Python contain a change to thread shutdown that
causes ceph to hang on exit; see http://bugs.python.org/issue21963.
As it turns out, this is relatively easy to avoid by not spawning
threads on exit, as Rados.__del__() will certainly do by calling
shutdown(); I suspect, but haven't proven, that the problem is
that shutdown() tries to start() a threading.Thread() that never
makes it all the way back to signal start().
Also add a PendingReleaseNote and extra doc comments to clarify.
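With __del__ gone, callers should tear down explicitly, along these lines
(the conffile path is an assumption; python-rados must be installed):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        print(cluster.get_fsid())
    finally:
        cluster.shutdown()   # explicit teardown; __del__ no longer does this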
Ken Dreyer [Tue, 9 Dec 2014 21:52:19 +0000 (14:52 -0700)]
Revert "ceph.spec.: add epoch"
If ICE ships 0.80.8, then it will be newer than what RHCEPH ships
(0.80.7), and users won't be able to seamlessly upgrade via Yum.
We have three options:
A) Revert the "Epoch: 1" change on the Firefly branch.
B) Revert the "Epoch: 1" change in the ICE packages.
C) Bump the Epoch to "2" in Red Hat's packages.
This commit does Option A.
Option B may or may not be feasible - it would require a "downstream"
change in ICE, and we haven't done that sort of thing before.
Due to the RHEL release schedule, Option C is not available to us at
this point.
Loic Dachary [Fri, 14 Nov 2014 00:16:10 +0000 (01:16 +0100)]
common: do not omit shard when ghobject NO_GEN is set
Do not silence the display of shard_id when generation is NO_GEN.
The JSON representation of erasure-coded objects used by
ceph_objectstore_tool needs the shard_id to find the file containing the
chunk. Minimal testing is added to ceph_objectstore_tool.py.
Samuel Just [Tue, 23 Sep 2014 22:52:08 +0000 (15:52 -0700)]
ReplicatedPG: don't move on to the next snap immediately
If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time. Instead, requeue.
Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659)
Sage Weil [Tue, 23 Sep 2014 23:21:33 +0000 (16:21 -0700)]
osd: initialize purged_snap on backfill start; restart backfill if change
If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps. As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held. This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.
Resolve this by initializing purged_snaps when we finish backfill. The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps. Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).
If we happen to interrupt backfill, go clean with other OSDs, purge
snaps, and then let this OSD rejoin, we will either restart backfill
(non-contiguous log) or the log will include the result of the snap trim
(the events that remove the trimmed snap).
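Schematically, the fix is a one-time copy when backfill completes. The
names here are a paraphrase of the description above, not the actual PG
code:

    def on_backfill_complete(target_info, primary_info):
        # adopt the primary's purged set: backfill has already removed any
        # stray clones, so the object set agrees with purged_snaps
        target_info['purged_snaps'] = set(primary_info['purged_snaps'])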
Jason Dillaman [Tue, 18 Nov 2014 02:49:26 +0000 (21:49 -0500)]
librbd: protect list_children from invalid child pool IoCtxs
While listing child images, don't ignore error codes returned
from librados when creating an IoCtx. This will prevent seg
faults from occurring when an invalid IoCtx is used.
Fixes: #10123
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 0d350b6817d7905908a4e432cd359ca1d36bab50)
Loic Dachary [Thu, 9 Oct 2014 16:52:17 +0000 (18:52 +0200)]
ceph-disk: run partprobe after zap
Not running partprobe after zapping a device can lead to the following:
* ceph-disk prepare /dev/loop2
* links are created in /dev/disk/by-partuuid
* ceph-disk zap /dev/loop2
* links are not removed from /dev/disk/by-partuuid
* ceph-disk prepare /dev/loop2
* some links are not created in /dev/disk/by-partuuid
This is assuming there is a bug in the way udev events are handled by
the operating system.
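The shape of the change, as a hedged sketch (ceph-disk drives sgdisk for
zapping; the exact flags here are illustrative):

    import subprocess

    def zap(dev):
        subprocess.check_call(['sgdisk', '--zap-all', '--', dev])
        # force the kernel/udev to reread the (now empty) partition table
        # so stale /dev/disk/by-partuuid links are removed
        subprocess.check_call(['partprobe', dev])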
Loic Dachary [Fri, 10 Oct 2014 08:23:34 +0000 (10:23 +0200)]
ceph-disk: encapsulate partprobe / partx calls
Add the update_partition function to reduce code duplication.
The action is made an argument even though it is always -a for now,
because it will be -d when deleting a partition.
Also use the update_partition function in prepare_journal_dev.
Introduce ceph_erasure_code_non_regression to check and compare how an
erasure code plugin encodes and decodes content with a given set of
parameters. For instance, it will create an encoded object (--create)
and store it into a directory
along with the chunks, one chunk per file. The directory name is derived
from the parameters. The content of the object is a random pattern of 31
bytes repeated to fill the object size specified with --stripe-width.
The check function (--check) reads the object back from the file,
encodes it and compares the result with the content of the chunks read
from the files. It also attempts to recover from one or two erasures.
Chunks encoded by a given version of Ceph are expected to be encoded
exactly in the same way by all Ceph versions going forward.
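The core of the check could be restated as follows (a toy paraphrase,
not the tool's code; the chunk file layout is hypothetical):

    import os

    def check_chunks(encode, directory, content):
        # re-encode the stored object and require byte-identical chunks
        for i, chunk in enumerate(encode(content)):
            with open(os.path.join(directory, 'chunk%d' % i), 'rb') as f:
                assert f.read() == chunk, 'encoding of chunk %d changed' % i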
Sage Weil [Mon, 20 Oct 2014 20:55:33 +0000 (13:55 -0700)]
osd: discard rank > 0 ops on erasure pools
Erasure pools do not support read from replica, so we should drop
any rank > 0 requests.
This fixes a bug where an erasure pool maps to [1,2,3], temporarily maps
to [-1,2,3], sends a request to osd.2, and then remaps back to [1,2,3].
Because the 0 shard never appears on osd.2, the request sits in the
waiting_for_pg map indefinitely and causes slow request warnings.
This problem does not come up on replicated pools because all instances of
the PG are created equal.
Fix by only considering role == 0 for erasure pools as a correct mapping.
Fixes: #9835
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Thu, 13 Nov 2014 01:04:35 +0000 (17:04 -0800)]
osd/OSDMap: add osd_is_valid_op_target()
Helper to check whether an osd is a valid op target for a pg. This
assumes that for EC we always send ops to the primary, while for
replicated we may target any replica.
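Its semantics, restated as a toy predicate (not the OSDMap signature):

    def is_valid_op_target(pool_is_erasure, acting, osd):
        if pool_is_erasure:
            # EC ops must go to the primary: the 0 shard lives only there
            return bool(acting) and acting[0] == osd
        return osd in acting   # replicated pools may serve from any replica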
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds
The local filesystem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.
Sage Weil [Sun, 25 May 2014 15:38:38 +0000 (08:38 -0700)]
osd: fix map advance limit to handle map gaps
The recent change in cf25bdf6b0090379903981fe8cee5ea75efd7ba0 would stop
advancing after some number of epochs, but did not take into consideration
the possibility that there are missing maps. In that case, it is impossible
to advance past the gap.
Fix this by increasing the max epoch as we go so that we can always get
beyond the gap.
John Spray [Fri, 7 Nov 2014 11:34:43 +0000 (11:34 +0000)]
tools: fix MDS journal import
Previously it only worked on fresh filesystems which
hadn't been trimmed yet, and resulted in an invalid
trimmed_pos when expire_pos wasn't on an object
boundary.
Sage Weil [Mon, 15 Sep 2014 22:29:08 +0000 (15:29 -0700)]
ceph-disk: mount xfs with inode64 by default
We did this forever ago with mkcephfs, but ceph-disk didn't. Note that for
modern XFS this option is obsolete, but for older kernels it was not the
default.
Yehuda Sadeh [Thu, 9 Oct 2014 17:20:27 +0000 (10:20 -0700)]
rgw: set length for keystone token validation request
Fixes: #7796
Backport: giant, firefly
Need to set the content length on this request, as the server might not
handle a chunked request (even though we don't send anything).
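The client-side idea, illustrated with Python's standard library; the
endpoint, port, and token path are placeholders, and rgw itself does this
through libcurl:

    import http.client

    conn = http.client.HTTPConnection('keystone.example.com', 35357)
    conn.request('GET', '/v2.0/tokens/TOKEN',
                 body=b'',                          # nothing to send...
                 headers={'Content-Length': '0'})   # ...but say so
                                                    # explicitly, not chunked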
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 3dd4ccad7fe97fc16a3ee4130549b48600bc485c)
Yehuda Sadeh [Tue, 19 Aug 2014 20:15:46 +0000 (13:15 -0700)]
rgw: subuser creation fixes
Fixes: #8587
There were a couple of issues: first, when trying to identify whether a
swift user exists, we weren't using the correct swift id. The second
problem is that we relied on the gen_access flag in the swift case, where
it doesn't really apply.
Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid
m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.
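The comparison, mimicked in Python with an explicit mask (C++ does this
with a cast; Python ints are unbounded, so the mask makes the truncation
visible):

    MASK16 = 0xffff

    def tid_matches(client_tid, flushing_cap_tid):
        # downcast the 64-bit tid to 16 bits before comparing, matching
        # the direction of the cast in the fix
        return (client_tid & MASK16) == flushing_cap_tid

    assert tid_matches((1 << 16) | 7, 7)   # wrapped tid still matches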