ceph-volume: Using --readonly for {vg|pv|lv}s commands
The current code detects {vg|pv|lv}s by running the usual {vg|pv|lv}s commands.
Those calls expect lvmetad to be aware of their actual state.
This works pretty well in most cases.
When ceph-volume is run from a container, this code works inside the
container itself but not on the host.
On the host side, the {vg|pv|lv}s commands report nothing, making
ceph-volume report "No valid Ceph devices found".
The root cause is that lvmetad does not receive the udev notification, so
the {vg|pv|lv}s commands report the 'known' state instead of the 'real' state.
This is a major issue: there exist cases, and possibly races, where
ceph-volume reports "No valid Ceph devices found" while the disks
actually have Ceph data & metadata on them.
This could be interpreted as the disks being free/available while they
are not, misleading users or configuration tools trying to understand the
current state of a node.
In July 2015, as per https://www.redhat.com/archives/lvm-devel/2015-July/msg00086.html,
a new option called "--readonly" was added to lvm2.
One of its most interesting aspects is: "disable using lvmetad so VGs are read from disk".
In our case that feature is really valuable: every time ceph-volume calls
a {vg|pv|lv}s command, it will read the metadata from the disks instead of
relying on the lvmetad state.
This patch changes all the {vg|pv|lv}s calls to use --readonly.
It solves the bug described here and does not affect the traditional use case.
The main benefit of this patch is avoiding a false report of a disk not having metadata.
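As a hedged illustration (the function is a simplified sketch, not ceph-volume's actual code), the change amounts to appending --readonly to every LVM query command line:

```python
def build_vgs_command(readonly=True):
    # Simplified sketch: ceph-volume builds and runs vgs/pvs/lvs command
    # lines; with --readonly, lvm2 bypasses lvmetad and reads VG metadata
    # directly from disk, so stale daemon state cannot hide Ceph devices.
    cmd = ['vgs', '--noheadings']
    if readonly:
        cmd.append('--readonly')
    return cmd
```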
Currently the create code picks "ceph-$cluster_fsid" as the primary
vg_name and generates a new name if that one already exists.
If this code is run N times in parallel, the script tries to create the
vg named "ceph-$cluster_fsid" N times and fails to create the N osds
successfully.
Creating vgs with names like "ceph-$uuid4" lets our scripts run without
any problems.
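A minimal sketch of the naming scheme (hypothetical helper name, not the actual create code):

```python
import uuid

def new_vg_name():
    # Each invocation gets a unique name, so N parallel "create" runs never
    # race on a single "ceph-$cluster_fsid" volume group name.
    return 'ceph-%s' % uuid.uuid4()
```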
David Zafman [Wed, 11 Apr 2018 15:43:56 +0000 (08:43 -0700)]
test: Luminous backport specific changes
osd-scrub-repair.sh:
Remove legacy_snaps and redirect_target to keep test more in sync with master
We don't have omap_digest bluestore handling
Additional cases of omap_digest_mismatch_oi changes
omap_digest still set
ROBJ10 output still applies
osd-scrub-snaps.sh:
Add head_exists in snapset
Add snapset for object with head_mismatch
J. Eric Ivancich [Fri, 13 Apr 2018 22:27:25 +0000 (18:27 -0400)]
osd: remove cost from mclock op queues; cost not handled well in dmclock library
The current version of the dmclock library does not handle operation
cost well. Therefore cost should not be passed into the library when
enqueuing operations; instead 0 should be passed in.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Sage Weil [Sat, 28 Oct 2017 20:37:03 +0000 (15:37 -0500)]
include/interval_set: tolerate maps that invalidate iterator on change
These changes were picked out of the diff between the original
btree_interval_set.h and interval_set.h (sadly I had rolled them into the
initial commit, so it was tedious to identify them).
interval_set: optimize subset_of with sequential search
Optimize subset_of to use sequential search when it
performs better than the lower_bound method, for set
size ratios smaller than 10. This is analogous to
intersection_of behavior since commit 825470fcf919.
The subset_of method can be used in some cases as a
less-expensive alternative to the intersection_of
method, since subset_of can return early if any element
of the smaller set is not contained in the larger set,
and intersection_of has the added burden of storing
the intersecting elements.
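The heuristic can be sketched in Python over sorted, disjoint (start, end) interval lists (a simplified stand-in for the C++ interval_set; the 10x threshold follows the commit):

```python
import bisect

def subset_of(small, large):
    # Intervals are sorted, disjoint (start, end) pairs with exclusive ends.
    if len(large) < 10 * len(small):
        # Sequential scan: cheaper when the sets are of comparable size.
        j = 0
        for s, e in small:
            while j < len(large) and large[j][1] < e:
                j += 1
            if j == len(large) or large[j][0] > s or large[j][1] < e:
                return False  # early exit: this element is not contained
    else:
        # lower_bound-style binary search into the much larger set.
        starts = [s for s, _ in large]
        for s, e in small:
            i = bisect.bisect_right(starts, s) - 1
            if i < 0 or large[i][1] < e:
                return False
    return True
```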
Zac Medico [Thu, 31 Aug 2017 03:59:32 +0000 (20:59 -0700)]
interval_set: optimize intersection_of
Iterate over all elements of the smaller set, and use find_inc to
locate elements from the larger set in logarithmic time. This greatly
improves performance when one set is much larger than the other:
[ASCII plot omitted: performance ratio (y-axis, 0.8 to 2) plotted against
set size ratio (x-axis, 0 to 1). The ratio starts near 2 for very small
set size ratios, falls steadily, crosses 1 at an intermediate ratio, and
flattens just below 1 as the sets approach equal size.]
The above plot compares performance of the new intersection_size_asym
function to the existing intersection_of function. The performance of
intersection_size_asym gets worse as the set size ratio approaches 1.
For set size ratios where the performance ratio is greater than 1, the
performance of intersection_size_asym is superior. Therefore, this
patch only uses intersection_size_asym when the set size ratio is less
than or equal to 0.1 (code uses the reciprocal which is 10).
The plot was generated using benchmark results produced by the
following program:
#include <iostream>
#include <sys/timeb.h>             // ftime()
#include "include/interval_set.h"  // Ceph's interval_set

int main()
{
  const int interval_count = 100000;
  const int interval_distance = 4;
  const int interval_size = 2;
  const int sample_count = 8;
  const int max_offset = interval_count * interval_distance;
  interval_set<int> a, b, intersection;

  for (int i = 0; i < max_offset; i += interval_distance) {
    a.insert(i, interval_size);
  }

  for (int m = 1; m < 100; m++) {
    float ratio = 1 / float(m);
    for (int i = 0; i < max_offset; i += interval_distance * m) {
      b.insert(i, interval_size);
    }
    struct timeb start, end;
    int ms = 0;
    for (int i = 0; i < sample_count; i++) {
      ftime(&start);
      intersection.intersection_of(a, b);
      ftime(&end);
      ms += (int) (1000.0 * (end.time - start.time)
                   + (end.millitm - start.millitm));
      intersection.clear();
    }
    b.clear();
    std::cout << ratio << "\t" << ms << std::endl;
  }
}
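The asymmetric strategy itself can be sketched in Python over sorted, disjoint (start, end) interval lists (a simplified stand-in for the C++ implementation; the function name follows the commit):

```python
import bisect

def intersection_size_asym(small, large):
    # Iterate the smaller set; for each interval, locate the first possibly
    # overlapping interval in the larger set in O(log n) time (the role
    # find_inc plays in the C++ code), then walk forward while overlaps
    # remain.
    out = []
    starts = [s for s, _ in large]
    for s, e in small:
        i = max(bisect.bisect_right(starts, s) - 1, 0)
        while i < len(large) and large[i][0] < e:
            lo, hi = max(s, large[i][0]), min(e, large[i][1])
            if lo < hi:
                out.append((lo, hi))
            i += 1
    return out
```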
rgw: rgw-rados, rgw-admin add an option to recalculate user stats
Adds a method in rgw-rados to reset user stats by calling the earlier
implemented cls user reset stats.
In rgw-admin we add an option called --reset-stats that invokes this method.
This is an implementation of reset user stats that recalculates the user
stats purely based on the values of the bucket entries in the user.buckets
object. This is helpful in cases where the user stats have been improperly
set, e.g. after manual resharding.
Luminous only additions:
use global scope for encode in cls_user_ops
rgw: fix multisite synchronization failure when a write and a delete happen at the same time
The case: objA is written first; then objA is written and deleted at the
same time, with the second write arriving earlier than the delete.
When deleting objA, the delete op uses the stat information from the first
write of objA, so it should delete the first write's data. However, by the
time the delete runs, objA's header belongs to the second write, so the
osd's "do_xattr_cmp_str" detects the idtag change and returns -125
(ECANCELED). After the rgw client receives -125, it still performs
"complete_del" and calls cls_obj_complete_del to write the bilog.
"complete_op" in the cls_rgw module writes the bilog entry with the second
write's mtime and the second ".ver.epoch". As a result, the delete op is
appended after the second write in the bilog. The slave rgw then merges
the write op and the delete op into a delete op and deletes the data,
while the master rgw completes the second write and cancels the delete.
This logic is problematic: the bilog record for the delete op should use a
cancel op, and squash_map should skip cancel ops.
Robin H. Johnson [Mon, 12 Mar 2018 21:38:57 +0000 (14:38 -0700)]
cls/rgw: usage_iterate_range truncated should never be NULL
Ensuring truncated is non-NULL improves code clarity.
Suggested-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
(cherry picked from commit f9eb79bbbf30f24135165a2a33aa2a80916b4005)
Robin H. Johnson [Mon, 12 Mar 2018 21:42:00 +0000 (14:42 -0700)]
cls/rgw: usage_iterate_range end conditions
If both end & start conditions are passed, we need to signal that there
are no more results when returning because the end condition is
satisfied.
Suggested-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Robin H. Johnson <robin.johnson@dreamhost.com>
(cherry picked from commit 163c49fd1da8e819a0ff8c7c7aaa550b01b8012a)
Greg Farnum [Tue, 6 Mar 2018 00:40:41 +0000 (16:40 -0800)]
cls/rgw: make usage_iterate_range()'s "truncated" parameter trustworthy
Set it to false whenever we identify that we've reached the end of our
range, even if the underlying OSD op says it could have given us more
results. Then rely on that instead of weird iter-based logic to tell
our client that it doesn't need to do more work.
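A hedged Python sketch of the fixed semantics (entry layout and names are hypothetical illustrations, not the cls/rgw code):

```python
def iterate_range(entries, start, end, max_entries):
    # entries: (timestamp, record) pairs in ascending timestamp order.
    out, truncated = [], False
    for t, rec in entries:
        if t < start:
            continue
        if t >= end:
            truncated = False  # end of range reached: never signal more work
            break
        if len(out) == max_entries:
            truncated = True   # more in-range entries remain
            break
        out.append(rec)
    return out, truncated
```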
John Spray [Mon, 9 Apr 2018 12:22:29 +0000 (13:22 +0100)]
osd: make PG "deep" state name consistent
Previously the state-to-string path called
deep scrubbing "deep", but the string-to-state
path called it "deep_scrub". Confusing! Also
meant that the "pg ls" help output was wrong
because it uses the state-to-string path.