Sage Weil [Wed, 5 Mar 2014 23:58:52 +0000 (15:58 -0800)]
mon/OSDMonitor: fix pool deletion checks, races
Unify the pool deletion safety checks into a single set of functions.
Make sure we check the committed state and error out if there is a problem.
Also check the pending state, if any, and delay+retry if there is a
problem there.
This ensures that we correctly verify that a pool is not in use when it
is deleted (by another tier or by cephfs). These checks are also now
applied to librados calls.
Fixes: #7590
Signed-off-by: Sage Weil <sage@inktank.com>
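The two-stage check described in this commit (error out on committed state, delay and retry on pending state) can be sketched in Python as follows. This is an illustration only: `PoolState`, `Verdict`, `in_use`, and `check_pool_delete` are hypothetical names, not the monitor's actual functions, and the "in use" conditions are reduced to the two the message names (tiering and cephfs).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set


class Verdict(Enum):
    OK = "ok"        # safe to delete
    ERROR = "error"  # committed state forbids the deletion
    RETRY = "retry"  # a pending change forbids it: wait for commit, then retry


@dataclass
class PoolState:
    tier_of: Optional[str] = None                 # base pool this pool is a tier of
    tiers: Set[str] = field(default_factory=set)  # pools layered on top of this one
    used_by_fs: bool = False                      # referenced by the file system


def in_use(pools: Dict[str, PoolState], name: str) -> bool:
    """Single shared predicate: is the pool a tier, a base tier, or used by cephfs?"""
    pool = pools.get(name)
    if pool is None:
        return False
    return pool.used_by_fs or pool.tier_of is not None or bool(pool.tiers)


def check_pool_delete(committed: Dict[str, PoolState],
                      pending: Optional[Dict[str, PoolState]],
                      name: str) -> Verdict:
    # Committed state is authoritative: a conflict here is a hard error.
    if in_use(committed, name):
        return Verdict.ERROR
    # A conflict that exists only in the pending state may still change,
    # so the caller should wait for the proposal to commit and try again.
    if pending is not None and in_use(pending, name):
        return Verdict.RETRY
    return Verdict.OK
```

Keeping one predicate for both states is the point of the sketch: the committed and pending checks cannot drift apart.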
Sage Weil [Wed, 5 Mar 2014 18:58:37 +0000 (10:58 -0800)]
mon: warn when pool nears target max objects/bytes
The cache pools will throttle when they reach the target max size, so it
is important to make the administrator aware when they approach that point.
Unfortunately it is not particularly easy to efficiently keep track of
which PGs have hit their limit and use that for reporting. However, it
is easy to raise a flag when we start to approach the target for the
entire pool, and that sort of early warning is arguably more useful
anyway.
Trigger the warning based on the target full ratio: not when we hit the
target, but when we are 2/3 of the way between it and completely full.
Implements: #7442
Signed-off-by: Sage Weil <sage@inktank.com>
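One reading of the "2/3 between it and completely full" rule, as a Python sketch. The helper names are hypothetical, and treating "completely full" as a ratio of 1.0 of the pool's target max is an assumption made for illustration, not a quote of the monitor code.

```python
def near_full_warn_ratio(target_full_ratio: float) -> float:
    """Ratio at which the warning would fire: two thirds of the way from
    the target full ratio to completely full (assumed to be 1.0)."""
    if not 0.0 <= target_full_ratio <= 1.0:
        raise ValueError("target_full_ratio must be within [0, 1]")
    return target_full_ratio + (1.0 - target_full_ratio) * 2.0 / 3.0


def should_warn(used: int, target_max: int, target_full_ratio: float) -> bool:
    """True once usage (objects or bytes) crosses the warning ratio."""
    if target_max <= 0:  # no target configured, so nothing to warn about
        return False
    return used / target_max >= near_full_warn_ratio(target_full_ratio)
```

With a target full ratio of 0.8, the warning ratio under this reading is 0.8 + 0.2 * 2/3, roughly 0.933 of the target max.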
Sage Weil [Wed, 5 Mar 2014 18:44:41 +0000 (10:44 -0800)]
mon/PGMap: return empty stats if pool is not in sum
Greg was right!
When a pool is created, the PGs are not added to the PGMap until the *next*
proposal. Weaken the assert here and return empty stats for non-existent
(new) pools so that a pool create + tier add sequence does not crash.
cursor.omap_offset indicates the most recently recovered key; we continue
filling in at the smallest key k | k > cursor.omap_offset. If the loop
as written terminates due to !(left > 0), iter points at the next key to
copy, rather than the last key copied, resulting in the next copy
operation skipping that key.
Now, iter, if valid, must point to the last key copied once the loop has
completed since we check left <= 0 prior to advancing iter. We can
therefore use it to fill in cursor.omap_offset.
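The cursor invariant can be sketched in Python as follows. `copy_omap_chunk` and `copy_all` are hypothetical stand-ins for the real loop, and a plain dict replaces the omap iterator. The budget is checked before moving on to the next key, so the cursor always names the last key actually copied.

```python
from typing import Dict, List, Optional, Tuple


def copy_omap_chunk(omap: Dict[str, bytes],
                    cursor: Optional[str],
                    budget: int) -> Tuple[List[str], Optional[str], bool]:
    """Copy keys strictly greater than `cursor` until `budget` bytes are spent.

    Returns (keys copied, new cursor, done)."""
    remaining = sorted(k for k in omap if cursor is None or k > cursor)
    copied: List[str] = []
    left = budget
    last = cursor
    for key in remaining:
        copied.append(key)
        last = key                      # cursor = last key copied, not the next one
        left -= len(key) + len(omap[key])
        if left <= 0:                   # check the budget before advancing
            break
    done = len(copied) == len(remaining)
    return copied, last, done


def copy_all(omap: Dict[str, bytes], budget: int) -> List[str]:
    """Drive the chunked copy to completion, resuming from the cursor each time."""
    cursor: Optional[str] = None
    out: List[str] = []
    done = False
    while not done:
        chunk, cursor, done = copy_omap_chunk(omap, cursor, budget)
        out.extend(chunk)
    return out
```

If the cursor were instead set to the key the iterator points at after the loop exits on the budget check, the next call's `k > cursor` filter would skip that key, which is the bug described above.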
Samuel Just [Tue, 4 Mar 2014 23:21:09 +0000 (15:21 -0800)]
ReplicatedPG::fill_in_copy_get: fix early return bug
This is not a leak: we are in an else block where cb must
be NULL. The fix as introduced did not include braces on
the if, causing the method to return unconditionally.
Fixes: #7604
Introduced in: 500206d809f0cd85cd99e4f0ec164bbf74f92c28
Reviewed-by: David Zafman <david.zafman@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Mon, 3 Mar 2014 23:33:51 +0000 (15:33 -0800)]
ECBackend,ReplicatedPG: delete temp if we didn't get the transaction
We always send the transaction for operations on temp objects,
but if we didn't get the final transaction on the actual object,
we might end up failing to remove the temp object. Thus, if
we get a sub op and don't have the transaction, just remove the
named temp objects.
Fixes: #7447
Signed-off-by: Samuel Just <sam.just@inktank.com>
This is a friendlier interface for setting up a cache tier with some
reasonable defaults (defined via config options). This will simplify
the user experience and documentation.
Sage Weil [Mon, 3 Mar 2014 19:32:48 +0000 (11:32 -0800)]
mon/OSDMonitor: handle 'osd tier add ...' race/corner case
If you have two racing requests to add two different pools as a tier, the
committed checks will pass but the proposals will conflict. Recheck the
pending pools for the same conditions and wait for a commit if they
occur.
Reported-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 4 Mar 2014 05:11:17 +0000 (21:11 -0800)]
OSDMonitor: do not add non-empty tier pool unless forced
In general, users should not use non-empty pools as new tiers or else
things can behave strangely:
- if the data sets are unrelated, behavior will be... strange.
- if the cache pool is not "new" and does not do the OMAP flag, the OSD
will not know not to flush omap objects to an EC base tier
- probably other random stuff I'm forgetting
Allow a user to shoot themselves in the foot with --force-nonempty.
Implements: #7457
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Mon, 3 Mar 2014 01:31:38 +0000 (17:31 -0800)]
TestPGLog: tests for proc_replica_log/merge_log equivalence
We need the merge_log and proc_replica_log paths to result in the
same missing set. This patch adds some machinery for specifying
a log merge scenario and comparing both paths to the same correct
result. This machinery also makes it a bit easier to read and add
new tests.
Samuel Just [Sun, 2 Mar 2014 21:42:16 +0000 (13:42 -0800)]
PGLog::proc_replica_log: _merge_divergent_entries based on truncated olog
We can't merge using the primary's log since we haven't decided whether
to send them a complete log yet. Thus, merge based on the truncated olog
rather than the primary's log. This is a consequence of the division
between trimming divergent entries in peering/unfound search and sending
a complete log to actual members of the actingbackfill set in activate().
_merge_divergent_entries on the truncated log and add_next_event() on the
newer entries result in the same missing/log regardless of the order in
which they are performed.
In the first case, we should end up with foo removed from missing
at the end. In the second, we need foo added to missing at 1,1.
It's far simpler to present all of the divergent entries for a single
object at once.
Ilya Dryomov [Fri, 21 Feb 2014 14:34:14 +0000 (16:34 +0200)]
librbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
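As a rough Python sketch of the prefixing described above: the tuple-based op list and the name `build_write_ops` are hypothetical and only illustrate the ordering and the `1 << order` arithmetic. They are not librbd's API.

```python
from typing import List, Tuple


def build_write_ops(order: int, offset: int, data: bytes) -> List[Tuple]:
    """Return the ops for one rbd object write, hint first.

    The hint carries an expected write size equal to the object size."""
    object_size = 1 << order
    if offset < 0 or offset + len(data) > object_size:
        raise ValueError("write does not fit inside the object")
    return [
        ("setallochint", object_size),  # advisory; creates the object if missing
        ("write", offset, data),
    ]
```

For an image with order 22, the hint value is 1 << 22 = 4194304 bytes (4 MiB).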
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
FileStore: introduce XfsFileStoreBackend class
Introduce XfsFileStoreBackend class, currently the only filestore
backend implementing SETALLOCHINT op. This commit adds a build-time
dependency on libxfs as xfs-specific ioctl (XFS_IOC_FSSETXATTR /
XFS_XFLAG_EXTSIZE) is used to implement the new set_alloc_hint()
method.
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
FileStore: refactor FS detection checks a bit
Refactor FS detection checks in FileStore::_detect_fs() so that they
look the same as the ones in FileStore::mkfs(). This is in preparation
for adding XfsFileStoreBackend class.
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
osd: add SETALLOCHINT operation
This is primarily for librbd/krbd's benefit and is supposed to combat
fragmentation:
"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks. We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."
SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
Babu Shanmugam [Wed, 19 Feb 2014 12:43:53 +0000 (12:43 +0000)]
The following changes were made:
1. Increased the String length for distro, version and os_desc columns in osds_info table
2. Corrected version information extraction in client/ceph-brag
3. Removed the version_id json entry when version list returned for UUID
4. Updated the README to reflect point 3
osd: OSD: limit the value of 'size' and 'count' on 'osd bench'
Otherwise, a high enough 'count' value will trigger all sorts of timeouts
on the OSD; a low enough 'size' value will have the same effect for a
high enough value of 'count' (even the default value may have ill effects
on the osd's behaviour). Limiting these values does not fix how 'osd bench'
should behave, but keeps someone from inadvertently borking an OSD.
Four options have been added and the user may adjust them if he so
desires to play with the OSD's fate:
- 'osd_bench_small_size_max_iops' [default: 100] defines the amount of
expected IOPS for a small block size (i.e., <1MB).
- 'osd_bench_large_size_max_throughput' [default: 100<<20] defines
the expected throughput in B/s. We assume 100MB/s.
- 'osd_bench_max_block_size' [default: 64 << 20] caps the block size
allowed. We have defined 64 MB.
- 'osd_bench_duration' [default: 30] caps the expected duration. This
value is used when calculating the maximum allowed 'count', and is
not enforced as the maximum duration of the operation. If other IO
is ongoing, or 'osd bench' is somehow slowed down, 'osd bench' may
go over this duration. Adjusting this option does however allow the
user to specify higher 'count' values for (e.g.) a small block size,
as the operation is then assumed to run over a longer
time span.
These options attempt to avoid combinations of dangerous parameters. For
instance, we limit the block size to 64 MB (by default) so that there is
no temptation to specify a large enough block size, along with a very small
'count', such that the end result is similar to specifying a big count with
a sane block size.
Fixes: 7248
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
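A minimal Python sketch of how the four options could combine into a cap on 'count'. The option names and defaults are the ones listed above; the function name and the exact formulas are assumptions for illustration (small blocks are bounded by IOPS, large blocks by throughput), not a copy of the OSD code.

```python
MIB = 1 << 20

DEFAULTS = {
    "osd_bench_small_size_max_iops": 100,
    "osd_bench_large_size_max_throughput": 100 << 20,  # bytes per second
    "osd_bench_max_block_size": 64 << 20,              # bytes
    "osd_bench_duration": 30,                          # seconds
}


def max_bench_count(bsize: int, conf: dict = DEFAULTS) -> int:
    """Largest 'count' (total bytes) accepted for a given block size."""
    if bsize <= 0:
        raise ValueError("block size must be positive")
    if bsize > conf["osd_bench_max_block_size"]:
        raise ValueError("block size exceeds osd_bench_max_block_size")
    duration = conf["osd_bench_duration"]
    if bsize < MIB:
        # Small blocks: bounded by the number of IOs expected in `duration`.
        return bsize * conf["osd_bench_small_size_max_iops"] * duration
    # Large blocks: bounded by the bytes expected in `duration`.
    return conf["osd_bench_large_size_max_throughput"] * duration
```

Under these assumptions, raising 'osd_bench_duration' scales both caps linearly, which is why it permits higher 'count' values for small block sizes.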
Loic Dachary [Fri, 28 Feb 2014 12:57:20 +0000 (13:57 +0100)]
osd: do not attempt to read past the object size
When reading from a replicated pool, trying to read more than the object
size results in a short read that does not go beyond the object size. In
erasure coded pools, objects are padded and the read will return more
bytes than the object actually contains.
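The intended behaviour, clamping a read to the logical object size so that padding is never returned, can be sketched in Python as follows. `clamp_read` is a hypothetical helper for illustration, not the OSD's function.

```python
from typing import Tuple


def clamp_read(offset: int, length: int, object_size: int) -> Tuple[int, int]:
    """Trim a read request so it never extends past the object size.

    Returns the (offset, length) that should actually be read."""
    if offset < 0 or length < 0 or object_size < 0:
        raise ValueError("offset, length and object_size must be non-negative")
    if offset >= object_size:
        return offset, 0  # nothing to read at or beyond the end of the object
    return offset, min(length, object_size - offset)
```

Applied before the backend read, this gives an erasure coded pool the same short-read result a replicated pool produces.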
Samuel Just [Sat, 1 Mar 2014 22:33:11 +0000 (14:33 -0800)]
osd_types,PG: trim mod_desc for log entries to min size
In the event that mod_desc.bl contains pointers into a large
message buffer, we'd otherwise end up keeping around the entire
MOSDECSubOpWrite which created each log entry.
Fixes: #7539
Signed-off-by: Samuel Just <sam.just@inktank.com>
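The same effect can be shown by analogy in Python, where a small `memoryview` slice keeps its whole backing buffer alive until the bytes are copied out. This is only an analogy to the bufferlist behaviour described above; `trim_to_min_size` is a hypothetical name.

```python
def trim_to_min_size(view: memoryview) -> bytes:
    """Copy just the viewed bytes so the large backing buffer can be freed."""
    return bytes(view)


# A large "message" buffer of which only a few bytes matter to the log entry.
message = bytearray(4 * 1024 * 1024)
message[:5] = b"hello"

desc = memoryview(message)[:5]    # 5 bytes, but it references the whole 4 MiB buffer
trimmed = trim_to_min_size(desc)  # independent 5-byte copy, safe to keep long-term
```

Keeping `trimmed` rather than `desc` in a long-lived structure means the 4 MiB buffer no longer has to outlive the message.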