Sage Weil [Fri, 7 Mar 2014 22:02:26 +0000 (14:02 -0800)]
mon/PGMap: send pg create messages to primary, not acting[0]
For erasure pools, these may not match.
In the case of #7652, this caused pg_create messages to be sent
indefinitely. register_pg() added it to the list for acting_primary, and
when we got the (non-creating) pg stat update we removed it from the list
for acting[0].
Fixes: #7652 Signed-off-by: Sage Weil <sage@inktank.com>
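The mismatch described in this commit can be sketched as follows. This is a minimal Python illustration, not the monitor's code: `CreatingPGs`, its methods, the PG id, and the OSD ids are hypothetical names chosen for the example.

```python
from collections import defaultdict


class CreatingPGs:
    """Toy model of the per-OSD lists of PGs that still need a create message."""

    def __init__(self):
        self.by_osd = defaultdict(set)

    def register(self, pgid, acting, acting_primary):
        # Registration keys the entry on the acting primary.
        self.by_osd[acting_primary].add(pgid)

    def unregister_buggy(self, pgid, acting, acting_primary):
        # Old behavior: removal keyed on acting[0], which may differ.
        self.by_osd[acting[0]].discard(pgid)

    def unregister_fixed(self, pgid, acting, acting_primary):
        # Fixed behavior: removal uses the same key as registration.
        self.by_osd[acting_primary].discard(pgid)

    def pending(self):
        # Only OSDs that still have PGs waiting for a create message.
        return {osd: pgs for osd, pgs in self.by_osd.items() if pgs}


# Hypothetical erasure-coded PG whose primary is not acting[0].
acting, primary = [3, 1, 2], 1

buggy = CreatingPGs()
buggy.register("1.0", acting, primary)
buggy.unregister_buggy("1.0", acting, primary)
print(buggy.pending())  # the entry under osd 1 is never removed

fixed = CreatingPGs()
fixed.register("1.0", acting, primary)
fixed.unregister_fixed("1.0", acting, primary)
print(fixed.pending())  # nothing left to send
```

In the buggy path the entry stays in the list forever, which matches the symptom of create messages being resent indefinitely.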
Sage Weil [Fri, 7 Mar 2014 21:29:03 +0000 (13:29 -0800)]
mon/OSDMonitor: make osdmap feature checks non-racy
The check for OSD features may race with the boot of an OSD that does not
have the necessary features. Check the pending info too, and if there is
a missing feature, return -EAGAIN. In the callers, wait on -EAGAIN.
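A rough sketch of the committed-then-pending check, in Python for brevity. The function name, the feature sets, and the use of -EPERM for the committed case are assumptions made for illustration; only the -EAGAIN result for the pending case comes from the commit message.

```python
import errno


def check_osd_features(required, committed, pending):
    """Return 0 if every OSD has the required features, else a negative errno.

    `committed` and `pending` map OSD id -> set of feature names.
    """
    for features in committed.values():
        if not required <= features:
            return -errno.EPERM  # assumed error code for a committed OSD
    for features in pending.values():
        if not required <= features:
            # An OSD without the feature is booting: the caller should
            # wait for the pending proposal to commit and then retry.
            return -errno.EAGAIN
    return 0


required = {"erasure_code"}
committed = {0: {"erasure_code"}, 1: {"erasure_code"}}
pending = {2: set()}  # hypothetical OSD booting without the feature

print(check_osd_features(required, committed, {}))
print(check_osd_features(required, committed, pending))
```

A caller that sees -EAGAIN would queue the request until the pending map is committed instead of failing it outright.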
qa: workunits/mon/rbd_snaps_ops.sh: ENOTSUP on snap rm from copied pool
'rados cppool' copies the contents but that doesn't make the destination
pool an unmanaged snaps pool. Therefore, we must get an ENOTSUP when
we try to remove an unmanaged snap from a not-unmanaged pool.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: OSDMonitor: don't remove unmanaged snaps from not-unmanaged pools
Although we should allow creating unmanaged snaps on not-unmanaged pools,
as long as those pools don't have any managed snapshots in them, we cannot
allow removal -- because the pool will not have any unmanaged snapshots.
Fixes: 7210 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Fri, 7 Mar 2014 00:12:30 +0000 (16:12 -0800)]
osd: fix agent thread shutdown
We had an old invariant that agent_queue would have at least 1 entry in
it to simplify some other code paths, but it turns out that it is simpler
not to do that.
In particular, this was triggering a failed assertion on shutdown when we
assert that the queue is empty.
Dump offending items on shutdown if they are there, though, to catch any
future bugs.
Fixes: #7637 Signed-off-by: Sage Weil <sage@inktank.com>
Loic Dachary [Thu, 6 Mar 2014 23:07:26 +0000 (00:07 +0100)]
logrotate: copy/paste daemon list from *-all-starter.conf
Each upstart/*-all-starter.conf uses the same script to find the list of
daemons and their ids. Copy it over to the corresponding logrotate.conf
script instead of using a less reliable script based on initctl list
output.
If logrotate fails to run initctl reload on a daemon, it will keep
writing to the rotated log file, even after it is deleted and until it
fills the disk. By using the exact same shell snippet as the upstart
scripts used to start the daemon, all of them will be sent the HUP
signal and reopen the log file that was just rotated.
Samuel Just [Thu, 6 Mar 2014 20:05:07 +0000 (12:05 -0800)]
ReplicatedPG: clean up num_dirty adjustments
Previously, a _delete_head() followed by a recreation on an object in
the same transaction would result in num_dirty being decremented in
_delete_head() without the flag being cleared. make_writeable() would
then see exists and was_dirty and therefore not increment num_dirty,
resulting in a mismatch. Rather than trying to maintain the num_dirty
number in _delete_head(), rollback_to(), and make_writeable(), it seems
simpler to do the adjustment once in make_writeable based on undirty,
ctx->obc->obs.oi, and ctx->new_obs->oi.
Fixes: 7393 Signed-off-by: Samuel Just <sam.just@inktank.com>
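The idea of adjusting num_dirty in one place can be sketched like this. It is a simplified Python model, not the OSD code: `num_dirty_delta` and its boolean parameters are hypothetical stand-ins for the state read from undirty, ctx->obc->obs.oi, and ctx->new_obs->oi.

```python
def num_dirty_delta(was_dirty, now_exists, undirty):
    """Compute the single num_dirty adjustment for one transaction.

    was_dirty:  the object was dirty before the transaction
                (an object that did not exist counts as not dirty)
    now_exists: the object exists after the transaction
    undirty:    the transaction explicitly clears the dirty flag
    """
    now_dirty = now_exists and not undirty
    return int(now_dirty) - int(was_dirty)


# Delete followed by recreate in the same transaction on a dirty object:
# the object ends up dirty again, so the count must not change.
print(num_dirty_delta(was_dirty=True, now_exists=True, undirty=False))
```

Because the delta depends only on the state before and after the transaction, intermediate steps such as a delete followed by a recreate cannot leave the counter out of sync.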
Sage Weil [Wed, 5 Mar 2014 23:58:52 +0000 (15:58 -0800)]
mon/OSDMonitor: fix pool deletion checks, races
Unify the pool deletion safety checks into a single set of functions.
Make sure we check the committed state and error out if there is a problem.
Also check the pending state, if any, and delay+retry if there is a
problem there.
This ensures that we correctly verify that a pool is not in use when it
is deleted (by another tier or by cephfs). These checks are also now
applied to librados calls.
Fixes: #7590 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 5 Mar 2014 18:58:37 +0000 (10:58 -0800)]
mon: warn when pool nears target max objects/bytes
The cache pools will throttle when they reach the target max size, so it
is important to make the administrator aware when they approach that point.
Unfortunately it is not particularly easy to efficiently keep track of
which PGs have hit their limit and use that for reporting. However, it
is easy to raise a flag when we start to approach the target for the
entire pool, and that sort of early warning is arguably more useful
anyway.
Trigger the warning based on the target full ratio. Not when we hit the
target, but when we are 2/3 between it and completely full.
Implements: #7442 Signed-off-by: Sage Weil <sage@inktank.com>
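The trigger point can be worked through with a small Python sketch. The function names and the treatment of a zero target as "no limit" are assumptions for illustration; the 2/3 fraction is taken from the commit message.

```python
def near_full_threshold(target_full_ratio, warn_fraction=2.0 / 3.0):
    """Fraction of the target max at which the warning should fire.

    The threshold sits warn_fraction of the way from the target full
    ratio to completely full (1.0).
    """
    return target_full_ratio + (1.0 - target_full_ratio) * warn_fraction


def should_warn(used, target_max, target_full_ratio):
    if target_max <= 0:
        return False  # assumption: zero means no target is configured
    return used > target_max * near_full_threshold(target_full_ratio)


# With a target full ratio of 0.8, the warning fires at roughly 93% of
# the target max: 0.8 + (1.0 - 0.8) * 2/3 = 14/15.
print(near_full_threshold(0.8))
print(should_warn(used=1300, target_max=1500, target_full_ratio=0.8))
print(should_warn(used=1450, target_max=1500, target_full_ratio=0.8))
```

The same check would apply to either the object count or the byte count of the pool.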
Sage Weil [Wed, 5 Mar 2014 18:44:41 +0000 (10:44 -0800)]
mon/PGMap: return empty stats if pool is not in sum
Greg was right!
When a pool is created, the PGs are not added to the PGMap until the *next*
proposal. Weaken the assert here and return empty stats for non-existent
(new) pools so that a pool create + tier add sequence does not crash.
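A minimal Python sketch of the weakened lookup, assuming a hypothetical `PoolStats` record and a plain dict in place of the PGMap's per-pool sums:

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Hypothetical stand-in for the per-pool stat summary."""
    num_objects: int = 0
    num_bytes: int = 0


def get_pool_stats(pool_sums, pool_id):
    # A pool created in the current proposal has no entry yet, so return
    # empty stats instead of asserting that the entry exists.
    return pool_sums.get(pool_id, PoolStats())


pool_sums = {1: PoolStats(num_objects=10, num_bytes=4096)}
print(get_pool_stats(pool_sums, 1))
print(get_pool_stats(pool_sums, 2))  # new pool: empty stats, no crash
```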
cursor.omap_offset indicates the most recently recovered key; we continue
filling in at the smallest key k | k > cursor.omap_offset. If the loop
as written terminates due to !(left > 0), iter points at the next key to
copy, rather than the last key copied, resulting in the next copy
operation skipping that key.
Now, iter, if valid, must point to the last key copied once the loop has
completed since we check left <= 0 prior to advancing iter. We can
therefore use it to fill in cursor.omap_offset.
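The off-by-one can be reproduced with a small Python model. This is only a sketch: `copy_chunk`, `recover`, the key list, and the per-key budget of one unit are hypothetical, and a list index stands in for the omap iterator.

```python
from bisect import bisect_right


def copy_chunk(keys, cursor, budget, fixed=True):
    """Copy up to `budget` keys greater than `cursor` from sorted `keys`.

    Returns (copied, new_cursor). Resuming always starts at the smallest
    key k with k > cursor.
    """
    i = 0 if cursor is None else bisect_right(keys, cursor)
    copied, left = [], budget
    if fixed:
        while i < len(keys):
            copied.append(keys[i])
            left -= 1
            if left <= 0:
                break  # checked before advancing: i is the last key copied
            i += 1
    else:
        while left > 0 and i < len(keys):
            copied.append(keys[i])
            left -= 1
            i += 1  # on exit, i is the next key to copy, not the last copied
    if i < len(keys):
        cursor = keys[i]
    elif copied:
        cursor = copied[-1]
    return copied, cursor


def recover(keys, budget, fixed):
    """Run copy_chunk repeatedly until nothing more is copied."""
    cursor, out = None, []
    while True:
        copied, cursor = copy_chunk(keys, cursor, budget, fixed)
        if not copied:
            return out
        out.extend(copied)


keys = ["a", "b", "c", "d", "e"]
print(recover(keys, budget=2, fixed=False))  # "c" is skipped
print(recover(keys, budget=2, fixed=True))   # every key is copied
```

In the buggy variant the cursor is set to a key that was never copied, so the next pass resumes after it.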
Samuel Just [Tue, 4 Mar 2014 23:21:09 +0000 (15:21 -0800)]
ReplicatedPG::fill_in_copy_get: fix early return bug
This is not a leak: we are in an else block where cb must
be NULL. The fix as introduced did not include braces on
the if causing the method to return unconditionally.
Fixes: #7604
Introduced in: 500206d809f0cd85cd99e4f0ec164bbf74f92c28 Reviewed-by: David Zafman <david.zafman@inktank.com> Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Mon, 3 Mar 2014 23:33:51 +0000 (15:33 -0800)]
ECBackend,ReplicatedPG: delete temp if we didn't get the transaction
We always send the transaction for operations on temp objects,
but if we didn't get the final transaction on the actual object,
we might end up failing to remove the temp object. Thus, if
we get a sub op and don't have the transaction, just remove the
named temp objects.
Fixes: #7447 Signed-off-by: Samuel Just <sam.just@inktank.com>
This is a friendlier interface for setting up a cache tier with some
reasonable defaults (defined via config options). This will simplify
the user experience and documentation.
Sage Weil [Mon, 3 Mar 2014 19:32:48 +0000 (11:32 -0800)]
mon/OSDMonitor: handle 'osd tier add ...' race/corner case
If you have two racing requests to add two different pools as a tier, the
committed checks will pass but the proposals will conflict. Recheck the
pending pools for the same conditions and wait for a commit if they
occur.
Reported-by: Loic Dachary <loic@dachary.org> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 4 Mar 2014 05:11:17 +0000 (21:11 -0800)]
OSDMonitor: do not add non-empty tier pool unless forced
In general, users should not use non-empty pools as new tiers or else
things can behave strangely:
- the data sets are unrelated, so behavior will be... strange.
- if the cache pool is not "new" and does not do the OMAP flag, the OSD
will not know not to flush omap objects to an EC base tier
- probably other random stuff I'm forgetting
Allow a user to shoot themselves in the foot with --force-nonempty.
Implements: #7457 Signed-off-by: Sage Weil <sage@inktank.com>
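The guard can be sketched in a few lines of Python. The function name, its parameters, and the -ENOTEMPTY return value are assumptions for illustration; only the --force-nonempty override comes from the commit message.

```python
import errno


def check_tier_add(tier_num_objects, force_nonempty=False):
    """Return 0 if the pool may be added as a tier, else a negative errno."""
    if tier_num_objects > 0 and not force_nonempty:
        return -errno.ENOTEMPTY  # assumed error code for a non-empty tier pool
    return 0


print(check_tier_add(0))
print(check_tier_add(42))
print(check_tier_add(42, force_nonempty=True))
```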
Samuel Just [Mon, 3 Mar 2014 01:31:38 +0000 (17:31 -0800)]
TestPGLog: tests for proc_replica_log/merge_log equivalence
We need the merge_log and proc_replica_log paths to result in the
same missing set. This patch adds some machinery for specifying
a log merge scenario and comparing both paths to the same correct
result. This machinery also makes it a bit easier to read and add
new tests.
Samuel Just [Sun, 2 Mar 2014 21:42:16 +0000 (13:42 -0800)]
PGLog::proc_replica_log: _merge_divergent_entries based on truncated olog
We can't merge using the primary's log since we haven't decided whether
to send them a complete log yet. Thus, merge based on the truncated olog
rather than the primary's log. This is a consequence of the division
between trimming divergent entries in peering/unfound search and sending
a complete log to actual members of the actingbackfill set in activate().
_merge_divergent_entries on the truncated log and add_next_event() on the
newer entries result in the same missing/log regardless of the order in
which they are performed.
In the first case, we should end up with foo removed from missing
at the end. In the second, we need foo added to missing at 1,1.
It's far simpler to present all of the divergent entries for a single
object at once.