Sage Weil [Fri, 29 Aug 2014 15:29:35 +0000 (08:29 -0700)]
mds/RecoveryQueue: do not start prioritized items synchronously
When we prioritize an item move it into a second priority list/set, but
do not start immediately, so that we still obey the max. When we go to
start an item, pull items first off the priority list, then off the regular
list.
Sage Weil [Thu, 14 Aug 2014 21:52:40 +0000 (14:52 -0700)]
mds/RecoveryQueue: add method to prioritize a file recovery; fix logging
Add a prioritize() method to make file recovery start immediately for the
given inode. Note that this doesn't respect the max recovery limit: if
someone stats it, they are blocking, and we start the recovery immediately.
Also fix up the dout logging a bit so that everything is prefixed
consistently.
Sage Weil [Thu, 14 Aug 2014 21:39:29 +0000 (14:39 -0700)]
mds: change mds_max_file_recover from 5 -> 32
These are reasonably cheap operations (stat) and we should be too worried
about queueing up a bunch of them.
Ideally this sort of thing would magically tune to the throughput we can
get from the cluster, but until then, let's choose a default that works for
more users.
Sage Weil [Thu, 28 Aug 2014 17:59:18 +0000 (10:59 -0700)]
test/mon/*: prime mon with initial command before injection
The osdmonitor_prepare_command is very fragile. Send an initial command
to the mon beforehand. This seems to prevent the initial command from
getting combined into an early mon proposal with some other stuff.
Alternatively, we could remove these tests and this mechanism entirely as
it is likely to great in the future when the next set of mon changes are
made, but they have shown themselves to be useful it catching other
regressions, so we'll patch them up for a bit longer.
Loic Dachary [Sat, 23 Aug 2014 09:07:29 +0000 (11:07 +0200)]
erasure-code: assert the PluginRegistry lock is held when it must
Add lock to the preload method and assert that it is held by methods
requiring it. Although preload is called at bootstrap and does not
require the lock, adding it does not hurt and makes the lock policy
clearer to understand.
Loic Dachary [Thu, 21 Aug 2014 16:38:52 +0000 (18:38 +0200)]
erasure-code: add Ceph version check to plugins
Add the __erasure_code_version function to all plugins, to return the
Ceph version against which they have been compiled. When a plugin is
loaded, an error is thrown if the version of the plugin does not match
the version of the daemon loading it.
If the symbol does not exist, which will be true of older plugins, set
the version to "an older version" so it never matches.
Loic Dachary [Thu, 21 Aug 2014 16:31:02 +0000 (18:31 +0200)]
erasure-code: jerasure preloads the plugin variant
The variant selection depending on the available CPU features is
encapsulated in a helper. The helper is used in the factory() method and
in the load() method.
The factory() method may load a variant that is not the default, for
benchmark purposes. Such a variant is not preloaded by the load() method
and upgrading while running may be problematic. However, running with a
non standard variant is used for benchmarking and upgrades in this
context are not a concern.
Loic Dachary [Thu, 21 Aug 2014 16:22:18 +0000 (18:22 +0200)]
erasure-code: add directory to plugin init functions
The prototype of the init functions of erasure coded plugins is changed
from
int __erasure_code_init(char *plugin_name)
to
int __erasure_code_init(char *plugin_name, char *directory)
The jerasure plugin will find optimized variants in this directory and
load them. The load() and preload() functions of
ErasureCodePluginRegistry only use a directory instead of a more generic
parameters map. The parameters map was only used for the directory entry
anyway.
Sage Weil [Tue, 19 Aug 2014 23:48:34 +0000 (16:48 -0700)]
mon/Paxos: make backend write async
Move into the WRITING state and do the write to leveldb (or whatever the
backend is) asynchronously.
A few tricks here:
- we can't do the is_updating() state check because we will always be in
REFRESH. Instead, make commit_proposal() tolerate the case where it is
called but the top proposal isn't the one we just did (or the list is
empty). This makes the callers simpler.
- do_refresh() may call bootstrap. If we do bootstrap while in REFRESH,
don't do a sync/flush on the backend store because *we* are async
completion thread and we'll deadlock. All other callers need to wait
for this, though!
Sage Weil [Tue, 19 Aug 2014 23:45:46 +0000 (16:45 -0700)]
mon/Paxos[Service]: allow reads during WRITING state
The REFRESH state is not readable; that's when we are re-reading our state
out of leveldb, and we hold the mon_lock during the period. So, strictly
speaking, it doesn't matter whether we include it here since none of these
call sites would be visited while in that state.
Sage Weil [Sun, 17 Aug 2014 05:29:04 +0000 (22:29 -0700)]
mon/Paxos: move post-commit finish work into commit_finish()
The main change here is that we are merging the singleton and clustered
finish code together. This is mostly a code shuffle, except for one
semantic change: we now trigger the commit waiters before finish_round()
in the singleton case, whereas before we did not. I don't think there
was a specific reason why it differed from the clustered case.
Dan Mick [Tue, 12 Aug 2014 23:31:22 +0000 (16:31 -0700)]
ceph.spec.in: tests for rhel or centos need to not include _version
rhel_version and centos_version are apparently the OpenSUSE Build
names; the native macros are just "rhel" and "centos" (and contain
a version number, should it be necessary).
Dan Mick [Tue, 12 Aug 2014 21:09:43 +0000 (14:09 -0700)]
ceph.spec.in: No version on ceph-libs Obsoletes.
If we are installing with the new package structure we don't ever want the
new package to co-exist with the old one; this includes the mistakenly-
released v0.81 on Fedora, which should be removed in favor of this
version.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Erik Logtenberg [Thu, 31 Jul 2014 22:13:50 +0000 (00:13 +0200)]
ceph.spec.in, init-ceph.in: Don't autostart ceph service on Fedora.
This patch is taken from the current Fedora package and makes the upstream
ceph.spec compliant with Fedora policy. The goal is to be fully compliant
upstream so that we can replace current Fedora package with upstream
package to fix many bugs in Fedora.
Addition from Dan Mick <dan.mick@inktank.com>:
Do this for RHEL and Centos as well, since they surely will benefit
from the same policy. Note: this requires changes to
autobuild-ceph and ceph-build scripts, which currently copy
only the dist tarball to the rpmbuild/SOURCES dir.
Signed-off-by: Erik Logtenberg <erik@logtenberg.eu> Signed-off-by: Dan Mick <dan.mick@inktank.com>:
Erik Logtenberg [Thu, 31 Jul 2014 21:49:56 +0000 (23:49 +0200)]
ceph.spec.in: add ceph-libs-compat
Added a ceph-libs-compat package in accordance with Fedora packaging
guidelines [1], to handle the recent package split more gracefully.
In Fedora this is necessary because there are already other packages
depending on ceph-libs, that need to be adjusted to depend on the new
split packages instead. In the mean time, ceph-libs-compat prevents
breakage.
Sage Weil [Wed, 27 Aug 2014 13:19:12 +0000 (06:19 -0700)]
osd/PG: fix crash from second backfill reservation rejection
If we get more than one reservation rejection we should ignore them; when
we got the first we already sent out cancellations. More importantly, we
should not crash.
Fixes: #8863
Backport: firefly, dumpling Signed-off-by: Sage Weil <sage@redhat.com>
common: config: let us obtain a diff between current and default config
It's mildly annoying when trying to figure out what has been changed on
a running system's config options and having to rely on whatever is set
on ceph.conf and the admin's memory of what has been injected.
With this we can simply ask the daemon for the diff between what would be
its default and what is its current config.
Current form will output extraneous information that was not directly
supplied by the user though, such as 'host' 'fsid' and 'daemonize', as
well as defaults we may rewrite ourselves (leveldb tunables on the monitor
for instance). Nonetheless, it's way better than the alternative and
considering it should be used solely for debug purposes I think we can
get away with it.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Tue, 26 Aug 2014 15:16:29 +0000 (08:16 -0700)]
osd/OSDMap: encode blacklist in deterministic order
When we use an unordered_map the encoding order is non-deterministic,
which is problematic for OSDMap. Construct an ordered map<> on encode
and use that. This lets us keep the hash table for lookups in the general
case.
Fixes: #9211
Backport: firefly Signed-off-by: Sage Weil <sage@redhat.com>
Loic Dachary [Mon, 25 Aug 2014 15:05:04 +0000 (17:05 +0200)]
common: ROUND_UP_TO accepts any rounding factor
The ROUND_UP_TO function was limited to rounding factors that are powers
of two. This saves a modulo but it is not used where it would make a
difference. The implementation is changed so it is generic.
We need to identify whether an object is just composed of a head, or
also has a tail. Test for pre-firefly objects ("explicit objs") was
broken as it was just looking at the number of explicit objs in the
manifest. However, this is insufficient, as we might have empty head,
and in this case it wouldn't appear, so we need to check whether the
sole object is actually pointing at the head.
Somnath Roy [Mon, 18 Aug 2014 23:59:36 +0000 (16:59 -0700)]
CollectionIndex: Collection name is added to the access_lock name
The CollectionIndex constructor is changed to accept the coll_t
so that the collection name can be used to form access_lock(RWLock)
name.This is needed otherwise lockdep will report a recursive lock error
and assert. lockdep needs unique lock names for each Index object.
Sage Weil [Mon, 25 Aug 2014 04:18:00 +0000 (21:18 -0700)]
msg/Accepter: do not unlearn_addr on bind()
It is dangerous to set need_addr = true as it means someone may set the
addr to something else (specifically the port) in a racing thread.
However, it is not necessary: the only reason we added it way back in 5d5045d31a9e10d21b44eb1bd137db9ae53128ff was so that
local_connection->peer_addr would get updated, and bind() now calls that
unconditionally.
Fixes: #9079
Backport: firefly Signed-off-by: Sage Weil <sage@redhat.com>
John Spray [Mon, 25 Aug 2014 00:45:22 +0000 (01:45 +0100)]
osd: update handle_osd_map call
I had changed the implementation in Objecter
to avoid a spurious get/put cycle in "osdc/Objecter: fix resource
management", but this guy was still going a get() before
calling handle_osd_map.
John Spray [Mon, 25 Aug 2014 00:16:39 +0000 (01:16 +0100)]
osdc/Objecter: fix op_cancel on homeless session
Wrote this block without realizing that op_cancel
takes write lock on session lock, and that operation
is undefined when you already hold the read lock.
Fixes: #9214 Signed-off-by: John Spray <john.spray@redhat.com>