Ma Jianpeng [Thu, 17 Jul 2014 00:48:34 +0000 (17:48 -0700)]
mon: OSDMonitor: add "osd pool get <pool> erasure_code_profile" command
Enable us to obtain the erasure code profile for a given erasure-coded pool.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e8ebcb79a462de29bcbabe40ac855634753bb2be)
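On the CLI this is simply 'ceph osd pool get <pool> erasure_code_profile'.
The same query can also be issued programmatically through librados'
mon_command() interface; a minimal sketch (the pool name "ecpool" is a
placeholder):

    #include <rados/librados.hpp>
    #include <iostream>

    int main() {
      librados::Rados cluster;
      cluster.init("admin");            // connect as client.admin
      cluster.conf_read_file(nullptr);  // use the default ceph.conf
      if (cluster.connect() < 0)
        return 1;

      librados::bufferlist inbl, outbl;
      std::string outs;
      int r = cluster.mon_command(
          "{\"prefix\": \"osd pool get\", \"pool\": \"ecpool\","
          " \"var\": \"erasure_code_profile\"}",
          inbl, &outbl, &outs);
      if (r == 0)
        std::cout << std::string(outbl.c_str(), outbl.length()) << std::endl;
      cluster.shutdown();
      return 0;
    }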
Sage Weil [Thu, 24 Jul 2014 01:25:53 +0000 (18:25 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone()
We cannot assume that just because cache_mode is NONE we will have all
clones present; check for the absence of the INCOMPLETE_CLONES flag here
too.
Sage Weil [Thu, 24 Jul 2014 01:24:51 +0000 (18:24 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES when doing clone subsets
During recovery, we can clone subsets if we know that all clones will be
present. We skip this on caching pools because clones may be missing; do
the same when INCOMPLETE_CLONES is set.
Sage Weil [Thu, 24 Jul 2014 01:23:56 +0000 (18:23 -0700)]
osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set. Both are
indicators that clones may be missing and that this is okay.
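The three changes above all gate on the same predicate; a minimal sketch
of the idea (simplified stand-ins, not the actual pg_pool_t code):

    struct pool_info {
      static const unsigned FLAG_INCOMPLETE_CLONES = 1u << 0;  // placeholder bit
      int cache_mode = 0;    // 0 == NONE
      unsigned flags = 0;
      bool has_flag(unsigned f) const { return flags & f; }
    };

    // Clones may legitimately be missing if the pool is in a caching
    // mode *or* was one at some point (INCOMPLETE_CLONES set).
    bool allow_incomplete_clones(const pool_info& p) {
      return p.cache_mode != 0 || p.has_flag(pool_info::FLAG_INCOMPLETE_CLONES);
    }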
Sage Weil [Thu, 24 Jul 2014 01:21:38 +0000 (18:21 -0700)]
osd/osd_types: add pg_pool_t FLAG_INCOMPLETE_CLONES
Set a flag on the pg_pool_t when we change the cache_mode to NONE. This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain. Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.
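A sketch of the hook, reusing the pool_info stand-in from the sketch
above:

    // Any transition to or from cache_mode NONE marks the pool: objects
    // promoted while caching may lack clones and can outlive the mode.
    void set_cache_mode(pool_info& p, int new_mode) {
      if (p.cache_mode == 0 || new_mode == 0)
        p.flags |= pool_info::FLAG_INCOMPLETE_CLONES;
      p.cache_mode = new_mode;
    }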
mon: OSDMonitor: 'osd pool' - if we can set it, we must be able to get it
Add support to get the values for the following variables:
- target_max_objects
- target_max_bytes
- cache_target_dirty_ratio
- cache_target_full_ratio
- cache_min_flush_age
- cache_min_evict_age
If the test is run against a cluster started with vstart.sh (which is
the case for make check), the --asok-does-not-need-root option disables
the use of sudo and allows the test to run without requiring root
privileges.
Commit 7dc93a9651f602d9c46311524fc6b54c2f1ac595 fixed incorrect
behavior of the OSD's 'osd bench' value hard-caps, but unfortunately the
test wasn't updated to match.
qa/workunits: cephtool: split into properly indented functions
The test was one big sequence of commands that had been growing
organically for a while, even though it maintained a sense of locality
with regard to the portions being tested.
This patch intends to split the commands into functions, allowing for a
better semantic context and easier expansion. On the other hand, this
will also allow us to implement mechanisms to run specific portions of
the test instead of always having to run the whole thing just to test a
couple of lines down at the bottom (or have to creatively edit the test).
mon: OSDMonitor: be scary about inconsistent pool tier ids
We may not crash your cluster, but you'll know that this is not something
that should have happened. Big letters make it obvious. We'd make them
red too if we bothered to look for the ANSI code.
Sage Weil [Tue, 22 Jul 2014 20:16:11 +0000 (13:16 -0700)]
osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e., we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation. We don't care about starvation here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.
Sage Weil [Tue, 22 Jul 2014 20:11:42 +0000 (13:11 -0700)]
osd: allow greedy get_write() for ObjectContext locks
There are several lockers that need to take a write lock
because an operation is already in progress and they know it
is safe to do so. In particular, they need to skip the
starvation checks (op waiters, backfill waiting).
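A generic sketch of the locking idea (simplified; not the actual
ObjectContext RWState code):

    #include <mutex>

    class ObjLock {
      std::mutex m;
      int readers = 0;
      bool writer = false;
      int waiters = 0;   // ops/backfill already queued on this object

    public:
      // Fair path: refuse the lock if anyone is queued ahead of us, so
      // queued ops are not starved; the caller requeues and waits its
      // turn.
      bool get_write() {
        std::lock_guard<std::mutex> l(m);
        if (waiters > 0 || readers > 0 || writer)
          return false;
        writer = true;
        return true;
      }

      // Greedy path: skip the waiter check. Only safe for an op that
      // is already in progress and cannot easily be aborted -- new ops
      // still go through get_write(), so starvation stays bounded.
      bool get_write_greedy() {
        std::lock_guard<std::mutex> l(m);
        if (readers > 0 || writer)
          return false;
        writer = true;
        return true;
      }

      void put_write() {
        std::lock_guard<std::mutex> l(m);
        writer = false;
      }
    };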
Sage Weil [Wed, 2 Jul 2014 17:38:43 +0000 (10:38 -0700)]
qa/workunits/rest/test.py: make osd create test idempotent
Avoid the possibility that we create multiple OSDs due to retries by
passing in the optional uuid arg. (A stray osd id would make the 'osd
tell' tests a few lines down fail.)
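A sketch of the retry-safe call, reusing the mon_command() boilerplate
from the earlier librados sketch (the uuid below is a placeholder):

    // With a fixed uuid, a retried "osd create" returns the id of the
    // OSD already created for that uuid instead of allocating a new one.
    int r = cluster.mon_command(
        "{\"prefix\": \"osd create\","
        " \"uuid\": \"11111111-2222-3333-4444-555555555555\"}",
        inbl, &outbl, &outs);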
Sage Weil [Fri, 9 May 2014 15:41:33 +0000 (08:41 -0700)]
osd: cancel agent_timer events on shutdown
We need to cancel all agent timer events on shutdown. This also needs to
happen early so that any in-progress events will execute before we start
flushing and cleaning up PGs.
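The ordering matters: cancel the events before any PG teardown starts.
A simplified sketch (Timer stands in for the real SafeTimer):

    #include <mutex>

    struct Timer {
      void cancel_all_events() {}   // stand-in for SafeTimer's cancel
    };

    struct OsdLike {
      std::mutex agent_timer_lock;
      Timer agent_timer;

      void flush_and_cleanup_pgs() {}  // placeholder

      void shutdown() {
        {
          // Cancel agent events first, and early: any in-progress
          // event callback finishes before PGs start going away.
          std::lock_guard<std::mutex> l(agent_timer_lock);
          agent_timer.cancel_all_events();
        }
        flush_and_cleanup_pgs();
      }
    };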
Sage Weil [Tue, 8 Jul 2014 23:11:44 +0000 (16:11 -0700)]
osd: s/applying repop/canceling repop/
The 'applying' language dates back to when we would wait for acks from
replicas before applying writes locally. We don't do any of that any more;
now, this loop just cancels the repops with remove_repop() and some other
cleanup.
Sage Weil [Tue, 8 Jul 2014 23:10:58 +0000 (16:10 -0700)]
osd: separate cleanup from PGBackend::on_change()
The generic portion of on_change() cleans up temporary on-disk objects
and requires a Transaction. The rest clears out in-memory state and does
not. Separate the two.
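An interface sketch of the resulting split (names simplified):

    struct Transaction;  // stand-in for ObjectStore::Transaction

    struct PGBackendLike {
      // Removes temporary on-disk objects; needs a transaction.
      virtual void on_change_cleanup(Transaction *t) = 0;
      // Clears in-memory state only; no transaction required.
      virtual void on_change() = 0;
      virtual ~PGBackendLike() = default;
    };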
mon: AuthMonitor: always encode full regardless of keyserver having keys
On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers. A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.
As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals. This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk. This is performed in update_from_paxos(), and the
AuthMonitor will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.
Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first. If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals with and keeps
track of auth global ids. As such, we expect to read version 1, then
version 2, and so on. If we trim at some point, however, this will not be
possible, as version 1 will no longer exist -- and we will assert because
of that.
This is fixed by ensuring the AuthMonitor always keeps track of full
versions of the key server, even when the key server is empty -- its
version is still tracked and is incremented each time we update from
paxos, even if nothing changed.
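A toy model of the failure, to make the trimming interaction concrete
(illustrative only, not the AuthMonitor code):

    #include <map>
    #include <cassert>

    int main() {
      std::map<int, int> incrementals;   // version -> delta (toy)
      for (int v = 1; v <= 10; ++v)
        incrementals[v] = v;

      int full_version = 0;              // no full was ever encoded
      // Trim versions < 6, as the monitor does every now and then.
      incrementals.erase(incrementals.begin(), incrementals.lower_bound(6));

      // update_from_paxos(): replay everything after the last full.
      for (int v = full_version + 1; v <= 10; ++v) {
        // Asserts at v == 1: that incremental was trimmed. Encoding a
        // full (even of an empty keyserver) would have bumped
        // full_version past the trimmed range.
        assert(incrementals.count(v));
      }
      return 0;
    }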
rgw: account common prefixes for MaxKeys in bucket listing
To be more in line with the S3 api. Previously we didn't count the
common prefixes towards MaxKeys (a single common prefix counts as a
single key). We also need to adjust the marker now if it points at a
common prefix.
If we found a prefix, calculate a string greater than it so that on the
next request we can skip past it. This is still not the most efficient
way to do it; it would be better to push this down to the objclass, but
that would require a much bigger change.
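The "string greater than it" is the smallest key that sorts after every
key sharing the common prefix; a sketch of one way to compute it:

    #include <string>

    std::string after_prefix(std::string p) {
      // Trailing 0xff bytes have no successor byte; drop them first.
      while (!p.empty() && static_cast<unsigned char>(p.back()) == 0xff)
        p.pop_back();
      if (p.empty())
        return p;   // all-0xff prefix: nothing sorts after it
      p.back() = static_cast<char>(static_cast<unsigned char>(p.back()) + 1);
      return p;
    }
    // e.g. after_prefix("photos/") == "photos0" ('/' + 1 == '0'), so the
    // next listing request resumes just past the whole common prefix.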
Sage Weil [Thu, 17 Jul 2014 23:40:06 +0000 (16:40 -0700)]
logrotate.conf: fix osd log rotation under upstart
In commit 7411c3c6a42bef5987bdd76b1812b01686303502 we generalized this
enumeration code by copying what was in the upstart scripts. However,
while the mon and mds directories get a 'done' file, the OSDs get a 'ready'
file. Bah! Trigger off of either one.
rgw: don't try to wait for pending if list is empty
Fixes: #8846
Backport: firefly, dumpling
This was broken by ea68b9372319fd0bab40856db26528d36359102e: we ended
up calling wait_pending_front() when the pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).
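A toy sketch of the corrected shape (illustrative, not the rgw code):

    #include <deque>

    template <typename IO>
    struct PendingQueue {
      std::deque<IO> pending;
      size_t max_inflight = 8;

      void wait_pending_front() {
        pending.pop_front();   // stand-in for waiting on the oldest IO
      }

      // Throttle where IOs are issued, not where completions drain.
      void submit(const IO& io) {
        while (!pending.empty() && pending.size() >= max_inflight)
          wait_pending_front();
        pending.push_back(io);
      }

      void drain() {
        while (!pending.empty())   // the emptiness guard that was missing
          wait_pending_front();
      }
    };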
Sage Weil [Wed, 16 Jul 2014 01:11:41 +0000 (18:11 -0700)]
init-ceph: wrap daemon startup with systemd-run when running under systemd
We want to make sure the daemon runs in its own systemd environment. Check
for systemd as pid 1 and, when present, use systemd-run -r <cmd> to do
this.
Probably fixes #7627
Signed-off-by: Sage Weil <sage@redhat.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Tested-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 3e0d9800767018625f0e7d797c812aa44c426dab)
A while ago we bumped the head version and reset the compat version to
0. Doing this happens to make the messenger assume that the message does
not support compat versioning, so it sets the compat version to the head
version -- making compat = 2 when it should have been 1.
The nasty side-effect of this is that when upgrading from emperor to
firefly, emperor leaders will be unable to decode messages forwarded by
firefly peons.
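A sketch of the messenger rule that bites here:

    #include <cstdint>

    // compat_version == 0 is read as "no compat versioning", so the
    // effective compat silently becomes the head version (2 here)
    // instead of the intended 1.
    uint8_t effective_compat(uint8_t head_version, uint8_t compat_version) {
      return compat_version ? compat_version : head_version;
    }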
Dan Mick [Thu, 3 Jul 2014 23:08:44 +0000 (16:08 -0700)]
Fix/add missing dependencies:
- rbd-fuse depends on librados2/librbd1
- ceph-devel depends on specific releases of libs and libcephfs_jni1
- librbd1 depends on librados2
- python-ceph does not depend on libcephfs1
mon: check changes to the whole CRUSH map and to tunables against cluster features
When we change the tunables, or set a new CRUSH map, we need to make sure it's
supported by all the monitors and OSDs currently participating in the cluster.
Disable this test until hitget-get reliably works on EC pools (currently
it does not, and this test usually passes only because we get the in-memory
HitSet).
Yehuda Sadeh [Wed, 11 Jun 2014 23:50:41 +0000 (16:50 -0700)]
rgw: set a default data extra pool name
Fixes: #8585
Have a default name for the data extra pool; otherwise it would be
empty, which means it would default to the data pool name (a problem
with EC backends).
Yehuda Sadeh [Tue, 6 May 2014 22:35:20 +0000 (15:35 -0700)]
rgw: extend manifest to avoid old style manifest
In case we hit issue #8269 we'd like to avoid creating an old-style
manifest. Since we need parts that use a different prefix, we add a new
rule param that overrides the manifest prefix.
Yehuda Sadeh [Sat, 3 May 2014 00:06:05 +0000 (17:06 -0700)]
rgw: don't allow multiple writers to same multiobject part
Fixes: #8269
Backport: firefly, dumpling
A client might need to retry a multipart part write. The original thread
might race with the new one, trying to clean up after it, clobbering the
part's data.
The fix is to detect whether an original part already existed, and if so
use a different part name for it.
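A sketch of the naming idea (simplified placeholders, not the rgw code):

    #include <string>
    #include <random>

    std::string pick_part_name(const std::string& upload_prefix,
                               int part_num, bool part_exists) {
      std::string name = upload_prefix + "." + std::to_string(part_num);
      if (!part_exists)
        return name;   // first writer keeps the plain name
      // Retry path: a fresh unique suffix keeps raced writers on
      // distinct objects; the manifest records whichever part wins.
      std::mt19937_64 rng{std::random_device{}()};
      return name + "_" + std::to_string(rng());
    }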