Sage Weil [Thu, 31 Jul 2014 16:13:11 +0000 (09:13 -0700)]
osd/ReplicatedPG: check agent_mode if agent is enabled but hit_sets aren't
It is probably not a good idea to try to run the tiering agent without a
hit_set to inform its actions, but it is technically possible. For
example, one could simply blindly evict when we reach the full point.
However, this doesn't work because the agent mode is guarded by a hit_set
check, even though agent_setup() is not. Fix that.
In lookup_pool and pool_delete, a lock is taken
before invoking wait_for_osdmap, but is not
released for the failure case of the call. Fixing the same.
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shutdown, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd will calculate max dirty object according to rbd_cache_max_size, it
doesn't suitable for every case. If user set image order 24, the calculating
result is too small for reality. It will increase the overhead of trim call
which is called each read/write op.
Now we make it as option for tunning, by default this value is calculated.
Danny Al-Gaaf [Wed, 4 Jun 2014 21:22:18 +0000 (23:22 +0200)]
librbd/internal.cc: check earlier for null pointer
Fix potential null ponter deref, move check for 'order != NULL'
to the beginning of the function to prevent a) deref in ldout() call
and b) to leave function as early as possible if check fails.
[src/librbd/internal.cc:843] -> [src/librbd/internal.cc:865]: (warning)
Possible null pointer dereference: order - otherwise it is redundant
to check it against null.
librbd: check return code and error out if invalidate_cache fails
This will only happen when shrinking or rolling back an image is done
while other I/O is in flight to the same ImageCtx. This is unsafe, so
return an error before performing the resize or rollback.
Sage Weil [Sat, 19 Jul 2014 06:16:09 +0000 (23:16 -0700)]
os/LFNIndex: only consider alt xattr if nlink > 1
If we are doing a lookup, the main xattr fails, we'll check if there is an
alt xattr. If it exists, but the nlink on the inode is only 1, we will
kill the xattr. This cleans up the mess left over by an incomplete
lfn_unlink operation.
This resolves the problem with an lfn_link to a second long name that
hashes to the same short_name: we will ignore the old name the moment the
old link goes away.
Sage Weil [Sat, 19 Jul 2014 00:28:18 +0000 (17:28 -0700)]
os/LFNIndex: remove alt xattr after unlink
After we unlink, if the nlink on the inode is still non-zero, remove the
alt xattr. We can *only* do this after the rename or unlink operation
because we don't want to leave a file system link in place without the
matching xattr; hence the fsync_dir() call.
Note that this might leak an alt xattr if we happen to fail after the
rename/unlink but before the removexattr is committed. We'll fix that
next.
Sage Weil [Sat, 19 Jul 2014 00:09:07 +0000 (17:09 -0700)]
os/LFNIndex: handle long object names with multiple links (i.e., rename)
When we rename an object (collection_move_rename) to a different name, and
the name is long, we run into problems because the lfn xattr can only track
a single long name linking to the inode. For example, suppose we have
foobar -> foo_123_0 (attr: foobar) where foobar hashes to 123.
At first, collection_add could only link a file to another file in a
different collection with the same name. Allowing collection_move_rename
to rename the file, however, means that we have to convert:
col1/foobar -> foo_123_0 (attr: foobar)
to
col1/foobaz -> foo_234_0 (attr: foobaz)
This is a problem because if we link, reset xattr, unlink we end up with
col1/foobar -> foo_123_0 (attr: foobaz)
if we restart after we reset the attr. This will cause the initial foobar
lookup to since the attr doesn't match, and the file won't be able to be
looked up.
Fix this by allow *two* (long) names to link to the same inode. If we
lfn_link a second (different) name, move the previous name to the "alt"
xattr and set the new name. (This works because link is always followed
by unlink.) On lookup, check either xattr.
Don't even bother to remove the alt xattr on unlink. This works as long
as the old name and new name don't hash to the same shortname and end up
in the same LFN chain. (Don't worry, we'll fix that next.)
Only decode + characters to spaces if we're in a query argument. The +
query argument. The + => ' ' translation is not correct for
file/directory names.
rgw: call processor->handle_data() again if needed
Fixes: #8937
Following the fix to #8928 we end up accumulating pending data that
needs to be written. Beforehand it was working fine because we were
feeding it with the exact amount of bytes we were writing.
Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.
Sage Weil [Thu, 5 Jun 2014 17:43:16 +0000 (10:43 -0700)]
include/atomic: make 32-bit atomic64_t unsigned
This fixes
In file included from test/perf_counters.cc:19:0:
./common/perf_counters.h: In member function ‘std::pair PerfCounters::perf_counter_data_any_d::read_avg() const’:
warning: ./common/perf_counters.h:156:36: comparison between signed and unsigned integer expressions [-Wsign-compare]
} while (avgcount2.read() != count);
^
~~~~
./include/atomic.h: In member function 'size_t ceph::atomic_t::inc()':
./include/atomic.h:42:36: error: 'AO_fetch_and_add1' was not declared in this scope
return AO_fetch_and_add1(&val) + 1;
^
./include/atomic.h: In member function 'size_t ceph::atomic_t::dec()':
./include/atomic.h:45:42: error: 'AO_fetch_and_sub1_write' was not declared in this scope
return AO_fetch_and_sub1_write(&val) - 1;
^
./include/atomic.h: In member function 'void ceph::atomic_t::add(size_t)':
./include/atomic.h:48:36: error: 'AO_fetch_and_add' was not declared in this scope
AO_fetch_and_add(&val, add_me);
^
./include/atomic.h: In member function 'void ceph::atomic_t::sub(int)':
./include/atomic.h:52:48: error: 'AO_fetch_and_add_write' was not declared in this scope
AO_fetch_and_add_write(&val, (AO_t)negsub);
^
./include/atomic.h: In member function 'size_t ceph::atomic_t::dec()':
./include/atomic.h:46:5: warning: control reaches end of non-void function [-Wreturn-type]
}
^
make[5]: *** [cls/user/cls_user_client.o] Error 1
~~~~
Sage Weil [Wed, 30 Jul 2014 20:40:33 +0000 (13:40 -0700)]
test/cli-integration/rbd: fix trailing space
Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.
Currently for pools with different rules, "ceph df" cannot report
right available space for them, respectively. For detail assisment
of the bug ,pls refer to bug report #8943
This patch fix this bug and make ceph df works correctlly.
Sage Weil [Sun, 11 May 2014 20:36:03 +0000 (13:36 -0700)]
mon: include 'max avail' in df output
Include an estimate of the maximum writeable space for each pool. Note
that this value is a conservative estimate for that pool based on the
most-full OSD. It is also potentially misleading as it is the available
space if *all* new data were written to this pool; one cannot (generally)
add up the available space for all pools.
Guang Yang [Wed, 9 Jul 2014 11:20:36 +0000 (11:20 +0000)]
Fix the PG listing issue which could miss objects for EC pool (where there is object shard and generation).
Backport: firefly Signed-off-by: Guang Yang (yguang@yahoo-inc.com)
(cherry picked from commit 228760ce3a7109f50fc0f8e3c4a5697a423cb08f)
Sage Weil [Fri, 25 Jul 2014 21:48:10 +0000 (14:48 -0700)]
osd/ReplicatedPG: requeue cache full waiters if no longer writeback
If the cache is full, we block some requests, and then we change the
cache_mode to something else (say, forward), the full waiters don't get
requeued until the cache becomes un-full. In the meantime, however, later
requests will get processed and redirected, breaking the op ordering.
Fix this by requeueing any full waiters if we see that the cache_mode is
not writeback.
Ma Jianpeng [Thu, 17 Jul 2014 00:48:34 +0000 (17:48 -0700)]
mon: OSDMonitor: add "osd pool get <pool> erasure_code_profile" command
Enable us to obtain the erasure-code-profile for a given erasure-pool.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e8ebcb79a462de29bcbabe40ac855634753bb2be)
Sage Weil [Thu, 24 Jul 2014 01:25:53 +0000 (18:25 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone()
We cannot assume that just because cache_mode is NONE that we will have
all clones present; check for the absense of the INCOMPLETE_CLONES flag
here too.
Sage Weil [Thu, 24 Jul 2014 01:24:51 +0000 (18:24 -0700)]
osd/ReplicatedPG: observed INCOMPLETE_CLONES when doing clone subsets
During recovery, we can clone subsets if we know that all clones will be
present. We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.
Sage Weil [Thu, 24 Jul 2014 01:23:56 +0000 (18:23 -0700)]
osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
When scrubbing, do not complain about missing cloens when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set. Both are
indicators that we may be missing clones and that that is okay.
Sage Weil [Thu, 24 Jul 2014 01:21:38 +0000 (18:21 -0700)]
osd/osd_types: add pg_pool_t FLAG_COMPLETE_CLONES
Set a flag on the pg_pool_t when we change cache_mode NONE. This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain. Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.
mon: OSDMonitor: 'osd pool' - if we can set it, we must be able to get it
Add support to get the values for the following variables:
- target_max_objects
- target_max_bytes
- cache_target_dirty_ratio
- cache_target_full_ratio
- cache_min_flush_age
- cache_min_evict_age
If the test is run against a cluster started with vstart.sh (which is
the case for make check), the --asok-does-not-need-root disables the use
of sudo and allows the test to run without requiring privileged user
permissions.
Commit 7dc93a9651f602d9c46311524fc6b54c2f1ac595 fixed an incorrect
behavior with the OSD's 'osd bench' value hard-caps. The test wasn't
appropriately modified unfortunately.
qa/workunits: cephtool: split into properly indented functions
The test was a big sequence of commands being run and it has been growing
organically for a while, even though it has maintained a sense of
locality with regard to the portions being tested.
This patch intends to split the commands into functions, allowing for a
better semantic context and easier expansion. On the other hand, this
will also allow us to implement mechanisms to run specific portions of
the test instead of always having to run the whole thing just to test a
couple of lines down at the bottom (or have to creatively edit the test).
mon: OSDMonitor: be scary about inconsistent pool tier ids
We may not crash your cluster, but you'll know that this is not something
that should have happened. Big letters makes it obvious. We'd make them
red too if we bothered to look for the ANSI code.
Sage Weil [Tue, 22 Jul 2014 20:16:11 +0000 (13:16 -0700)]
osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e, we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation. We don't care about staration here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.
Sage Weil [Tue, 22 Jul 2014 20:11:42 +0000 (13:11 -0700)]
osd: allow greedy get_write() for ObjectContext locks
There are several lockers that need to take a write lock
because there is an operation that is already in progress and
know it is safe to do so. In particular, they need to skip
the starvation checks (op waiters, backfill waiting).
Sage Weil [Wed, 2 Jul 2014 17:38:43 +0000 (10:38 -0700)]
qa/workunits/rest/test.py: make osd create test idempotent
Avoid possibility that we create multiple OSDs do to retries by passing in
the optional uuid arg. (A stray osd id will make the osd tell tests a
few lines down fail.)
Loic Dachary [Wed, 18 Jun 2014 22:49:13 +0000 (00:49 +0200)]
erasure-code: create default profile if necessary
After an upgrade to firefly, the existing Ceph clusters do not have the
default erasure code profile. Although it may be created with
ceph osd erasure-code-profile set default
it was not included in the release notes and is confusing for the
administrator.
The *osd pool create* and *osd crush rule create-erasure* commands are
modified to implicitly create the default erasure code profile if it is
not found.
In order to avoid code duplication, the default erasure code profile
code creation that happens when a new firefly ceph cluster is created is
encapsulated in the OSDMap::get_erasure_code_profile_default method.
Conversely, handling the pending change in OSDMonitor is not
encapsulated in a function but duplicated instead. If it was a function
the caller would need a switch to distinguish between the case when goto
wait is needed, or goto reply or proceed because nothing needs to be
done. It is unclear if having a function would lead to smaller or more
maintainable code.
Sage Weil [Fri, 9 May 2014 15:41:33 +0000 (08:41 -0700)]
osd: cancel agent_timer events on shutdown
We need to cancel all agent timer events on shutdown. This also needs to
happen early so that any in-progress events will execute before we start
flushing and cleaning up PGs.
Sage Weil [Tue, 8 Jul 2014 23:11:44 +0000 (16:11 -0700)]
osd: s/applying repop/canceling repop/
The 'applying' language dates back to when we would wait for acks from
replicas before applying writes locally. We don't do any of that any more;
now, this loop just cancels the repops with remove_repop() and some other
cleanup.