Loic Dachary [Mon, 25 Aug 2014 15:05:04 +0000 (17:05 +0200)]
common: ROUND_UP_TO accepts any rounding factor
The ROUND_UP_TO function was limited to rounding factors that are powers
of two. This saves a modulo but it is not used where it would make a
difference. The implementation is changed so it is generic.
Sage Weil [Thu, 21 Aug 2014 20:05:35 +0000 (13:05 -0700)]
mon: fix occasional message leak after session reset
Consider:
- we get a message, put it on a wait list
- the client session resets
- we go back to process the message later and discard
- _ms_dispatch returns false, but nobody drops the msg ref
Since we call _ms_dispatch() a lot internally, we need to always return
true when we are an internal caller.
Loic Dachary [Thu, 21 Aug 2014 12:41:55 +0000 (14:41 +0200)]
erasure-code: preload the jerasure plugin variant (sse4,sse3,generic)
The preloading of the jerasure plugin ldopen the plugin that is in
charge of selecting the variant optimized for the
CPU (sse4,sse3,generic). The variant plugin itself is not loaded because
it does not happen at load() but when the factory() method is called.
The JerasurePlugin::preload method is modified to call the factory()
method to load jerasure_sse4 or jerasure_sse3 or jerasure_generic as a
side effect.
Indirectly loading another plugin in the factory() method is error prone
and should be moved to the load() method instead. This change should be
done in a separate commit.
Haomai Wang [Tue, 20 May 2014 06:32:18 +0000 (14:32 +0800)]
Fix set_alloc_hint op cause KeyValueStore crash problem
Now KeyValueStore doesn't support set_alloc_hit op, the implementation of
_do_transaction need to consider decoding the arguments. Otherwise, the
arguments will be regarded as the next op.
Fix the same problem for MemStore.
Fix #8381
Reported-by: Xinxin Shu <xinxin.shu5040@gmail.com> Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
(cherry picked from commit c08adbc98ff5f380ecd215f8bd9cf3cab214913c)
Loic Dachary [Mon, 18 Aug 2014 23:30:15 +0000 (01:30 +0200)]
erasure-code: preload the jerasure plugin
Load the jerasure plugin when ceph-osd starts to avoid the following
scenario:
* ceph-osd-v1 is running but did not load jerasure
* ceph-osd-v2 is installed being installed but takes time : the files
are installed before ceph-osd is restarted
* ceph-osd-v1 is required to handle an erasure coded placement group and
loads jerasure (the v2 version which is not API compatible)
* ceph-osd-v1 calls the v2 jerasure plugin and does not reference the
expected part of the code and crashes
Although this problem shows in the context of teuthology, it is unlikely
to happen on a real cluster because it involves upgrading immediately
after installing and running an OSD. Once it is backported to firefly,
it will not even happen in teuthology tests because the upgrade from
firefly to master will use the firefly version including this fix.
While it would be possible to walk the plugin directory and preload
whatever it contains, that would not work for plugins such as jerasure
that load other plugins depending on the CPU features, or even plugins
such as isa which only work on specific CPU.
Matt Benjamin [Thu, 29 May 2014 14:34:20 +0000 (10:34 -0400)]
Work around an apparent binding bug (GCC 4.8).
A reference to h->seq passed to std::pair ostensibly could not bind
because the header structure is packed. At first this looked like
a more general unaligned access problem, but the only location the
compiler rejects is a false positive.
common/config.cc: allow integer values to be parsed as SI units
We are allowing this for all and any integer values; that is, OPT_INT,
OPT_LONGLONG, OPT_U32 and OPT_U64.
It's on the user to use appropriate units. For instance, the user should
not use 'E(xabyte)' when setting a signed int, and use his best judgment
when setting options that, for instance, ought to receive seconds.
Sage Weil [Wed, 13 Aug 2014 00:25:10 +0000 (17:25 -0700)]
ceph-disk: use partition type UUIDs, and blkid
Use blkid to give us the GPT partition type. This lets us distinguish
between dmcrypt and non-dmcrypt partitions. Fake it if blkid doesn't
give us what we want and try with sgdisk. This isn't perfect (it can't
tell between dmcrypt and not dmcrypt), but such is life, and we are better
off than before.
Sage Weil [Mon, 11 Aug 2014 22:57:52 +0000 (15:57 -0700)]
ceph-disk: fix verify_no_in_use check
We only need to verify that partitions aren't in use when we want to
consume the whole device (osd data), not when we want to create an
additional partition for ourselves (osd journal).
Sage Weil [Fri, 15 Aug 2014 15:55:10 +0000 (08:55 -0700)]
osd: only require crush features for rules that are actually used
Often there will be a CRUSH rule present for erasure coding that uses the
new CRUSH steps or indep mode. If these rules are not referenced by any
pool, we do not need clients to support the mapping behavior. This is true
because the encoding has not changed; only the expected CRUSH output.
Sage Weil [Sun, 10 Aug 2014 19:15:38 +0000 (12:15 -0700)]
ceph_test_rados_api: fix cleanup of cache pool
We can't simply try to delete everything in there because some items may
be whiteouts. Instead, flush+evict everything, then remove overlay, and
*then* delete what remains.
Sage Weil [Wed, 13 Aug 2014 17:34:53 +0000 (10:34 -0700)]
osd/ReplicatedPG: only do agent mode calculations for positive values
After a split we can get negative values here. Only do the arithmetic if
we have a valid (positive) value that won't through the floating point
unit for a loop.
Fixes: #9082 Tested-by: Karan Singh <karan.singh@csc.fi> Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 5be56ff86d9f3ab2407a258a5285d0b8f52f041e)
OSD: introduce require_up_osd_peer() function for gating replica ops
This checks both that a Message originates from an OSD, and that the OSD
is up in the given map epoch.
We use it in handle_replica_op so that we don't inadvertently add operations
from down peers, who might or might not know it.
Samuel Just [Fri, 1 Aug 2014 21:04:35 +0000 (14:04 -0700)]
osd_types: s/stashed/rollback_info_completed and set on create
Originally, this flag indicated that the object had already been stashed and
that therefore recording subsequent changes is unecessary. We want to set it
on create() as well since operations like [create, writefull] should not need
to stash the object.
Sage Weil [Thu, 31 Jul 2014 18:02:55 +0000 (11:02 -0700)]
mon/OSDMonitor: warn when cache pools do not have hit_sets configured
Give users a clue when cache pools are enabled but the hit_set is not
configured. Note that technically this will work, but not well, so for
now let's just steer them away.
Sage Weil [Thu, 31 Jul 2014 16:13:11 +0000 (09:13 -0700)]
osd/ReplicatedPG: check agent_mode if agent is enabled but hit_sets aren't
It is probably not a good idea to try to run the tiering agent without a
hit_set to inform its actions, but it is technically possible. For
example, one could simply blindly evict when we reach the full point.
However, this doesn't work because the agent mode is guarded by a hit_set
check, even though agent_setup() is not. Fix that.
In lookup_pool and pool_delete, a lock is taken
before invoking wait_for_osdmap, but is not
released for the failure case of the call. Fixing the same.
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shutdown, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd will calculate max dirty object according to rbd_cache_max_size, it
doesn't suitable for every case. If user set image order 24, the calculating
result is too small for reality. It will increase the overhead of trim call
which is called each read/write op.
Now we make it as option for tunning, by default this value is calculated.
Danny Al-Gaaf [Wed, 4 Jun 2014 21:22:18 +0000 (23:22 +0200)]
librbd/internal.cc: check earlier for null pointer
Fix potential null ponter deref, move check for 'order != NULL'
to the beginning of the function to prevent a) deref in ldout() call
and b) to leave function as early as possible if check fails.
[src/librbd/internal.cc:843] -> [src/librbd/internal.cc:865]: (warning)
Possible null pointer dereference: order - otherwise it is redundant
to check it against null.
librbd: check return code and error out if invalidate_cache fails
This will only happen when shrinking or rolling back an image is done
while other I/O is in flight to the same ImageCtx. This is unsafe, so
return an error before performing the resize or rollback.
Sage Weil [Sat, 19 Jul 2014 06:16:09 +0000 (23:16 -0700)]
os/LFNIndex: only consider alt xattr if nlink > 1
If we are doing a lookup, the main xattr fails, we'll check if there is an
alt xattr. If it exists, but the nlink on the inode is only 1, we will
kill the xattr. This cleans up the mess left over by an incomplete
lfn_unlink operation.
This resolves the problem with an lfn_link to a second long name that
hashes to the same short_name: we will ignore the old name the moment the
old link goes away.
Sage Weil [Sat, 19 Jul 2014 00:28:18 +0000 (17:28 -0700)]
os/LFNIndex: remove alt xattr after unlink
After we unlink, if the nlink on the inode is still non-zero, remove the
alt xattr. We can *only* do this after the rename or unlink operation
because we don't want to leave a file system link in place without the
matching xattr; hence the fsync_dir() call.
Note that this might leak an alt xattr if we happen to fail after the
rename/unlink but before the removexattr is committed. We'll fix that
next.
Sage Weil [Sat, 19 Jul 2014 00:09:07 +0000 (17:09 -0700)]
os/LFNIndex: handle long object names with multiple links (i.e., rename)
When we rename an object (collection_move_rename) to a different name, and
the name is long, we run into problems because the lfn xattr can only track
a single long name linking to the inode. For example, suppose we have
foobar -> foo_123_0 (attr: foobar) where foobar hashes to 123.
At first, collection_add could only link a file to another file in a
different collection with the same name. Allowing collection_move_rename
to rename the file, however, means that we have to convert:
col1/foobar -> foo_123_0 (attr: foobar)
to
col1/foobaz -> foo_234_0 (attr: foobaz)
This is a problem because if we link, reset xattr, unlink we end up with
col1/foobar -> foo_123_0 (attr: foobaz)
if we restart after we reset the attr. This will cause the initial foobar
lookup to since the attr doesn't match, and the file won't be able to be
looked up.
Fix this by allow *two* (long) names to link to the same inode. If we
lfn_link a second (different) name, move the previous name to the "alt"
xattr and set the new name. (This works because link is always followed
by unlink.) On lookup, check either xattr.
Don't even bother to remove the alt xattr on unlink. This works as long
as the old name and new name don't hash to the same shortname and end up
in the same LFN chain. (Don't worry, we'll fix that next.)
Only decode + characters to spaces if we're in a query argument. The +
query argument. The + => ' ' translation is not correct for
file/directory names.
rgw: call processor->handle_data() again if needed
Fixes: #8937
Following the fix to #8928 we end up accumulating pending data that
needs to be written. Beforehand it was working fine because we were
feeding it with the exact amount of bytes we were writing.
Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.