06a245a added a section def to assembly files; I added it twice to
this file. There's no damage, but a compiler warning (on machines with
yasm installed)
Sage Weil [Thu, 13 Nov 2014 01:11:10 +0000 (17:11 -0800)]
osd/OSD: use OSDMap helper to determine if we are correct op target
Use the new helper. This fixes our behavior for EC pools where targetting
a different shard is not correct, while for replicated pools it may be. In
the EC case, it leaves the op hanging indefinitely in the OpTracker because
the pgid exists but as a different shard.
Sage Weil [Thu, 13 Nov 2014 01:04:35 +0000 (17:04 -0800)]
osd/OSDMap: add osd_is_valid_op_target()
Helper to check whether an osd is a given op target for a pg. This
assumes that for EC we always send ops to the primary, while for
replicated we may target any replica.
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds
The local filesytem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.
John Spray [Mon, 27 Oct 2014 12:02:17 +0000 (12:02 +0000)]
client: allow xattr caps in inject_release_failure
Because some test environments generate spurious
rmxattr operations, allow the client to release
'X' caps. Allows xattr operations to proceed
while still preventing client releasing other caps.
John Spray [Fri, 7 Nov 2014 11:34:43 +0000 (11:34 +0000)]
tools: fix MDS journal import
Previously it only worked on fresh filesystems which
hadn't been trimmed yet, and resulted in an invalid
trimmed_pos when expire_pos wasn't on an object
boundary.
Yan, Zheng [Mon, 27 Oct 2014 20:57:16 +0000 (13:57 -0700)]
client: fix I_COMPLETE_ORDERED checking
Current code marks a directory inode as complete and ordered when readdir
finishes, but it does not check if the directory was modified in the middle
of readdir. This is wrong, directory inode should not be marked as ordered
if it was modified during readddir
The fix is introduce a new counter to the inode data struct, we increase
the counter each time the directory is modified. When readdir finishes, we
check the counter to decide if the directory should be marked as ordered.
client: preserve ordering of readdir result in cache
Preserve ordering of readdir result in a list, so that the result of cached
readdir is consistant with uncached readdir.
As a side effect, this commit also removes the code that removes stale dentries.
This is OK because stale dentries does not have valid lease, they will be
filter out by the shared gen check in Client::_readdir_cache_cb()
client: introduce a new flag indicating if dentries in directory are sorted
When creating a file, Client::insert_dentry_inode() set the dentry's offset
based on directory's max offset. The offset does not reflect the real
postion of the dentry in directory. Later readdir reply from real postion
of the dentry in directory. Later readdir reply from MDS may change the
dentry's position/offset. This inconsistency can cause missing/duplicate
entries in readdir result if readdir is partly satisfied by dcache_readdir().
The fix is introduce a new flag indicating if dentries in directory are
sorted. We use _readdir_cache_cb() to handle readdir only when the flag is
set, clear the flag after creating/deleting/renaming file.
Loic Dachary [Thu, 2 Oct 2014 07:23:55 +0000 (09:23 +0200)]
tools: rados put /dev/null should write() and not create()
In the rados.cc special case to handle put an empty objects, use
write_full() instead of create().
A special case was introduced 6843a0b81f10125842c90bc63eccc4fd873b58f2
to create() an object if the rados put file is empty. Prior to this fix
an attempt to rados put an empty file was a noop. The problem with this
fix is that it is not idempotent. rados put an empty file twice would
fail the second time and rados put a file with one byte would succeed as
expected.
Yehuda Sadeh [Thu, 9 Oct 2014 17:20:27 +0000 (10:20 -0700)]
rgw: set length for keystone token validation request
Fixes: #7796
Backport: giany, firefly
Need to set content length to this request, as the server might not
handle a chunked request (even though we don't send anything).
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz> Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 3dd4ccad7fe97fc16a3ee4130549b48600bc485c)
Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid
m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.
Sage Weil [Thu, 18 Sep 2014 21:23:36 +0000 (14:23 -0700)]
mon: re-bootstrap if we get probed by a mon that is way ahead
During bootstrap we verify that our paxos commits overlap with the other
mons we will form a quorum with. If they do not, we do a sync.
However, it is possible we pass those checks, then fail to join a quorum
before the quorum moves ahead in time such that we no longer overlap.
Currently nothing kicks up back into a probing state to discover we need
to sync... we will just keep trying to call or join an election instead.
Fix this by jumping back to bootstrap if we get a probe that is ahead of
us. Only do this from non probe or sync states as these will be common;
it is only the active and electing states that matter (and probably just
electing!).
Sage Weil [Fri, 24 Oct 2014 16:32:20 +0000 (09:32 -0700)]
osdc/Objecter: fix tick_event handling in shutdown vs tick race
If we fail to cancel the tick_event, we rely on tick() itself to clear
tick_event. I'm not quite sure how we got this wrong in the previous
commit, but this boils down to two cases:
1) shutdown() successfully cancels the event and clears tick_event. tick()
never runs. tick_event == NULL when we finish.
2) shutdown() fails to cancel the event because it has already started. In
this case tick itself is blocking (or about to block) waiting on the
rlock. When it does run it will clear tick_event itself, then see
initiazed == 0 and exit without rescheduling.
Fixes: #9873 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Fri, 24 Oct 2014 16:20:41 +0000 (09:20 -0700)]
common/Timer: recheck stopping before sleep if we dropped the lock
If we have safe_callbacks==false, the stopping flag may have changed while
we were doing our callback. Recheck it and exit to avoid a deadlock on
shutdown.
Noah Watkins [Thu, 23 Oct 2014 20:22:52 +0000 (13:22 -0700)]
java: fill in stat structure correctly
Added stat filling helper function but only stat and lstat were updated.
This patch makes fstat use it. Crucially the fstat wasn't updating the
mode flags.
Josh Durgin [Mon, 20 Oct 2014 20:29:13 +0000 (13:29 -0700)]
Objecter: resend linger ops on any interval change
Watch/notify ops need to be resent after a pg split occurs, as well as
a few other circumstances that the existing objecter checks did not
catch.
Refactor the check the OSD uses for this to add a version taking the
more basic types instead of the whole OSD map, and stash the needed
info when an op is sent.
Samuel Just [Wed, 22 Oct 2014 19:43:55 +0000 (12:43 -0700)]
FDCache: purge hoid on clear
We no longer require that a lock on the FD be held for the duration of an
operation, only while accessing the actual index. We cannot, therefore, assume
that a racing read during lfn_unlink (backfill or scrub) does not still have a
reference to the fd. We want to remove the fd from the cache to prevent
subsequent operations from finding it while allowing such a racing read to
complete with its existing fd.
Fixes: #9480 Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 22 Oct 2014 19:40:14 +0000 (12:40 -0700)]
shared_cache::add: do not delete value if existed
The method contract specifies that we do not want to delete
value if we are not inserting it, so do not initialize val
at the top of the function to take over value. No current
users appear to trip over this problem (FDCache and
map_cache).
We are dropping the requirement for MON_CAP_R for MMonGetMap.
Reason is simple enough: clients may need to contact the monitors and
obtain the latest monmap before authenticating. This happens, for
instance, when a client calls MonClient::get_monmap_privately(). The
osd uses this function during mkfs, prior to initializing a keyring or
even so much as existing.
Fixes: #9859 Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Samuel Just [Mon, 20 Oct 2014 21:10:58 +0000 (14:10 -0700)]
PG:: reset_interval_flush and in set_last_peering_reset
If we have a change in the prior set, but not in the up/acting set, we go back
through Reset in order to reset peering state. Previously, we would reset
last_peering_reset in the Reset constructor. This did not, however, reset the
flush_interval, which caused the eventual flush event to be ignored and the
peering messages to not be sent.
Instead, we will always reset_interval_flush if we are actually changing the
last_peering_reset value.
Fixes: #9821
Backport: firefly Signed-off-by: Samuel Just <sam.just@inktank.com>
Loic Dachary [Mon, 13 Oct 2014 12:48:27 +0000 (14:48 +0200)]
erasure-code: add ErasureCode::encode unit test
Re-create and describe the situation that is fixed by 91a7e18f60bbc9acab3045baaa1b6505474ec4a9 which reworks the buffer
preparation function provided by ErasureCode::encode.
Janne Grunau [Mon, 29 Sep 2014 12:34:31 +0000 (14:34 +0200)]
erasure code: use 32-byte aligned buffers
Requiring page aligned buffers and realigning the input if necessary
creates measurable oberhead. ceph_erasure_code_benchmark is between
10-20% faster depending on the workload.
Also prevents a misaligned buffer when bufferlist::c_str(bufferlist)
has to allocate a new buffer to provide continuous one. See bug #9408
Loic Dachary [Mon, 13 Oct 2014 14:32:18 +0000 (16:32 +0200)]
common: add an aligned buffer with less alignment than a page
SIMD optimized erasure code computation needs aligned memory. Buffers
aligned to a page boundary are not needed though. The buffers used
for the erasure code computation are typical smaller than a page.
The typical alignment requirements SIMD operations are 16 bytes for
SSE2 and NEON and 32 bytes for AVX/AVX2.
Add new prototypes with an align argument, similar to the one enforcing
page alignment. The implementation is exactly the same, except for the
align parameter. The page alignment method are then implemented as calls
to the more generic methods.
The align parameter is an unsigned (same type as CEPH_PAGE_SIZE). The
CEPH_PAGE_MASK value ( ~(CEPH_PAGE_SIZE - 1) ) was only used as
~CEPH_PAGE_MASK, i.e. equivalent of (CEPH_PAGE_SIZE - 1) once the double
~~ is reduced. These occurrence are replaced with (align - 1). The type
of CEPH_PAGE_MASK is an unsigned long which probably because it was
~(CEPH_PAGE_SIZE). When using (align - 1) as a mask for both
CEPH_PAGE_SIZE and SIMD alignment there is no need to use an unsigned
long because there is no risk of overflowing the unsigned value.
The CYGWIN specific code is also modified but not tested.
Unit tests are added for the new methods.
Signed-off-by: Janne Grunau <j@jannau.net> Signed-off-by: Loic Dachary <loic-201408@dachary.org>
Adam Crume [Wed, 8 Oct 2014 00:45:53 +0000 (17:45 -0700)]
Fix read performance regression in ObjectCacher
The regression was introduced in commit 4fc9fffc494abedac0a9b1ce44706343f18466f1. The problem is that the cache
thinks it's full (when it's not), so it defers the read. This change
frees up cache space if necessary and only defers the read if enough
space cannot be freed.
qa/workunits: cephtool: don't remove self's key on auth tests
Suites run with CEPH_TEST_CLI_DUP_COMMAND=1, which will send a duplicate
command for every command issued with the 'ceph' tool. Behavior is to
get a reply from the command and then send a duplicate, looking for the
same outcome (guaranteeing idempotency of the operations). However, it
so happens that if you remove the entity's own key from the keyring and
you happen to be unlucky enough so that the client's connection gets
failed (we also run tests with connection failure injections), the
'ceph' tool won't be able to reconnect to the cluster to send the
duplicate command (as it's entity no longer exists in the cluster's
keyring).
We rewrite the test instead of resorting to ugly hacks to work around
this behavior, simply having a new 'role-definer' added by the existing
'role-definer' (which we weren't testing anyway, so bonus points for
that) and then have one removing the other (to test the procedure) and
finally using 'client.admin' to remove the last 'role-definer'.
Fixes: #9820 Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Guang Yang [Mon, 13 Oct 2014 04:18:45 +0000 (04:18 +0000)]
The fix for issue 9614 was not completed, as a result, for those erasure coded PGs with one OSD down, the state was wrongly marked as active+clean+degraded. This patch makes sure the clean flag is not set for such PG. Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Greg Farnum [Thu, 9 Oct 2014 22:12:19 +0000 (15:12 -0700)]
mds: Locker: fix a NULL deref in _update_cap_fields
The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.