Ilya Dryomov [Tue, 18 Mar 2014 16:06:12 +0000 (18:06 +0200)]
qa: test_alloc_hint: flush journal before prodding the FS
OSDs that for some reason get behind on processing their op queue break
expect_alloc_hint_eq(), as it pokes the FS and not the journal. Fix it
by flushing the journal before proceeding with anything else.
Ilya Dryomov [Tue, 18 Mar 2014 16:06:12 +0000 (18:06 +0200)]
osd: add flush_journal admin socket command
Add flush_journal admin socket command to be able to flush journal to
the permanent store for online osds. (For offline osds we already have
ceph-osd --flush-journal.)
Samuel Just [Sun, 16 Mar 2014 00:58:35 +0000 (17:58 -0700)]
OSD::handle_pg_query: on dne pg, send lb=hobject_t() if deleting
We will set lb=hobject_t() if we resurrect the pg. In that case,
we need to have sent that to the primary beforehand. If we
finish the removal before the pg is recreated, we'll just end
up backfilling it, which is ok since the pg doesn't exist anyway.
Fixes: #7740 Signed-off-by: Samuel Just <sam.just@inktank.com>
Yan, Zheng [Sat, 15 Mar 2014 12:37:37 +0000 (20:37 +0800)]
mds: fix corner case of pushing inline data
The following sequence of events can happen.
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply doesn't
contain inline data because MDS didn't receive the cap release
message and thought client already has up-to-date inline data.
The fix is to trigger a getattr if the client finds inline_version is zero.
The getattr mask is set to CEPH_STAT_CAP_INLINE_DATA, so that MDS knows
the client does not have inline data.
Sage Weil [Fri, 14 Mar 2014 23:32:48 +0000 (16:32 -0700)]
osd/ReplicatedPG: fix enqueue_front race
When requeuing an item at the front, we need to shuffle the items in
pg_for_processing if there is an entry for this PG there. If so, we need
to hold the qlock for the duration of the requeue of the shuffled item
back into the primary queue in order to avoid reshuffling items. For
example, consider the queue has
A B C D
- dequeue1 gets (pg, A), puts A in the processing list
- dequeue1 tries to lock pg, blocks
- enqueue_front on X takes qlock, swaps it for A, drops qlock
- dequeue2 gets (pg, B), puts B in the processing list
- enqueue_front pushes X back into the original list
so we have processing: X B queue: A C D
- dequeue* get X, then B, then A C D
If we hold qlock for the duration of the enqueue_front, we avoid dequeue2
sneaking in and shuffling B into the processing list before we have
crammed A back onto the front of the list.
This may have caused #7712.
Backport: emperor, dumpling Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Fri, 14 Mar 2014 21:48:31 +0000 (14:48 -0700)]
PG::issue_repop: only adjust peer_info last_updates if not temp
Temp object repops have version eversion_t() since they don't
actually send log entries. Updating the last_updates here
caused the peer info last_updates to be incorrect until the
next non-temp repop.
Fixes: #7718 Signed-off-by: Samuel Just <sam.just@inktank.com>
Danny Al-Gaaf [Wed, 12 Mar 2014 21:56:44 +0000 (22:56 +0100)]
RGWListBucketMultiparts: init max_uploads/default_max with 0
CID 717377 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member "max_uploads" is not initialized
in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member "default_max" is not initialized
in this constructor nor in any functions that it calls.
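A minimal sketch of the kind of fix this report calls for, assuming a simplified stand-in class (`ListMultipartsSketch` is hypothetical and is not the real RGW class; the `int` member types are also an assumption). Both scalars named in the report are zero-initialized in the constructor's initializer list:

```cpp
// Hypothetical stand-in for the class named in the Coverity report; only the
// two members it mentions are modeled, and their types are assumed.
struct ListMultipartsSketch {
  int max_uploads;
  int default_max;

  // Initialize both scalars so no code path can read an indeterminate value.
  ListMultipartsSketch() : max_uploads(0), default_max(0) {}
};
```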
Sage Weil [Fri, 14 Mar 2014 19:46:57 +0000 (12:46 -0700)]
unittest_ceph_argparse: fix warnings
In file included from test/ceph_argparse.cc:17:0:
../src/gtest/include/gtest/gtest.h: In function ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = int, T2 = long unsigned int]’:
../src/gtest/include/gtest/gtest.h:1333:30: instantiated from ‘static testing::AssertionResult testing::internal::EqHelper::Compare(const char*, const char*, const T1&, const T2&) [with T1 = int, T2 = long unsigned int]’
test/ceph_argparse.cc:344:207: instantiated from here
warning: ../src/gtest/include/gtest/gtest.h:1263:3: comparison between signed and unsigned integer expressions [-Wsign-compare]
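The log above shows an `int` being compared against a `long unsigned int` inside gtest's comparison helper. The commit's actual change is not shown here; one common way to avoid this class of warning, sketched with a hypothetical helper `has_n_args`, is to make the expected value unsigned so both operands have the same signedness:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: the expected count is a size_t, matching the type
// returned by size(), so the comparison never mixes signed and unsigned.
inline bool has_n_args(const std::vector<const char*>& args,
                       std::size_t expected) {
  return args.size() == expected;
}
```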
Samuel Just [Fri, 14 Mar 2014 20:09:30 +0000 (13:09 -0700)]
PG: clear want_pg_temp in clear_primary_state only if primary
Clearing it in that way in on_shutdown() can cause a stray
shard to clobber the want_pg_temp value created by the primary
shard on the same osd. Thus, instead only clear it if we are
the primary.
Fixes: #7719 Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 14 Mar 2014 18:02:30 +0000 (11:02 -0700)]
mon: only do timecheck with known monmap
If we are still on monmap epoch 0, our mon ranks cannot yet be trusted
since there is not yet a shared source of truth from paxos. If we do
timechecks, the code gets confused about the ranks in e.g. the
timecheck_waiting map.
Fixes: #7692 Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Fri, 14 Mar 2014 01:16:19 +0000 (18:16 -0700)]
PG::activate: handle peer contiguous with primary, but not auth_log
The added case covers a situation where a replica is not contiguous with
the auth_log, but is contiguous with the primary. Reshuffling the
active set to handle this would be tricky, so instead we just go ahead
and backfill it anyway. This is probably preferable in any case since
the replica in question would have to be significantly behind.
Fixes: #7696 Signed-off-by: Samuel Just <sam.just@inktank.com>
ceph_mon: split postfork() in two and finish postfork just before daemonize
We split global_init_postfork() in two: start and finish, with the first
keeping much of postfork()'s tasks except closing stderr, which we leave
open until just before we daemonize. This allows the user to see any
error messages that the monitor may spit out before it daemonizes, making
sense of the error code (which we were already returning).
Fixes: 7489 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Fri, 14 Mar 2014 05:02:01 +0000 (22:02 -0700)]
osd/ReplicatedPG: release op locks on commit+applied
We were releasing the op locks when we applied the update but (potentially)
before we committed it. This means that another client can read object
state that is not yet durable.
Fixes: #7709 Signed-off-by: Sage Weil <sage@inktank.com>
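The ordering rule described above can be sketched as follows. This is an illustration under stated assumptions, not the ReplicatedPG code: `OpLockSketch` and its members are hypothetical names, and the "lock" is modeled as a single flag that flips only once both events have been seen, in either order.

```cpp
// Hypothetical model of one op: the lock is released only after the update
// has been both applied and committed, regardless of which event comes first.
struct OpLockSketch {
  bool applied = false;
  bool committed = false;
  bool released = false;

  void on_applied() { applied = true; maybe_release(); }
  void on_committed() { committed = true; maybe_release(); }

 private:
  void maybe_release() {
    if (applied && committed) {
      released = true;  // readers may now observe the (durable) object state
    }
  }
};
```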
Sage Weil [Mon, 10 Mar 2014 20:52:54 +0000 (13:52 -0700)]
osd: set default cache_target_{dirty,full}_ratios based on configurable
These were hard-coded in the pg_pool_t constructor, but that was a dumb
idea.
Note that decoding legacy pg_pool_t's no longer does what it used to. I'm
pretty sure that's okay since we care less about interim releases and
because we are pulling these normally out of OSDMap, which is freshly
encoded on a regular basis (and certainly recently with real values). Also,
let's not forget that this field is meaningless on old pools anyway.
Samuel Just [Thu, 13 Mar 2014 21:04:19 +0000 (14:04 -0700)]
PrioritizedQueue: cap costs at max_tokens_per_subqueue
Otherwise, you can get a recovery op in the queue which has a cost
higher than the max token value. It won't get serviced until all other
queues also do not have enough tokens and higher priority queues are
empty.
Fixes: #7706 Signed-off-by: Samuel Just <sam.just@inktank.com>
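A minimal sketch of the capping idea, assuming costs and token limits are plain unsigned integers (`capped_cost` is a hypothetical helper, not the PrioritizedQueue API). Clamping the cost to the per-subqueue ceiling means a full token bucket is always enough to pay for the op:

```cpp
#include <algorithm>

// Hypothetical helper: never let an op cost more than a subqueue can hold.
inline unsigned capped_cost(unsigned cost, unsigned max_tokens_per_subqueue) {
  return std::min(cost, max_tokens_per_subqueue);
}
```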
Yehuda Sadeh [Thu, 13 Mar 2014 18:25:24 +0000 (11:25 -0700)]
rgw: manifest holds the actual bucket used for tail objects
Fixes: 7703
Objects can be copied between different buckets, so we need to keep track
of which bucket is used for naming the tail parts. The new manifest
requires that because the older manifest just held all the tail objects
(each containing the appropriate bucket internally).
Sage Weil [Thu, 13 Mar 2014 18:22:34 +0000 (11:22 -0700)]
rbd-fuse: fix signed/unsigned warning
rbd_fuse/rbd-fuse.c: In function 'enumerate_images':
rbd_fuse/rbd-fuse.c:113:2: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
Samuel Just [Tue, 11 Mar 2014 21:23:10 +0000 (14:23 -0700)]
PG: do not wait for flushed before activation
This should reduce the sting of the previous commit somewhat. We wait
for the activation transactions to clear prior to accepting IO anyway,
so we can go ahead and get that process started without waiting for the
flush.
Samuel Just [Tue, 11 Mar 2014 17:31:55 +0000 (10:31 -0700)]
PG: do not serve requests until replicas have activated
There are two problems:
1) We choose the min last_update among peers with the max local-les
value as an upper bound on requests which could have been reported to
the client as committed. We then, for ec pools, roll back to that point
to ensure that we don't inadvertently commit to an update which fewer
than K replicas actually saw. If the primary sets local-les, accepts an
update from a client, and there is a new interval before any of the
replicas have been activated, we will end up being forced to use that
update which no other replica has seen as the new last_update. This
will cause the object to become unfound. We don't have this problem as
long as all active replicas agree on last_update before we accept IO.
2) Even for replicated pools, we would then immediately respond to the
request which created the primary-only update with a commit since it is
in the log and we have no outstanding repops. If we then lose that
primary before any of the replicas in the new interval record the new
log, we will not only lose the object, but also the log entry recording
it, which will result in a lost write.
For these reasons, it seems like we need to wait for the replicas to
activate before we can process new requests, because whatever update we
select as last_update is effectively regarded as committed as soon as we
accept IO.
Fixes: #7649 Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Wed, 12 Mar 2014 04:22:57 +0000 (21:22 -0700)]
debian: make ceph depend on ceph-common >= 0.67
The older versions of ceph-common (ceph CLI, in particular) can't talk to
newer clusters. The primary change happened with dumpling when the new
CLI and rest-api changes were made. Although in reality ceph doesn't
care what version of ceph-common is installed, in practice this forces
ceph-common to get upgraded along with ceph and avoids some user pain.
Fixes: #7641 Signed-off-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Wed, 12 Mar 2014 01:19:44 +0000 (18:19 -0700)]
rgw: don't overwrite bucket entry data when syncing user stats
Fixes: #7687
When syncing user bucket stats we overwrote the entire entry with the
passed in entry. We should only look at the stats portion, and not
overwrite the rest (which contains bucket creation time).
Image names buffer is fixed at 1024. This turns out to be not enough:
there are at least two "rbd-fuse rbd_list: error %d Numerical result
out of range" reports on the ML. Fix it by calling rbd_list() twice to
first get the expected buffer size. Also, get rid of the memory leak
and tweak the error message while at it.
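The "call twice" pattern described above can be sketched without linking against librbd. Here `fake_list` is a hypothetical stand-in that only mimics the convention the text implies (report the required size and return -ERANGE when the buffer is too small); it is not the real rbd_list(). Holding the buffer in a `std::vector` also sidesteps the manual-free leak risk.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a listing call. Writes NUL-separated names into
// `names`. If `*size` is too small, stores the required size in `*size` and
// returns -ERANGE; otherwise returns the number of bytes written.
inline int fake_list(char* names, std::size_t* size) {
  static const char data[] = "image-a\0image-b";  // two NUL-separated names
  const std::size_t needed = sizeof(data);        // includes the final NUL
  if (*size < needed) {
    *size = needed;
    return -ERANGE;
  }
  std::memcpy(names, data, needed);
  return static_cast<int>(needed);
}

// First call with size 0 to learn the required size, then call again with a
// buffer of exactly that size.
inline std::vector<char> list_all_names() {
  std::size_t size = 0;
  int ret = fake_list(nullptr, &size);
  if (ret != -ERANGE) {
    return std::vector<char>();  // unexpected result from the size probe
  }
  std::vector<char> buf(size);
  ret = fake_list(buf.data(), &size);
  if (ret < 0) {
    return std::vector<char>();  // second call failed
  }
  buf.resize(static_cast<std::size_t>(ret));
  return buf;
}
```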
Warren Usui [Fri, 21 Feb 2014 05:11:45 +0000 (21:11 -0800)]
Fix get_status() to find client.rados text inside of ps command results.
Added port (fixed value for right now in teuthology) to hostname. Fixes: 7374 Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Signed-off-by: Warren Usui <warren.usui@inktank.com>
(cherry picked from commit 8200b8a02511e367370d33cb74c3d45ef85fca31)
Yan, Zheng [Sun, 9 Mar 2014 23:36:14 +0000 (07:36 +0800)]
mds: fix owner check of file lock
flock and POSIX locks do not use the process ID as the owner identifier.
The process ID of the lock holder is recorded only for F_GETLK fcntl(2).
In the Linux kernel, a file lock's owner identifier is the file pointer
through which the lock is requested.
The fix is to not take the 'pid_namespace' into consideration when
checking for conflicting locks. Also rename the 'pid' fields of struct
ceph_mds_request_args and struct ceph_filelock to 'owner', and rename
the 'pid_namespace' fields to 'pid'.
The kclient counterpart of this patch modifies the flock code to
assign the file pointer to the 'owner' field of the lock message. It
also sets the most significant bit of the 'owner' field. We can use
that bit to distinguish between old and new clients.
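The most-significant-bit convention can be sketched as below, assuming the 'owner' field is 64 bits wide (the width, `kNewOwnerFlag`, `tag_owner`, and `is_new_style_owner` are all assumptions made for illustration, not names from the patch):

```cpp
#include <cstdint>

// Assumption: 'owner' is a 64-bit field and the flag is its top bit.
constexpr std::uint64_t kNewOwnerFlag = std::uint64_t(1) << 63;

// What a new client would do before sending the lock message.
constexpr std::uint64_t tag_owner(std::uint64_t owner) {
  return owner | kNewOwnerFlag;
}

// What the receiver could check to tell old clients from new ones.
constexpr bool is_new_style_owner(std::uint64_t owner) {
  return (owner & kNewOwnerFlag) != 0;
}
```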
Stephan Renatus [Mon, 10 Mar 2014 14:17:41 +0000 (15:17 +0100)]
rbdmap: bugfix upstart script
It seems like the upstart script is lagging a little behind [the initscript](https://github.com/ceph/ceph/blob/master/src/init-rbdmap#L44-L49); however, this bugfix makes it actually do what it should do.
Before, the bug made the job just ignore all parameters, with the following error in /var/log/upstart/rbdmap.log:
Ilya Dryomov [Mon, 10 Mar 2014 08:36:48 +0000 (10:36 +0200)]
FileStore: support compiling without libxfs
When configured with --without-libxfs, use GenericFileStoreBackend
instead of XfsFileStoreBackend for XFS. At this point this would only
impact the allocation hint op. The default is to compile with
--with-libxfs. (Previously it was unconditionally enabled on linux and
disabled for non-linux arches.)
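The compile-time selection described above can be sketched with a preprocessor guard. The macro name `HAVE_LIBXFS` and the helper `backend_for_xfs` are assumptions for illustration; the function returns only the backend's name, not a real backend object.

```cpp
#include <string>

// Assumption: the build defines HAVE_LIBXFS when configured with
// --with-libxfs and leaves it undefined for --without-libxfs.
inline std::string backend_for_xfs() {
#ifdef HAVE_LIBXFS
  return "XfsFileStoreBackend";
#else
  return "GenericFileStoreBackend";
#endif
}
```

Compiled without the macro defined, the helper reports the generic backend, which mirrors the --without-libxfs case.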