Josh Durgin [Fri, 27 Dec 2013 01:38:52 +0000 (17:38 -0800)]
librbd: call user completion after incrementing perfcounters
The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object has not yet run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.
The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.
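A minimal sketch of the ordering constraint, using hypothetical names (ImageCtx::perfcounter_inc and user_callback are stand-ins, not librbd's actual members):

#include <functional>

struct ImageCtx {                      // stand-in for librbd's image context
  void perfcounter_inc() { /* bump latency/op counters */ }
};

struct AioCompletion {
  ImageCtx *ictx;
  std::function<void()> user_callback;

  void complete() {
    ictx->perfcounter_inc();  // final touch of ictx and its perfcounters
    user_callback();          // the user may close the image here,
                              // invalidating ictx and the perfcounters
    // remaining cleanup uses only the AioCompletion's own state
  }
};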
Alexandre Oliva [Thu, 19 Dec 2013 16:09:46 +0000 (08:09 -0800)]
mds: fix Resetter locking
ceph-mds --reset-journal didn't work; it would deadlock waiting for
the osdmap. Comparing the init code in the Dumper (that worked) with
that in the Resetter (that didn't), I noticed the lock had to be
released before waiting for the osdmap.
Now the resetter works. However, both the resetter and the dumper
fail an assertion after they've performed their task; I didn't look
into it:
../../src/msg/SimpleMessenger.cc: In function 'void SimpleMessenger::reaper()' thread 7fdc188d27c0 time 2013-12-19 04:48:16.930895
../../src/msg/SimpleMessenger.cc: 230: FAILED assert(!cleared)
ceph version 0.72.1-6-g6bca44e (6bca44ec129d11f1c4f38357db8ae435616f2c7c)
1: (SimpleMessenger::reaper()+0x706) [0x880da6]
2: (SimpleMessenger::wait()+0x36f) [0x88180f]
3: (Resetter::reset()+0x714) [0x56e664]
4: (main()+0x1359) [0x562769]
5: (__libc_start_main()+0xf5) [0x3632e21b45]
6: /l/tmp/build/ceph/build/src/ceph-mds() [0x564e49]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-12-19 04:48:16.934093 7fdc188d27c0 -1 ../../src/msg/SimpleMessenger.cc: In function 'void SimpleMessenger::reaper()' thread 7fdc188d27c0 time 2013-12-19 04:48:16.930895
../../src/msg/SimpleMessenger.cc: 230: FAILED assert(!cleared)
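A compilable sketch of the locking fix above; subscribe_to_osdmap and wait_for_osdmap are hypothetical stubs, and the real code uses Ceph's Mutex rather than std::mutex:

#include <mutex>

std::mutex lock;                 // stand-in for the Resetter's init lock
void subscribe_to_osdmap() {}    // hypothetical stub
void wait_for_osdmap() {}        // blocks until the dispatch thread, which
                                 // also takes `lock`, delivers the map

void init() {
  lock.lock();
  subscribe_to_osdmap();
  lock.unlock();        // release first, as the Dumper's init code does;
  wait_for_osdmap();    // waiting while holding `lock` deadlocks dispatch
}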
Loic Dachary [Wed, 18 Dec 2013 16:16:08 +0000 (17:16 +0100)]
crush: silence error messages in unit tests
The error messages are intentional: they appear when the tests
deliberately create error conditions. However, they produce false
positives in the gitbuilder log parser whenever the string "error" is
found. The --debug-crush flag is detected to allow the caller to reset
the verbosity level.
Sage Weil [Tue, 17 Dec 2013 17:28:43 +0000 (09:28 -0800)]
mon: warn if crush has non-optimal tunables
Allow the warning to be disabled via ceph.conf. Link to the docs from
the warning detail. Add a section to the docs specifically about what
to do about the warning.
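For example, the check can be switched off with a ceph.conf entry along these lines (the exact option name should be checked against the docs the warning links to):

[mon]
    mon warn on legacy crush tunables = false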
Laurent Barbe [Wed, 18 Dec 2013 13:20:24 +0000 (14:20 +0100)]
upstart: add rbdmap script
Upstart script for mapping and unmapping rbd devices based on the
/etc/ceph/rbdmap file. It does not mount or unmount filesystems; that
part should be performed via the _netdev option in fstab.
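For illustration, an rbdmap entry plus the matching fstab line might look like this (pool, image, and mount point are hypothetical):

# /etc/ceph/rbdmap: one "<pool>/<image> <map options>" entry per line
rbd/myimage id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

# /etc/fstab: mount the mapped device; _netdev defers the mount until
# the network (and thus the rbd mapping) is up
/dev/rbd/rbd/myimage /mnt/myimage ext4 defaults,noatime,_netdev 0 0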
Loic Dachary [Tue, 17 Dec 2013 19:26:01 +0000 (20:26 +0100)]
erasure-code: tests must use aligned buffers
The underlying code assumes the memory buffer is aligned on a long
boundary, which is not always the case. Using buffer::create_page_aligned,
which calls posix_memalign, ensures the allocated buffer starts at a
properly aligned address.
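A short sketch of the test-side pattern, assuming the Ceph tree's include/buffer.h; the helper name is made up:

#include <string.h>
#include "include/buffer.h"

// build a page-aligned payload instead of relying on the default
// bufferlist allocation, which may not be long-aligned
static bufferlist make_aligned_payload(unsigned len)
{
  bufferptr ptr = buffer::create_page_aligned(len);
  memset(ptr.c_str(), 0x55, len);   // deterministic test pattern
  bufferlist bl;
  bl.push_back(ptr);
  return bl;
}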
Loic Dachary [Mon, 16 Dec 2013 16:13:27 +0000 (17:13 +0100)]
qa: vstart wrapper helper for unittests
Primarily useful to run scripts from qa/workunits as part of make check.
vstart_wrapper.sh starts a vstart.sh cluster, runs the command given as
an argument, and tears down the cluster when it completes.
The vstart_wrapped_tests.sh script contains the list of scripts that
need vstart_wrapper.sh to run. It would not be necessary if automake
allowed passing arguments to test scripts. It also adds markers to the
output to make searching easier, because the output can be very verbose.
This wrapper is kept simple and will probably evolve into something more
sophisticated depending on the scripts being added to
vstart_wrapper_tests.sh. There are numerous options, ranging from
parsing the yaml from ceph-qa-suite to figure out the cluster
configuration, to converting the same yaml into a puppet manifest that
is applied locally, or even driving OpenStack instances to avoid messing
with the local machine. But that would probably be overkill at this
point.
Ilya Dryomov [Tue, 17 Dec 2013 15:42:30 +0000 (17:42 +0200)]
rbd: make coverity happy
A recent coverity run found two "defects" in rbd.cc:
** CID 1138367: Time of check time of use (TOCTOU)
/rbd.cc: 2024 in do_kernel_rm(const char *)()
2019 const char *fname = "/sys/bus/rbd/remove_single_major";
2020 if (stat(fname, &sbuf)) {
2021 fname = "/sys/bus/rbd/remove";
2022 }
2023
2024 int fd = open(fname, O_WRONLY);
2025 if (fd < 0) {
** CID 1138368: Time of check time of use (TOCTOU)
/rbd.cc: 1735 in do_kernel_add(const char *, const char *, const char *)()
same as above, s/remove/add
There is nothing racy going on here, and this is not an instance of
TOCTOU, but, instead of silencing coverity with annotations, redo this
with two open() calls.
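A sketch of the two-open() approach; the helper name is hypothetical:

#include <fcntl.h>

// try the single-major control file first, then the legacy one; with no
// separate stat() there is no check-then-use window for coverity to flag
static int open_rbd_ctl(const char *primary, const char *fallback)
{
  int fd = open(primary, O_WRONLY);
  if (fd < 0)
    fd = open(fallback, O_WRONLY);
  return fd;   // < 0 if neither file could be opened
}

// e.g. open_rbd_ctl("/sys/bus/rbd/remove_single_major", "/sys/bus/rbd/remove")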
Loic Dachary [Mon, 16 Dec 2013 15:27:34 +0000 (16:27 +0100)]
vstart/stop: use pkill instead of killall
killall fails to kill all OSDs when called as a one-liner. Replace it
with a loop using pkill that retries until there are no more processes
with the required name left to kill.
Loic Dachary [Mon, 16 Dec 2013 13:36:26 +0000 (14:36 +0100)]
qa: recursively remove .gcno and .gcda
Instead of removing them only in the current directory. Leftovers
prevent running make check-coverage properly because lcov fails
when stumbling on old .gcno files with
lcov -d . -c -i -o check-coverage_base_full.lcov
Processing os/BtrfsFileStoreBackend.gcno
geninfo: ERROR: ceph/src/os/BtrfsFileStoreBackend.gcno: reached unexpected end of file
Sage Weil [Tue, 17 Dec 2013 05:33:07 +0000 (21:33 -0800)]
crush/mapper: generalize descend_once
The legacy behavior is to make the normal number of tries for the
recursive chooseleaf call. The descend_once tunable changed this to
making a single try and bailing out if we get a reject (note that it
is impossible to collide in the recursive case).
The new set_chooseleaf_tries step lets you select the number of
recursive chooseleaf attempts for indep mode, defaulting to 1. Use the
same behavior for firstn, except default to total_tries when the
legacy tunables are set (for compatibility). This makes the rule step
override the (new) default of 1 recursive attempt, keeping behavior
consistent with indep mode.
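In a CRUSH rule this looks roughly like the following (rule name and values are illustrative):

rule replicated_example {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step set_chooseleaf_tries 5        # override the default of 1 attempt
    step chooseleaf firstn 0 type host
    step emit
}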
Ilya Dryomov [Mon, 16 Dec 2013 16:57:22 +0000 (18:57 +0200)]
FileJournal: switch to get_linux_version()
For the purposes of FileJournal::_check_disk_write_cache(), use
get_linux_version(), which is based on uname(2), instead of parsing the
contents of /proc/version.
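A plausible sketch of the uname(2) approach; the real get_linux_version() may differ in details:

#include <linux/version.h>   // KERNEL_VERSION()
#include <sys/utsname.h>
#include <stdio.h>

// return a KERNEL_VERSION()-style code, or 0 if the release string
// cannot be parsed
static int linux_version_code(void)
{
  struct utsname u;
  int a = 0, b = 0, c = 0;
  if (uname(&u) < 0 || sscanf(u.release, "%d.%d.%d", &a, &b, &c) < 2)
    return 0;
  return KERNEL_VERSION(a, b, c);
}

// _check_disk_write_cache() can then compare against, say,
// KERNEL_VERSION(2, 6, 33) instead of string-matching /proc/version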
Yan, Zheng [Tue, 26 Nov 2013 10:32:18 +0000 (18:32 +0800)]
mds: simplify how to export non-auth caps
Introduce a new flag in the cap import message. If the client finds
the flag is set, it releases the exporter's caps (sends a release to
the exporter). This saves the cap export message and an "mds to mds"
message.
Yan, Zheng [Tue, 26 Nov 2013 09:19:04 +0000 (17:19 +0800)]
mds: send cap import messages to clients after importing subtree succeeds
When importing a subtree, the importer sends cap import messages to
clients before the import operation is considered successful. If the
exporter crashes before the EExport event is journalled, the importer
needs to re-export client caps. This confuses clients and makes them
lose track of auth caps.
Yan, Zheng [Tue, 26 Nov 2013 07:10:29 +0000 (15:10 +0800)]
mds: re-send cap exports in resolve message.
For a rename operation that changes an inode's authority: if the
master mds of the operation crashed, the inode's original auth mds
sends cap export messages to clients when it receives the master mds'
resolve ack message. The client can't rely on that export message to
add caps for the master mds and then reconnect the caps when the
master mds enters the reconnect stage, because the client may receive
the export message only after receiving the mdsmap that claims the
master mds is in the reconnect stage.
The fix is to include cap exports in the resolve message, so the
master mds can send import messages to clients when it enters the
rejoin stage.
Yan, Zheng [Tue, 26 Nov 2013 03:02:49 +0000 (11:02 +0800)]
mds: include counterpart's information in cap import/export messages
When exporting inodes with client caps, the importer sends cap import
messages to clients and the exporter sends cap export messages to
clients. A client can receive these two messages in either order. If a
client first receives the cap import message, it adds the imported
caps, but the caps from the exporter are still considered valid. This
can compromise consistency. If the MDS crashes while importing caps,
clients receive only cap export messages and no cap import messages.
These clients don't know which MDS is the cap importer, so they can't
send cap reconnects when the MDS recovers.
We can handle both issues by including the counterpart's information
in cap import/export messages. If a client first receives the cap
import message, it adds the imported caps, then removes the exporter's
caps. If a client first receives the cap export message, it removes
the exported caps, then adds caps for the importer.
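A minimal model of the client-side handling described above (types and names are stand-ins, not the actual client code):

#include <map>

struct Cap { int issued; };
static std::map<int, Cap> caps;   // caps held by this client, per mds rank

// with the peer's rank carried in each message, the client converges to
// the same state whichever message arrives first
void handle_cap_import(int importer, int exporter, Cap c) {
  caps[importer] = c;     // add caps for the new auth mds
  caps.erase(exporter);   // and drop the old mds' caps immediately
}

void handle_cap_export(int exporter, int importer, Cap c) {
  caps.erase(exporter);   // drop the old mds' caps
  caps[importer] = c;     // and add caps for the importer
}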
Yan, Zheng [Tue, 26 Nov 2013 02:31:07 +0000 (10:31 +0800)]
mds: send info of imported caps back to the exporter (rename)
Use an MMDSSlaveRequest::OP_FINISH slave request to send information
about rename-imported caps back to the exporter. This is preparation
for including the counterpart's information in cap import/export
messages.
Yan, Zheng [Tue, 26 Nov 2013 02:17:30 +0000 (10:17 +0800)]
mds: send info of imported caps back to the exporter (cache rejoin)
Use the cache rejoin ack message to send information about
rejoin-imported caps back to the exporter. Also move the code that
exports reconnect caps into MDCache::handle_cache_rejoin_ack().
This is preparation for including the counterpart's information in cap
import/export messages.
Yan, Zheng [Tue, 26 Nov 2013 01:49:21 +0000 (09:49 +0800)]
mds: send info of imported caps back to the exporter (export dir)
Introduce a new class, Capability::Import, and use it to send
information about imported caps back to the exporter. This is
preparation for including the counterpart's information in cap
import/export messages.
Yan, Zheng [Fri, 25 Oct 2013 08:30:49 +0000 (16:30 +0800)]
mds: flush session messages before exporting caps
The following sequence of events can happen when exporting inodes:
- client sends an open file request to mds.0
- mds.0 handles the request and sends the inode stat back to the client
- mds.0 exports the inode to mds.1
- mds.1 sends a cap import message to the client
- mds.0 sends a cap export message to the client
- the client receives the cap import message from mds.1, but it still
doesn't have the corresponding inode in its cache, so it releases the
imported caps
- the client receives the open file reply from mds.0
- the client receives the cap export message from mds.0
At the end of these events, the client doesn't hold any caps for the
opened file.
To fix this message ordering issue, this patch introduces a new session
operation, FLUSHMSG. Before exporting caps, we send a FLUSHMSG session
message to the client and wait for the acknowledgment. When we receive
the FLUSHMSG_ACK message from the client, we are sure the client has
received all messages sent previously.
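A minimal model of the handshake (types and helpers are stand-ins; only the FLUSHMSG/FLUSHMSG_ACK ops come from the patch):

#include <queue>

enum SessionOp { FLUSHMSG, FLUSHMSG_ACK };

struct Session {
  std::queue<SessionOp> to_client, to_mds;  // in-order message channels
  bool waiting_for_flush = false;
};

// mds side: flush the session channel before exporting caps
void mds_prepare_export(Session &s) {
  s.to_client.push(FLUSHMSG);
  s.waiting_for_flush = true;   // the export proceeds only after the ack
}

// client side: messages are handled in order, so the ack proves every
// earlier message (e.g. the open file reply) was already processed
void client_handle(Session &s, SessionOp op) {
  if (op == FLUSHMSG)
    s.to_mds.push(FLUSHMSG_ACK);
}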
Yan, Zheng [Mon, 18 Nov 2013 09:59:06 +0000 (17:59 +0800)]
mds: increase cap sequence when sharing max size
Consider this case:
- client voluntarily releases some caps through a cap update message
- mds shares the new max size by sending a cap grant message
- mds receives the cap update message
If the mds doesn't increase the cap sequence when sharing the max size,
it can't determine whether the cap update message was sent before or
after the client received the cap grant message that updates the max
size.
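A minimal model of the disambiguation; names are stand-ins:

struct CapState { unsigned seq = 0; };   // per-cap sequence on the mds

// a grant that shares a new max size carries a bumped sequence
unsigned mds_share_max_size(CapState &cap) {
  return ++cap.seq;
}

// a client cap update echoes the last seq it saw, so the mds can tell
// whether the update predates the max size grant
bool update_predates_grant(const CapState &cap, unsigned update_seq) {
  return update_seq < cap.seq;
}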
Yan, Zheng [Mon, 18 Nov 2013 03:06:43 +0000 (11:06 +0800)]
mds: include inode version in auth mds' lock messages
Encode the inode version in the auth mds' lock messages, so that the
version of replica inodes gets updated. This is important because the
client uses the inode version in the mds reply to check whether its
cached inode is already up-to-date; it skips updating the inode if it
thinks the inode is already up-to-date.
Yan, Zheng [Sun, 17 Nov 2013 09:03:29 +0000 (17:03 +0800)]
mds: wait for slave request to finish
If the MDS receives a client request but finds an existing slave
request, it's possible that another MDS forwarded the request to us
and the MMDSSlaveRequest::OP_FINISH message arrives after the client
request. In that case, wait for the slave request to finish before
processing the client request.
Yan, Zheng [Tue, 12 Nov 2013 08:12:25 +0000 (16:12 +0800)]
mds: keep dentry lock in sync state
Unlike locks of other types, a dentry lock in an unreadable state can
block path traversal, so it should be kept in the sync state as much
as possible.
This patch makes Locker::try_eval() change a dentry lock's state to
sync even when the dentry is freezing. It also makes the migrator
check imported dentries' lock states and change them to sync if
necessary.
Yan, Zheng [Thu, 7 Nov 2013 09:07:51 +0000 (17:07 +0800)]
mds: fix empty directory check
Since commit 310032ee81 (fix mds scatter_writebehind starvation),
rdlocking a scatter lock does not always propagate dirty fragstats to
the corresponding inode. So Server::_dir_is_nonempty() needs to check
each dirfrag's stat instead of checking the inode's dirstat.
Yan, Zheng [Wed, 6 Nov 2013 01:42:43 +0000 (09:42 +0800)]
mds: handle cache rejoin corner case
A recovering MDS may receive a strong cache rejoin from a survivor;
if that survivor then restarts, the recovering MDS receives a weak
cache rejoin from the same MDS. Before processing the weak cache
rejoin, we should scour replicas added by the obsolete strong cache
rejoin.
Yan, Zheng [Wed, 6 Nov 2013 01:28:51 +0000 (09:28 +0800)]
mds: unify nonce type
MDSCacheObject::replica_nonce is defined as __s16, but the nonce type
in MDSCacheObject::replica_map is int. This mismatch may confuse
MDCache::handle_cache_expire().
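A self-contained illustration of why the width mismatch matters (not the mds code):

#include <cstdint>
#include <cstdio>

int main() {
  int nonce = 0x10001;            // nonce as stored in replica_map (int)
  int16_t replica_nonce = nonce;  // truncated through the __s16 field
  // prints "65537 vs 1": the truncated nonce no longer matches
  std::printf("%d vs %d\n", nonce, replica_nonce);
}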