Sage Weil [Tue, 17 Dec 2013 17:28:43 +0000 (09:28 -0800)]
mon: warn if crush has non-optimal tunables
Allow warning to be disabled via ceph.conf. Link to the docs from the
warning detail. Add a section to the docs specifically about what to do
about the warning.
Sage Weil [Tue, 17 Dec 2013 05:33:07 +0000 (21:33 -0800)]
crush/mapper: generalize descend_once
The legacy behavior is to make the normal number of tries for the
recursive chooseleaf call. The descend_once tunable changed this to
making a single try and bailing if we get a reject (note that it is
impossible to collide in the recursive case).
The new set_chooseleaf_tries lets you select the number of recursive
chooseleaf attempts for indep mode, or default to 1. Use the same
behavior for firstn, except default to total_tries when the legacy
tunables are set (for compatibility). This makes the rule step
override the (new) default of 1 recursive attempt, keeping behavior
consistent with indep mode.
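A minimal sketch of the defaulting rule described above, written in Python rather than the C of crush/mapper. The function name and its parameters are hypothetical; the sketch assumes "legacy tunables" means the descend_once tunable is off, and only the decision logic is taken from the commit message.

```python
def recursive_chooseleaf_tries(rule_tries, mode, descend_once, total_tries):
    """Pick the number of recursive chooseleaf attempts.

    rule_tries   -- value from a set_chooseleaf_tries rule step, 0 if unset
    mode         -- "firstn" or "indep"
    descend_once -- True when the descend_once tunable is on (non-legacy)
    total_tries  -- the normal number of tries
    """
    if rule_tries:
        return rule_tries      # the rule step overrides any default
    if mode == "firstn" and not descend_once:
        return total_tries     # legacy tunables: keep the old behavior
    return 1                   # new default: a single recursive attempt
```

With no rule step, indep mode always gets 1 attempt, while firstn gets total_tries only under the legacy tunables.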
Ilya Dryomov [Mon, 16 Dec 2013 16:57:22 +0000 (18:57 +0200)]
FileJournal: switch to get_linux_version()
For the purposes of FileJournal::_check_disk_write_cache(), use
get_linux_version(), which is based on uname(2), instead of parsing the
contents of /proc/version.
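To illustrate the uname(2)-based approach, here is a rough Python analogue. It is not the get_linux_version() implementation; the helper names are hypothetical, and the sketch only shows how a version can be taken from the uname release string instead of /proc/version.

```python
import os
import re

def parse_linux_version(release):
    """Return (major, minor, patch) from a release string like '3.11.0-rc2'.

    Returns None if the string does not start with a dotted version.
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        return None
    return tuple(int(g) if g is not None else 0 for g in m.groups())

def linux_version():
    # os.uname() wraps uname(2); the release field holds the kernel version.
    return parse_linux_version(os.uname().release)
```

Tuples compare element by element, so a check such as `linux_version() >= (2, 6, 33)` reads naturally (the threshold here is only an example value).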
Yan, Zheng [Tue, 26 Nov 2013 10:32:18 +0000 (18:32 +0800)]
mds: simplify how to export non-auth caps
Introduce a new flag in cap import message. If client finds the flag
is set, it releases exporter's caps (send release to the exporter).
This saves the cap export message and a "mds to mds" message.
Yan, Zheng [Tue, 26 Nov 2013 09:19:04 +0000 (17:19 +0800)]
mds: send cap import messages to clients after importing subtree succeeds
When importing a subtree, the importer sends cap import messages to clients
before the import subtree operation is considered successful. If the
exporter crashes before EExport event is journalled, the importer needs to
re-export client caps. This confuses clients, and makes them lose track of
auth caps.
Yan, Zheng [Tue, 26 Nov 2013 07:10:29 +0000 (15:10 +0800)]
mds: re-send cap exports in resolve message.
For a rename operation that changes an inode's authority, if the master mds
of the operation crashes, the inode's original auth mds sends export
messages to clients when it receives the master mds' resolve ack
message. The client can't rely on the export message to add caps for
the master mds and then reconnect the caps when the master mds enters
the reconnect stage, because the client may receive the export message
after receiving the mdsmap that claims the master mds is in the reconnect
stage. The fix is to include cap exports in the resolve message, so the
master mds can send import messages to clients when it enters the rejoin stage.
Yan, Zheng [Tue, 26 Nov 2013 03:02:49 +0000 (11:02 +0800)]
mds: include counterpart's information in cap import/export messages
When exporting inodes with client caps, the importer sends cap import
messages to clients, and the exporter sends cap export messages to clients.
A client can receive these two messages in any order. If a client first
receives the cap import message, it adds the imported caps, but the caps
from the exporter are still considered valid. This can compromise
consistency. If the MDS crashes while importing caps, clients may only
receive cap export messages and never receive cap import messages.
These clients don't know which MDS is the cap importer, so they can't
send cap reconnect when the MDS recovers.
We can handle the above issues by including the counterpart's information in
cap import/export messages. If a client first receives the cap import
message, it adds the imported caps, then removes the exporter's
caps. If a client first receives the cap export message, it removes the
exported caps, then adds caps for the importer.
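The order-independence described above can be sketched as follows. This is a toy client-side model in Python, not the Ceph client code; the function names and the dict-based cap table are assumptions made for illustration.

```python
def handle_cap_import(caps, importer, exporter):
    """caps maps an MDS name to the caps the client holds from it."""
    caps[importer] = "imported"
    # The import message names its exporter, so drop the exporter's caps.
    caps.pop(exporter, None)

def handle_cap_export(caps, exporter, importer):
    caps.pop(exporter, None)
    # The export message names its importer, so the client can record
    # caps for the importer even if the import message never arrives.
    caps.setdefault(importer, "imported")
```

Whichever message arrives first, the client ends up holding caps only from the importer.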
Yan, Zheng [Tue, 26 Nov 2013 02:31:07 +0000 (10:31 +0800)]
mds: send info of imported caps back to the exporter (rename)
Use MMDSSlaveRequest::OP_FINISH slave request to send information
of rename imported caps back to the exporter. This is preparation
for including counterpart's information in cap import/export message.
Yan, Zheng [Tue, 26 Nov 2013 02:17:30 +0000 (10:17 +0800)]
mds: send info of imported caps back to the exporter (cache rejoin)
Use cache rejoin ack message to send information of rejoin imported
caps back to the exporter. Also move the code that exports reconnect
caps to MDCache::handle_cache_rejoin_ack()
This is preparation for including counterpart's information in cap
import/export message.
Yan, Zheng [Tue, 26 Nov 2013 01:49:21 +0000 (09:49 +0800)]
mds: send info of imported caps back to the exporter (export dir)
Introduce a new class Capability::Import and use it to send information
of imported caps back to the exporter. This is preparation for including
counterpart's information in cap import/export message.
Yan, Zheng [Fri, 25 Oct 2013 08:30:49 +0000 (16:30 +0800)]
mds: flush session messages before exporting caps
Following sequence of events can happen when exporting inodes:
- client sends open file request to mds.0
- mds.0 handles the request and sends inode stat back to the client
- mds.0 exports the inode to mds.1
- mds.1 sends cap import message to the client
- mds.0 sends cap export message to the client
- client receives the cap import message from mds.1, but the client
still doesn't have corresponding inode in the cache. So the client
releases the imported caps.
- client receives the open file reply from mds.0
- client receives the cap export message from mds.0.
After the end of these events, the client doesn't have any cap for
the opened file.
To fix the message ordering issue, this patch introduces a new session
operation FLUSHMSG. Before exporting caps, we send a FLUSHMSG session
message to the client and wait for the acknowledgment. When receiving the
FLUSHMSG_ACK message from the client, we are sure that the client has received
all messages sent previously.
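The barrier idea can be sketched with a toy in-order channel. This Python model is an illustration only: the Session class and its methods are hypothetical, and it assumes the mds-to-client session delivers messages in the order they were sent.

```python
from collections import deque

class Session:
    """A toy ordered mds -> client channel."""
    def __init__(self):
        self.in_flight = deque()
        self.delivered = []

    def send(self, msg):
        self.in_flight.append(msg)

    def client_receive_one(self):
        # The client handles messages strictly in send order.
        msg = self.in_flight.popleft()
        self.delivered.append(msg)
        return "FLUSHMSG_ACK" if msg == "FLUSHMSG" else None

def flush_before_export(session):
    """Send FLUSHMSG and wait until the client acknowledges it."""
    session.send("FLUSHMSG")
    while session.client_receive_one() != "FLUSHMSG_ACK":
        pass
```

Once the acknowledgment comes back, everything queued before FLUSHMSG (such as the open file reply) has already been handled by the client.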
Yan, Zheng [Mon, 18 Nov 2013 09:59:06 +0000 (17:59 +0800)]
mds: increase cap sequence when sharing max size
Consider the case:
- client voluntarily releases some caps through cap update message
- mds shares the new max by sending cap grant message
- mds receives the cap update message
If the mds doesn't increase the cap sequence when sharing the max size,
it can't determine whether the cap update message was sent before or after
the client received the cap grant message that updates the max size.
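A small sketch of why the sequence bump helps. The class and message layout below are hypothetical, and the sketch assumes the client echoes the last cap sequence it received in each cap update message.

```python
class MdsCap:
    def __init__(self):
        self.seq = 0
        self.max_size = 0

    def share_max_size(self, new_max):
        self.max_size = new_max
        self.seq += 1    # the fix: bump the sequence along with the grant
        return {"op": "grant", "seq": self.seq, "max_size": new_max}

    def update_saw_latest_grant(self, update_msg):
        # Assumption: the update carries the last sequence the client saw.
        return update_msg["seq"] == self.seq
```

An update sent before the grant arrived carries the old sequence, so the mds can tell the two cases apart.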
Yan, Zheng [Mon, 18 Nov 2013 03:06:43 +0000 (11:06 +0800)]
mds: include inode version in auth mds' lock messages
Encode the inode version in the auth mds' lock messages, so that the version
of replica inodes gets updated. This is important because the client
uses the inode version in the mds reply to check if the cached inode is
already up-to-date. It skips updating the inode if it thinks the
inode is already up-to-date.
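The client-side check can be pictured roughly like this. The function and the dict layout are hypothetical, and the exact comparison the client uses is an assumption; the point is only that a reply carrying a stale version is silently ignored.

```python
def apply_mds_reply(cached_inode, reply_inode):
    """Update the cached inode only if the reply carries a newer version."""
    if reply_inode["version"] <= cached_inode["version"]:
        return False     # looks up to date: skip the update
    cached_inode.update(reply_inode)
    return True
```

If a replica mds answers with an inode whose version was never advanced, a real change (for example a new size) would be dropped by this check.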
Yan, Zheng [Sun, 17 Nov 2013 09:03:29 +0000 (17:03 +0800)]
mds: waiting for slave request to finish
If the MDS receives a client request but finds there is an existing
slave request, wait for the slave request to finish. It's possible that
another MDS forwarded the request to us, but the
MMDSSlaveRequest::OP_FINISH message arrives after the client request.
Yan, Zheng [Tue, 12 Nov 2013 08:12:25 +0000 (16:12 +0800)]
mds: keep dentry lock in sync state
Unlike locks of other types, a dentry lock in an unreadable state can
block path traversal, so it should be in sync state as much as
possible.
This patch makes Locker::try_eval() change the dentry lock's state to
sync even when the dentry is freezing. It also makes the migrator check
imported dentries' lock states and change the locks' states to sync if
necessary.
Yan, Zheng [Thu, 7 Nov 2013 09:07:51 +0000 (17:07 +0800)]
mds: fix empty directory check
Since commit 310032ee81 (fix mds scatter_writebehind starvation), rdlocking
a scatter lock does not always propagate dirty fragstats to the corresponding
inode. So Server::_dir_is_nonempty() needs to check each dirfrag's stat
instead of checking the inode's dirstat.
Yan, Zheng [Wed, 6 Nov 2013 01:42:43 +0000 (09:42 +0800)]
mds: handle cache rejoin corner case
A recovering MDS may receive a strong cache rejoin from a survivor; if the
survivor then restarts, the recovering MDS receives a weak cache
rejoin from the same MDS. Before processing the weak cache rejoin,
we should scour replicas added by the obsoleted strong cache rejoin.
Yan, Zheng [Wed, 6 Nov 2013 01:28:51 +0000 (09:28 +0800)]
mds: unify nonce type
MDSCacheObject::replica_nonce is defined as __s16, but nonce type
in MDSCacheObject::replica_map is int. This mismatch may confuse
MDCache::handle_cache_expire().
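As a generic illustration of how a 16-bit field and an int can disagree (not a trace of the actual MDS code paths), the snippet below stores the same value through a signed 16-bit type and compares it with the plain integer. The helper name is hypothetical.

```python
import ctypes

def as_s16(value):
    """Return value as it would read back from a signed 16-bit field."""
    return ctypes.c_int16(value).value

nonce_as_int = 40000                 # fits in an int
nonce_as_s16 = as_s16(nonce_as_int)  # wraps when stored in 16 bits
nonces_match = (nonce_as_int == nonce_as_s16)
```

Small values survive the round trip, but anything above 32767 wraps, so two copies of the "same" nonce stop comparing equal.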
Yan, Zheng [Thu, 24 Oct 2013 09:10:59 +0000 (17:10 +0800)]
mds: rework stale import/export message detection
Current code uses import state to detect obsolete import/export messages.
It does not work for the case: cancel a subtree export, export the same
subtree again, then the messages for the first export get dispatched.
This patch introduces a "transaction ID" for subtree exports. Each subtree
export has a unique TID, and the ID is recorded in all import/export related
messages. By comparing the TID, we can reliably detect stale messages.
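The TID comparison can be sketched as below. ExportTracker and its methods are hypothetical names used for illustration; the sketch only models the cancel-then-re-export case from the commit message.

```python
import itertools

class ExportTracker:
    def __init__(self):
        self._next_tid = itertools.count(1)
        self.active = {}     # subtree -> TID of the current export

    def start_export(self, subtree):
        tid = next(self._next_tid)
        self.active[subtree] = tid
        return tid

    def cancel_export(self, subtree):
        self.active.pop(subtree, None)

    def is_stale(self, subtree, msg_tid):
        # A message is stale unless it carries the current export's TID.
        return self.active.get(subtree) != msg_tid
```

A message left over from the cancelled export carries the old TID and is rejected, even though the subtree is once again in an exporting state.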
Yan, Zheng [Thu, 24 Oct 2013 08:05:56 +0000 (16:05 +0800)]
mds: put import/export related states together
Current code uses several STL maps to record import/export related
states. A map lookup is required for each state access, this is not
efficient. It's better to put import/export related states together.
Yan, Zheng [Wed, 23 Oct 2013 01:15:58 +0000 (09:15 +0800)]
mds: freeze tree deadlock detection.
There are two situations that result in freeze tree deadlock.
First situation:
- mds.0 authpins an item in subtree A
- mds.0 sends request to mds.1 to authpin an item in subtree B
- mds.0 freezes subtree A
- mds.1 authpins an item in subtree B
- mds.1 sends request to mds.0 to authpin an item in subtree A
- mds.1 freezes subtree B
- mds.1 receives the remote authpin request from mds.0
(wait because subtree B is freezing)
- mds.0 receives the remote authpin request from mds.1
(wait because subtree A is freezing)
Second situation:
- client request authpins items in subtree B
- freeze subtree B
- import subtree A which is parent of subtree B
(authpins parent inode of subtree B, see CDir::set_dir_auth())
- freeze subtree A
- client request tries authpinning items in subtree A
(wait because subtree A is freezing)
Enforcing an authpinning order can avoid the deadlock, but it's very
expensive. The deadlock is rare, so I think deadlock detection is
more suitable for the case.
This patch introduces freeze tree deadlock detection. We record the
start time of freezing tree. If we fail to freeze the tree within a
given duration, cancel the process of freezing tree.
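The timeout-based detection can be sketched like this. The class, function, and timeout parameter are hypothetical; the sketch only shows the bookkeeping of start times and the decision to cancel.

```python
class FreezingTree:
    def __init__(self, started_at):
        self.started_at = started_at   # when freezing began
        self.frozen = False            # set once the freeze completes

def find_stuck_freezes(trees, now, timeout):
    """Return the names of trees whose freeze should be cancelled."""
    return [name for name, tree in trees.items()
            if not tree.frozen and now - tree.started_at > timeout]
```

A periodic tick could call find_stuck_freezes() and cancel each returned freeze, which releases the waiters and breaks either deadlock cycle.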
Loic Dachary [Sun, 15 Dec 2013 20:41:45 +0000 (21:41 +0100)]
qa: make cephtool test immune to pool size
Instead of assuming the pool size is 2, query it and increment it to
test pool set data size. This allows running the test from vstart.sh
without knowing the required pool size in advance:
rm -fr dev out ; mkdir -p dev ; \
MON=1 OSD=3 ./vstart.sh -n -X -l mon osd
Loic Dachary [Sun, 15 Dec 2013 15:27:02 +0000 (16:27 +0100)]
mon: set ceph osd (down|out|in|rm) error code on failure
Instead of always returning true, the error code is set if at least one
operation fails.
EINVAL if the OSD id is invalid (osd.foobar for instance).
EBUSY if trying to remove an OSD that is up.
When used with the ceph command line, it looks like this:
ceph -c ceph.conf osd rm osd.0
Error EBUSY: osd.0 is still up; must be down before removal.
kill PID_OF_osd.0
ceph -c ceph.conf osd down osd.0
marked down osd.0.
ceph -c ceph.conf osd rm osd.0 osd.1
Error EBUSY: removed osd.0, osd.1 is still up; must be down before removal.
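The aggregation behavior shown above can be modeled roughly as follows. This is not the monitor code: the function, its arguments, and the wording of the invalid-id message are hypothetical, and reporting the last failure's code when several differ is an arbitrary choice made for the sketch.

```python
import errno

def osd_rm(osd_ids, up_osds, known_osds):
    """Return (error_code, message); error_code is 0 on full success."""
    err = 0
    parts = []
    for osd in osd_ids:
        if osd not in known_osds:
            err = -errno.EINVAL
            parts.append(f"{osd} is not a valid OSD id")  # hypothetical text
        elif osd in up_osds:
            err = -errno.EBUSY
            parts.append(f"{osd} is still up; must be down before removal")
        else:
            known_osds.discard(osd)
            parts.append(f"removed {osd}")
    return err, ", ".join(parts)
```

Operations that can succeed still do, but the overall result is an error as soon as one of them fails.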
Steve Stock [Sat, 14 Dec 2013 21:44:06 +0000 (16:44 -0500)]
Add -n option to mount.ceph. Required by autofs when /etc/mtab is a link to /proc/mounts (e.g. Debian Wheezy), otherwise automounting a ceph file system fails. Also useful when /etc is read-only. feature 7006
Loic Dachary [Sun, 15 Dec 2013 13:31:27 +0000 (14:31 +0100)]
common: fix rare race condition in Throttle unit tests
The thread created to test Throttle race conditions updates a value (
throttle.get_current() ) that is tested by the main gtest thread but is
not protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.
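The same pattern in a Python analogue (the real test uses pthreads and gtest; the names below are hypothetical): the main thread reads the shared value only after joining the child, so no lock is needed.

```python
import threading

class Counter:
    def __init__(self):
        self.current = 0

def worker(counter):
    counter.current += 1     # unsynchronized write in the child thread

counter = Counter()
child = threading.Thread(target=worker, args=(counter,))
child.start()
child.join()                 # the child has finished; its write is visible
value_after_join = counter.current
```

Reading counter.current before join() would race with the child; reading it afterwards cannot.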
John Wilkins [Sat, 14 Dec 2013 00:08:37 +0000 (16:08 -0800)]
doc: Updates to federated config.
Reverted Emperor versionadded to Dumpling as it gets backported.
Added default index and bucket pools to pool creation
Added default default_placement setting
Added placement_pools key val pair examples.
Added comments for re-running the procedure for the secondary region.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: modprobe with single_major=Y on newer kernels
On kernels that support it, and if 'rbd map' is given a chance to
modprobe, turn on single-major device number allocation scheme. For
users who for some reason don't want it, the workaround is to insert
the rbd module manually before executing the first 'rbd map' command.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: add support for single-major device number allocation scheme
With the preparatory commits ("rbd: match against wholedisk device
numbers on unmap" and "rbd: match against both major and minor on unmap
on kernels >= 3.14") in, this amounts to choosing to work with new rbd
bus interfaces (/sys/bus/rbd/{add,remove}_single_major) if they are
available, instead of the old ones (/sys/bus/rbd/{add,remove}).
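The interface selection can be sketched as a simple existence check. This Python helper is hypothetical (the rbd tool itself is C++); only the four sysfs paths come from the commit message, and the sysfs_root parameter exists so the function can be exercised against a scratch directory.

```python
import os

def rbd_bus_paths(sysfs_root="/sys/bus/rbd"):
    """Return the (add, remove) bus files to use for map/unmap."""
    add = os.path.join(sysfs_root, "add_single_major")
    remove = os.path.join(sysfs_root, "remove_single_major")
    if os.path.exists(add) and os.path.exists(remove):
        return add, remove           # new single-major interfaces
    # Fall back to the old interfaces on kernels without single-major.
    return (os.path.join(sysfs_root, "add"),
            os.path.join(sysfs_root, "remove"))
```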
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: match against both major and minor on unmap on newer kernels
As described in commit "rbd: match against wholedisk device numbers on
unmap", currently we only match against major numbers. In preparation
for support for single-major device number allocation scheme, start
matching against minor numbers also, which newer kernels provide in
a /sys/bus/rbd/devices/<id>/minor sysfs attribute.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: match against whole disks on unmap
Currently the way 'rbd unmap' translates a user-provided block device
into an rbd id is it matches the major number of the specified device
against /sys/bus/rbd/devices/<id>/major for each rbd mapping and
declares success on the first match. This works for both entire disks
and partitions, because under the current device number allocation
scheme, each mapping means a new major number.
In preparation for support for single-major device number allocation
scheme, which would require matching both major and minor numbers, make
sure to always match against entire disk device numbers, by converting
the specified device major:minor pair into wholedisk major:minor pair.
To achieve that, use the libblkid library, which accomplishes this goal
by walking stable sysfs structures.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: switch to strict_strtol for major parsing
Use common/strict_strtol, which actually parses integers in a proper
way, instead of atoi for parsing /sys/bus/rbd/devices/<id>/major. This
is important, because the kernel apparently can write things like
"(none)" into that file, and in general is more bulletproof.
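To show the difference between atoi-style and strict parsing, here is a small Python sketch. It is not common/strict_strtol; the function name and the (value, error) return shape are assumptions for illustration.

```python
import re

def strict_int(text):
    """Return (value, error); value is None unless text is a clean integer."""
    s = text.strip()
    if re.fullmatch(r"[+-]?[0-9]+", s) is None:
        return None, f"expected integer, got {text!r}"
    return int(s, 10), None
```

An atoi-style parser would quietly turn "(none)" into 0 and "12abc" into 12, whereas the strict version reports both as errors.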