mon: MonCommands.h: have 'auth' read-only operations require 'x' cap
This reintroduces the same semantics that were in place in dumpling prior
to the refactoring of the cap/command matching code.
We haven't added this requirement to auth read-write operations, as that
would risk breaking a lot of well-configured keyrings once users upgrade,
without any significant gain -- we assume that if they have set 'rw' caps
on a given entity, they do expect said entity to be a sort-of-privileged
entity with regard to monitor access.
Fixes: #7919 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
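The rule described in this message can be sketched as a small cap check. This is an illustration only, following what the message states and nothing more: required_caps() and is_allowed() are hypothetical helpers, not the MonCommands.h matching code, and caps are modelled as plain strings of 'r', 'w' and 'x'.

```python
def required_caps(module: str, is_write: bool) -> set:
    """Hypothetical mapping from a monitor command to the caps it needs."""
    caps = {"r"}
    if is_write:
        # Read-write commands are left as they were: no 'x' requirement.
        caps.add("w")
    elif module == "auth":
        # Read-only 'auth' commands additionally require 'x'.
        caps.add("x")
    return caps


def is_allowed(entity_caps: str, module: str, is_write: bool) -> bool:
    # Allowed when every required cap is present on the entity.
    return required_caps(module, is_write) <= set(entity_caps)


read_only_entity = is_allowed("r", "auth", is_write=False)
privileged_entity = is_allowed("rx", "auth", is_write=False)
rw_entity_writes = is_allowed("rw", "auth", is_write=True)
```

In this model an entity with only 'r' can no longer run read-only auth commands, while an 'rw' entity keeps its write access unchanged.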
ReplicatedPG: fix CEPH_OSD_OP_CREATE on cache pools
The following
./ceph osd pool create data-cache 8 8
./ceph osd tier add data data-cache
./ceph osd tier cache-mode data-cache writeback
./ceph osd tier set-overlay data data-cache
./rados -p data create foo
./rados -p data stat foo
results in
error stat-ing data/foo: No such file or directory
even though foo exists in the data-cache pool, as it should. STAT
checks for (exists && !is_whiteout()), but the whiteout flag isn't
cleared on CREATE as it is on WRITE and WRITEFULL. The problem is
that, for newly created 0-sized cache pool objects, the CREATE handler in
do_osd_ops() doesn't get a chance to queue OP_TOUCH, and so the logic
in prepare_transaction() considers CREATE to be a read and therefore
doesn't clear whiteout. Fix it by allowing the CREATE handler to queue
OP_TOUCH at all times, mimicking WRITE and WRITEFULL behaviour.
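The interaction between the STAT check and the whiteout flag can be illustrated with a toy model. This is a simplified sketch and not the OSD code: ObjectState, stat() and create() are hypothetical names, and the rule that an op clears whiteout only when it queued something is an assumption made to mirror the description above.

```python
from dataclasses import dataclass


@dataclass
class ObjectState:
    """Hypothetical stand-in for a cache-pool object's state."""
    exists: bool = False
    whiteout: bool = False


def stat(obj: ObjectState) -> bool:
    # Mirrors the check described above: exists && !is_whiteout().
    return obj.exists and not obj.whiteout


def create(obj: ObjectState, always_touch: bool) -> None:
    # Toy assumption: the op counts as a write only if it queued a touch.
    queued_touch = always_touch or not obj.exists
    if queued_touch:
        obj.exists = True
        obj.whiteout = False  # a write clears the whiteout flag


# An object that is present in the cache pool only as a whiteout.
before_fix = ObjectState(exists=True, whiteout=True)
create(before_fix, always_touch=False)

after_fix = ObjectState(exists=True, whiteout=True)
create(after_fix, always_touch=True)
```

With always_touch=False the object stays a whiteout and stat() keeps failing; with always_touch=True the flag is cleared and stat() succeeds.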
This will make the OSD randomly reject backfill reservation requests. This
exercises the failure code paths but does not break overall behavior
because the primary will back off and retry later.
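The reject-and-retry behaviour can be sketched as follows. This is a minimal model under stated assumptions: request_reservation() and reserve_with_retry() are hypothetical names, the rejection probability is a plain float parameter, and the back-off is reduced to a comment.

```python
import random


def request_reservation(reject_probability: float, rng: random.Random) -> bool:
    """Hypothetical replica side: reject a request at random."""
    return rng.random() >= reject_probability


def reserve_with_retry(reject_probability: float, rng: random.Random,
                       max_attempts: int = 1000) -> int:
    """Hypothetical primary side: retry until granted; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if request_reservation(reject_probability, rng):
            return attempt
        # A real primary would back off here and retry later.
    raise RuntimeError("reservation never granted")


attempts = reserve_with_retry(0.5, random.Random(0))
```

As long as the rejection probability is below 1, the request is eventually granted, which is why random rejections exercise the failure paths without changing the overall outcome.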
qa: test_alloc_hint: set ec ruleset-failure-domain to osd
Create a custom profile with ruleset-failure-domain=osd. (The default
ruleset-failure-domain=host won't do because this script assumes and
works only if all osds are on the same host.) While at it, set k and m
explicitly to avoid troubles in the future.
stop.sh: unmap rbd images when stopping the whole cluster
Unmap rbd images when stopping the whole cluster. Not doing so results
in images that cannot be unmapped until the same cluster is brought
back up. Issue a warning if we failed to unmap all images.
Sage Weil [Wed, 2 Apr 2014 23:43:10 +0000 (16:43 -0700)]
lockdep: reset state on shutdown
If we shut down, clear out all of the lockdep state. This ensures that if
we start up again on another cct, we will not be confused by old type ids
and dependency state.
Sage Weil [Wed, 2 Apr 2014 23:46:30 +0000 (16:46 -0700)]
lockdep: do not initialize if already started
If we have already registered a cct for lockdep, do not accept another one.
We already check that the cct matches when we shut down. This way we will
run for the life span of a single cct and no longer.
Fixes: #7965 Signed-off-by: Sage Weil <sage@inktank.com>
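The two lockdep changes above (reset on shutdown, refuse a second cct) can be modelled together. This is an illustrative sketch only: the Lockdep class and its register()/unregister() methods are hypothetical names, not the lockdep functions in the tree.

```python
class Lockdep:
    """Hypothetical model of process-wide lockdep state."""

    def __init__(self) -> None:
        self.cct = None        # the single registered context, if any
        self.lock_ids = {}     # lock name -> type id
        self.follows = set()   # (earlier id, later id) dependency edges

    def register(self, cct) -> bool:
        # Do not accept a second context while one is registered.
        if self.cct is not None:
            return False
        self.cct = cct
        return True

    def unregister(self, cct) -> None:
        # Only the registered context may shut lockdep down.
        if self.cct is not cct:
            return
        self.cct = None
        # Clear everything so a later context starts from scratch.
        self.lock_ids.clear()
        self.follows.clear()


lockdep = Lockdep()
first, second = object(), object()
accepted_first = lockdep.register(first)
accepted_second = lockdep.register(second)
lockdep.lock_ids["mutex_a"] = 0
lockdep.unregister(second)  # ignored: not the registered context
still_registered = lockdep.cct is first
lockdep.unregister(first)
```

After the final unregister() no type ids or dependency edges remain, so a context registered later cannot be confused by old state.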
Sage Weil [Wed, 2 Apr 2014 23:03:37 +0000 (16:03 -0700)]
OSDMap: bump snap_epoch when adding a tier
When we make an existing pool a tier, we start copying the snap metadata
from the base tier. That includes removed_snaps. In order for the OSD
to recognize that this value is changing for the first time, we need to
set snap_epoch, or else the OSD doesn't update its in-memory PGPool
with removed snaps and we eventually hit an assertion failure because
PGPool::cached_remove_snaps is incorrect (e.g., empty).
Fix this by bumping snap_epoch when we add the new tier.
Fixes: #7915 Signed-off-by: Sage Weil <sage@inktank.com>
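Why the epoch bump matters can be shown with a toy model. This is a sketch under an assumption: the cache is refreshed only when the pool's snap_epoch equals the epoch of the map being processed. Pool, CachedPool and add_tier() are hypothetical names, not the OSDMap or PGPool classes.

```python
class Pool:
    """Hypothetical pool entry inside an OSD map."""

    def __init__(self, removed_snaps, snap_epoch):
        self.removed_snaps = set(removed_snaps)
        self.snap_epoch = snap_epoch


class CachedPool:
    """Hypothetical OSD-side cache of a pool's removed snaps."""

    def __init__(self):
        self.cached_removed_snaps = set()

    def update(self, map_epoch, pool):
        # Assumed rule: refresh only when the pool says its snap
        # metadata changed in this map epoch.
        if pool.snap_epoch == map_epoch:
            self.cached_removed_snaps = set(pool.removed_snaps)


def add_tier(tier, base, map_epoch, bump_snap_epoch):
    # The tier starts copying snap metadata from the base pool.
    tier.removed_snaps = set(base.removed_snaps)
    if bump_snap_epoch:
        tier.snap_epoch = map_epoch


base = Pool({1, 2}, snap_epoch=5)
stale_tier, fresh_tier = Pool([], 3), Pool([], 3)
add_tier(stale_tier, base, map_epoch=10, bump_snap_epoch=False)
add_tier(fresh_tier, base, map_epoch=10, bump_snap_epoch=True)

stale_cache, fresh_cache = CachedPool(), CachedPool()
stale_cache.update(10, stale_tier)
fresh_cache.update(10, fresh_tier)
```

Without the bump the cache stays empty even though the tier now carries removed snaps; with the bump the cache picks them up.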
* Require "$remote_fs" since it guarantees /usr availability
(rbd executable is in /usr/bin/rbd)
* Speed up init.d rbd mapping on machines acting as MON/OSD
by starting rbdmap after /init.d/ceph (when possible) and
shutting down rbd before ceph.
* Map rbd devices before starting X (helpful when /home is mounted from rbd).
mds: add dentries in dirfrag to LRU in reverse order
Files in a dirfrag are usually processed in the order of readdir
results. Files at the beginning of the results are more likely to be
used in the future than files at the end.
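The effect of the insertion order can be seen with a toy LRU. This is a sketch that assumes each newly inserted entry lands at the hot end of the list and that eviction happens from the cold end; fill_lru() is a hypothetical helper, not the MDS LRU.

```python
from collections import deque


def fill_lru(names, reverse):
    """Insert names at the hot end of a toy LRU; return it hottest-first."""
    lru = deque()
    order = reversed(names) if reverse else names
    for name in order:
        lru.appendleft(name)  # newest insert becomes the hottest entry
    return list(lru)


readdir_order = ["a", "b", "c", "d"]
forward = fill_lru(readdir_order, reverse=False)
backward = fill_lru(readdir_order, reverse=True)
```

Inserting in readdir order leaves "a" at the cold end, so it would be evicted first. Inserting in reverse order leaves "a" at the hot end, so the entries that readdir returned first survive longest.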
For a cross-authority rename, the MDS first freezes the source inode's
authpin. This happens while the source dentry isn't locked, so by the
time the inode's authpin becomes frozen, the source dentry may have
changed and be linked to a different inode.
mds: treat cluster as degraded when there is clientreplay MDS
This forbids exporting subtrees and fragmenting dirfrags when there
is an MDS in the clientreplay state. While replaying client requests,
the MDS may need to authpin some remote objects. Exporting subtrees and
fragmenting dirfrags slow down replaying client requests.
Yan, Zheng [Mon, 31 Mar 2014 01:46:58 +0000 (09:46 +0800)]
mds: don't start new segment while finishing disambiguate imports
This avoids inserting an ESubtreeMap among the EImportFinish events that
finish disambiguate imports, because the ESubtreeMap reflects the
subtree state when all EImportFinish events are replayed.
Sage Weil [Tue, 1 Apr 2014 21:27:31 +0000 (14:27 -0700)]
osd/ReplicatedPG: mark_unrollbackable when _rollback_to head
We fell into the case in _rollback_to where we just set ctx->modify = true
and don't explicitly mark the ctx as unrollbackable. Later, we screw up
in proc_replica_log as a result because we think we can roll back this
update to the head when in reality we cannot.
Fixes: #7907 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 1 Apr 2014 18:04:47 +0000 (11:04 -0700)]
osd/ReplicatedPG: handle snapdir properly during scrub
Handle snapdir similarly to how head is treated when updating the
next_clone info. Also, add a warning when we have a snapdir object and
head_exists == true (the converse of the existing check).
Fixes: #7937 Signed-off-by: Sage Weil <sage@inktank.com>
Josh Durgin [Mon, 31 Mar 2014 21:53:31 +0000 (14:53 -0700)]
librbd: skip zeroes when copying an image
This is the simple coarse-grained solution, but it works well in
common cases like a small base image resized with a bunch of empty
space at the end. Finer-grained sparseness can be copied by using rbd
{export,import}-diff.
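The coarse-grained idea can be sketched on in-memory streams. This is an illustration only, not the librbd copy path: sparse_copy() is a hypothetical helper, the chunk size is arbitrary, and the destination is assumed to be pre-sized and to read as zeroes wherever nothing was written.

```python
import io


def sparse_copy(src, dst, chunk_size):
    """Copy src to dst chunk by chunk, skipping all-zero chunks.

    Returns the number of bytes actually written.
    """
    offset = 0
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if any(chunk):  # at least one non-zero byte in this chunk
            dst.seek(offset)
            dst.write(chunk)
            written += len(chunk)
        offset += len(chunk)
    return written


data = b"head" + bytes(8) + b"tail"
src = io.BytesIO(data)
dst = io.BytesIO(bytes(len(data)))  # pre-sized, zero-filled destination
written = sparse_copy(src, dst, chunk_size=4)
```

Only the two non-zero chunks are written, yet the destination ends up identical to the source. A chunk that is mostly zeroes but contains one non-zero byte is still written in full, which is the coarse-grained trade-off described above.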
By default, we don't send out maps with primary_temp mappings because
there is no infrastructure in place that would make sure that the
entire cluster knows about primary_temp. Add an option to allow
primary_temp mappings, for development purposes.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 31 Mar 2014 17:42:23 +0000 (10:42 -0700)]
mon/PGMap: clear pool sum when last pg is deleted
Use the x.0 pg as a sentinel for the existence of the pool. Note that we
have to clean it up in two paths: apply_incremental (which is actually
deprecated) and the normal PGMonitor refresh.
Fixes: #7912 Signed-off-by: Sage Weil <sage@inktank.com>
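The sentinel idea can be sketched with plain dictionaries. This is a toy model and not the PGMap code: remove_pg() is a hypothetical helper, pgids are (pool, seed) tuples, and the per-pool summary is reduced to a single object count.

```python
def remove_pg(pg_stats, pool_sums, pgid):
    """Remove one pg; drop the pool's summary if pg x.0 goes away."""
    pool, seed = pgid
    objects = pg_stats.pop(pgid, 0)
    if pool in pool_sums:
        pool_sums[pool] -= objects
    if seed == 0:
        # x.0 is the sentinel: once it is gone, treat the pool as deleted.
        pool_sums.pop(pool, None)


pg_stats = {(1, 0): 10, (1, 1): 5, (2, 0): 7}
pool_sums = {1: 15, 2: 7}

remove_pg(pg_stats, pool_sums, (1, 1))
sums_after_partial = dict(pool_sums)

remove_pg(pg_stats, pool_sums, (1, 0))
```

Without the sentinel check, pool 1 would linger in pool_sums with a count of zero after its last pg is removed.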
Loic Dachary [Sat, 29 Mar 2014 10:30:42 +0000 (11:30 +0100)]
doc: pgbackend dev doc outdated notice
* Warn the reader that the implementation is ahead and may differ
* Update the links to the Firefly branch
* Remove links to issues used during development to avoid confusion
Loic Dachary [Sat, 29 Mar 2014 10:27:00 +0000 (11:27 +0100)]
doc: erasure code developer notes updates
Update the introduction to explain erasure code profiles. Remove
obsolete explanations about partial writes etc. Remove links to tickets
used during development. Update permalinks to be closer to
Firefly (v0.78).
Yan, Zheng [Sun, 30 Mar 2014 01:21:57 +0000 (09:21 +0800)]
fuse: implement 'access' low level function
Add an empty 'access' function to fuse low level functions. This
allows us to use ceph-fuse with fuse_default_permissions = false.
'fuse_default_permissions = false' can significantly improve the
speed of creating/removing a large number of files.
When fuse_default_permissions is true, the fuse kernel module sends
a getattr request whenever the kernel needs to check a directory's
permission. getattr (STAT_CAP_INODE_ALL) can be very slow if the
directory was just modified.