Sage Weil [Sun, 20 Jul 2014 14:48:47 +0000 (07:48 -0700)]
os/FileStore: fix max object name limit
Our max object name is not limited by file name size, but by the length of
the name we can stuff in an xattr. That will vary from file system to
file system, so just make this 4096. In practice, it should be limited
via the global tunable, if it is adjusted at all.
Sage Weil [Fri, 18 Jul 2014 17:42:11 +0000 (10:42 -0700)]
os: add ObjectStore::get_max_attr_name_length()
Most importantly, capture that attrs on FileStore can't be more than about
100 chars. The Linux xattrs can only be 128 chars, but we also have some
prefixing we do.
Sage Weil [Wed, 16 Jul 2014 21:17:27 +0000 (14:17 -0700)]
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)
Previously we had a hard coded limit of 4096. Objects > 3k crash the OSD
when running on ext4, although they probably work on xfs. But rgw only
generates objects a bit over 1024 bytes (maybe 1200 tops?), so let's set a
more reasonable limit here. 2048 is a nice round number and should be
safe.
Add a test.
Fixes: #8174
Signed-off-by: Sage Weil <sage@redhat.com>
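A minimal sketch of the kind of check this tunable implies. It is illustrative only and is not the OSD code: the helper name `check_object_name`, measuring the name in UTF-8 bytes, and rejecting with ENAMETOOLONG are all assumptions; only the default of 2048 comes from the message above.

```python
import errno

# Default taken from the commit message; everything else here is hypothetical.
OSD_MAX_OBJECT_NAME_LEN = 2048


def check_object_name(name: str, max_len: int = OSD_MAX_OBJECT_NAME_LEN) -> None:
    """Raise OSError(ENAMETOOLONG) if the encoded name exceeds max_len bytes."""
    encoded = name.encode("utf-8")  # assumption: the limit is counted in bytes
    if len(encoded) > max_len:
        raise OSError(
            errno.ENAMETOOLONG,
            f"object name is {len(encoded)} bytes, limit is {max_len}",
        )
```

A name of exactly `max_len` bytes passes; one byte more is rejected before any backend is touched.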
John Spray [Thu, 17 Jul 2014 12:15:45 +0000 (13:15 +0100)]
mds: fix journal reformat failure in standbyreplay
In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, where this should have only been done by the active
MDS for the rank. Depending on timing, this could cause
fatal corruption of the journal.
This change handles the following cases:
* only do reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide
Fixes: #8811
Signed-off-by: John Spray <john.spray@redhat.com>
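The three cases above can be sketched as a small decision function. This is a hedged model, not the MDS code: the names `Action` and `decide` are hypothetical, a single `supported_version` stands in for both "the format this daemon writes" and "the max supported", and the missing-header case for an active MDS is not described in the message, so the sketch refuses to guess at it.

```python
from enum import Enum


class Action(Enum):
    REPLAY = "replay"      # journal is usable as-is
    REFORMAT = "reformat"  # rewrite the journal in the current format
    RETRY = "retry"        # corresponds to raising EAGAIN
    SUICIDE = "suicide"    # journal is newer than this daemon understands


def decide(header_present, on_disk_version, supported_version, standby_replay):
    """Pick what to do when opening the journal (illustrative only)."""
    if not header_present:
        if standby_replay:
            # The active MDS may be rewriting the journal in the background.
            return Action.RETRY
        raise LookupError("missing header on an active MDS is not covered here")
    if on_disk_version > supported_version:
        return Action.SUICIDE
    if on_disk_version < supported_version:
        # Only the active MDS for the rank may reformat.
        return Action.RETRY if standby_replay else Action.REFORMAT
    return Action.REPLAY
```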
John Spray [Thu, 17 Jul 2014 12:15:10 +0000 (13:15 +0100)]
osdc/Journaler: validate header on load and save
Previously if the journal header contained invalid
write, expire or trimmed offsets, we would end up
hitting a hard-to-understand assertion much later.
Instead, raise the error right away if the fields
are identifiably bad at load time, and assert that
they're valid before persisting them.
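A rough sketch of the load/save validation described above. The field names and the invariant `trimmed_pos <= expire_pos <= write_pos` are assumptions made for illustration; the message only says the write, expire and trimmed offsets must be valid. `load_header` and `save_header` are hypothetical helpers, not Journaler methods.

```python
from dataclasses import dataclass


@dataclass
class JournalHeader:
    trimmed_pos: int
    expire_pos: int
    write_pos: int


def header_is_valid(h: JournalHeader) -> bool:
    # Assumed invariant: offsets are non-negative and ordered.
    return 0 <= h.trimmed_pos <= h.expire_pos <= h.write_pos


def load_header(h: JournalHeader) -> JournalHeader:
    """Fail at load time instead of hitting an obscure assertion later."""
    if not header_is_valid(h):
        raise ValueError(f"corrupt journal header: {h}")
    return h


def save_header(h: JournalHeader, write) -> None:
    """Never persist a header that would fail validation on the next load."""
    assert header_is_valid(h), f"refusing to persist invalid header: {h}"
    write(h)
```

The asymmetry is deliberate in this sketch: bad data read from disk is a reportable error, while bad data about to be written indicates a bug in the caller.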
John Spray [Wed, 9 Jul 2014 11:43:04 +0000 (12:43 +0100)]
qa: generalise cephtool for vstart+MDS
Previously this test assumed no pre-existing
filesystem and no MDS running. Generalize it
to nuke any existing filesystems found before
running, so that you can use it inside a vstart
cluster that had MDS>0.
rgw: don't try to wait for pending if list is empty
Fixes: #8846
Backport: firefly, dumpling
This was broken at ea68b9372319fd0bab40856db26528d36359102e. We ended
up calling wait_pending_front() when pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).
Sage Weil [Wed, 16 Jul 2014 01:11:41 +0000 (18:11 -0700)]
init-ceph: wrap daemon startup with systemd-run when running under systemd
We want to make sure the daemon runs in its own systemd environment. Check
for systemd as pid 1 and, when present, use systemd-run -r <cmd> to do
this.
Probably fixes #7627
Signed-off-by: Sage Weil <sage@redhat.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Tested-by: Dan Mick <dan.mick@inktank.com>
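A small sketch of the wrapping logic, written in Python for illustration (init-ceph itself is a shell script). How the script detects systemd is not stated above; reading `/proc/1/comm` is an assumption, and both function names are hypothetical. Only the `systemd-run -r <cmd>` form comes from the message.

```python
from pathlib import Path


def pid1_is_systemd(comm_path: str = "/proc/1/comm") -> bool:
    """Best-effort check that pid 1 is systemd (assumed detection method)."""
    try:
        return Path(comm_path).read_text().strip() == "systemd"
    except OSError:
        # No /proc, or not readable: treat as "not systemd".
        return False


def wrap_for_systemd(cmd, under_systemd):
    """Prefix the daemon command with systemd-run -r when under systemd."""
    if under_systemd:
        return ["systemd-run", "-r", *cmd]
    return list(cmd)
```

Usage would look like `wrap_for_systemd(["ceph-osd", "-i", "0"], pid1_is_systemd())`, with the result passed to whatever launches the daemon.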
John Spray [Tue, 15 Jul 2014 13:28:32 +0000 (14:28 +0100)]
doc: add cephfs layout documentation
This clarifies how to deal with layouts in CephFS
using vxattrs. We can point people here if they
ask what they should use instead of the deprecated
`cephfs set_layout`.
John Spray [Tue, 24 Jun 2014 20:22:04 +0000 (13:22 -0700)]
mds: add `session ls` and `session evict` to asok
These commands are intended to help admins deal
with MDSs during recovery, to identify troublesome
clients which may need intervention (such as eviction).
There are cases when automatic (un)mounting of a file system on RBD is not
enough. Some services may need to be started when an RBD device becomes
available (mapped), and it may be desirable to stop services in order
to release the file system before unmapping the RBD device.
File systems on RBD are not the only use case. RBD devices may be
used as block devices, in which case `/etc/fstab` is not sufficient to
perform an action upon mapping the RBD device. A handler script (hook) can be
useful to properly release the RBD device before unmapping, etc.
Pre-unmap hooks can be important for clean shutdown and for re-exporting
RBD device(s) as iSCSI, AoE, DRBD, etc.
This commit introduces support for per-device hooks to perform per-device
post-map/pre-unmap actions. If a hook named like "poolname/imagename" (same
as in `/etc/ceph/rbdmap` file) is found in
Sage Weil [Fri, 11 Jul 2014 18:31:22 +0000 (11:31 -0700)]
osd/osd_types: be pedantic about encoding last_force_op_resend without feature bit
The addition of the value is completely backward compatible, but if the
mon feature bits don't match it can cause monitor scrub noise (due to the
parallel OSDMap encoding). Avoid that by only adding the new field if the
feature is present (it was added 2 patches after the encoding; see 3152faf79f498a723ae0fe44301ccb21b15a96ab and 45e79a17a932192995f8328ae9f6e8a2a6348d10).
Fixes: #8815
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
`/etc/init.d/rbdmap start` was doing `mount -a`. Although (arguably)
`mount -a -O _netdev` could be less disruptive, it is not the RBD mapping job's
place to mount unrelated devices and potentially do it at the wrong time.
The solution is to call `mount {device}`, which works as expected and mounts
the device even if it is given in the form `mount /dev/rbd/pool/imagename` while
`/etc/fstab` uses UUID or LABEL notation.
Furthermore, this commit
* fixes the global exit code (it was always 0): now it is 0 only when
all devices were (un)mounted successfully; otherwise non-zero.
* replaces `mount -a` with per-device post-mapping `mount {dev}`
* shows mapping progress using LSB functions per device instead of once per
{start|stop} invocation.
* captures output of `(u)mount` (if any) and reports it as "info".
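The per-device mount and exit-code behaviour listed above can be sketched as follows. This is Python for illustration only (rbdmap is a shell init script); `mount_device`, `mount_all`, the `report` callback and the choice of 1 as the failure code are all hypothetical. The message only promises "0 when everything succeeded, non-zero otherwise".

```python
import subprocess


def mount_device(dev):
    """Run `mount <dev>` and return (exit code, combined output)."""
    proc = subprocess.run(["mount", dev], capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def mount_all(devices, mount=mount_device, report=print):
    """Mount each device on its own; return 0 only if every mount succeeded."""
    global_rc = 0
    for dev in devices:
        rc, output = mount(dev)
        if output:
            # Surface whatever mount printed, tagged as informational.
            report(f"info: {dev}: {output}")
        if rc != 0:
            global_rc = 1  # remember the failure but keep going
    return global_rc
```

Injecting `mount` and `report` keeps the aggregation logic testable without touching real devices.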
mon: OSDMonitor: be scary about inconsistent pool tier ids
We may not crash your cluster, but you'll know that this is not something
that should have happened. Big letters make it obvious. We'd make them
red too if we bothered to look for the ANSI code.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Josh Durgin [Thu, 29 May 2014 19:23:30 +0000 (12:23 -0700)]
os: add prototype KineticStore
Implement the KeyValueDB interface using libkinetic_client,
and allow it to be configured as the backend for the KeyValueStore,
running the entire OSD on it.
This prototype implementation has no transaction safety, and is
only suitable as a proof of concept. Since the libkinetic_client
API does not provide reverse iteration over keys without also reading
the value off disk, it implements iterators in a very slow but correct way.
These are used heavily by the KeyValueDB callers, so this is a bottleneck
in performance.