COMPARE AND WRITE operations are implemented by the kernel RBD client
using compound cmpext + write OSD requests. Such requests are handled
by OSDs in a serialized fashion, ensuring that no extra locking is
required for object atomicity (bnc#948986).
Mike Christie [Wed, 29 Jul 2015 09:25:46 +0000 (04:25 -0500)]
osd: add write same op
This goes with kernel patches:
libceph: add support for write same requests
rbd: add support for writesame requests
This adds a new Ceph request, writesame. It writes a buffer of
writesame.data_length bytes repeatedly, starting at writesame.offset,
until writesame.length bytes have been covered.
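The writesame semantics described above can be sketched roughly as follows. This is an illustrative in-memory model only, not the OSD implementation: the function name write_same and the requirement that length be a whole multiple of the buffer size are assumptions made for the sketch.

```python
def write_same(obj: bytearray, offset: int, length: int, data: bytes) -> None:
    """Tile `data` over obj[offset:offset + length] (hypothetical helper)."""
    # Assumption for this sketch: length must be a non-zero multiple of
    # len(data), so the pattern is never truncated.
    if not data or length <= 0 or length % len(data) != 0:
        raise ValueError("length must be a positive multiple of len(data)")
    end = offset + length
    # Grow the in-memory "object" with zero bytes if the write extends it.
    if len(obj) < end:
        obj.extend(b"\0" * (end - len(obj)))
    obj[offset:end] = data * (length // len(data))
```

For example, writing the two-byte pattern b"ab" with length 4 at offset 2 fills bytes 2 through 5 with "abab" and leaves the rest of the object untouched.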
On the kernel rbd client side, we map this command to the SCSI
WRITE_SAME request.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: David Disseldorp <ddiss@suse.de>
(cherry picked from commit 801c01415873b1b52fa53bfcd6a2653bc09a2c25)
Use full radosgw path for set_permissions (the %set_permissions macro expands
into a "/usr/bin/chkstat --system" call, which requires the radosgw binary path
as a parameter to take effect).
Do not add fscap if system security profile is "paranoid".
Set ownership and file mode bits of radosgw binary.
Signed-off-by: Nathan Cutler <ncutler@suse.cz>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Karol Mroz [Mon, 7 Sep 2015 23:14:27 +0000 (16:14 -0700)]
rgw: bypass civetweb dynamic ssl load and link libraries directly
dlopen() calls in civetweb are guarded by the NO_SSL_DL compiler flag.
Setting this during build, and linking the necessary libraries directly,
simplifies cases of differing library names/versions/locations/etc.
Fixes bsc#942874
Nathan Cutler [Tue, 6 Oct 2015 10:25:52 +0000 (12:25 +0200)]
ceph.spec.in: enable OBS post-build-checks to find systemd-tmpfiles
The openSUSE Build Service runs a number of "post-build checks" after the RPMs
have been generated. One of these tests the RPM scriptlets for idempotence.
Without this line in the specfile, the check fails on SLE_12 because it cannot
find the systemd-tmpfiles binary.
ceph.spec.in: Standardize systemd preun and postun scripts
Currently, the main ceph package and the ceph-radosgw package behave
differently on upgrade. This commit unifies their behavior
as follows:
On package removal, disable and stop all related systemd units.
On package upgrade, do nothing unless there is a file /etc/sysconfig/ceph
containing a parameter CEPH_AUTO_RESTART_ON_UPGRADE. If the parameter is set
to "yes", restart the systemd units if and only if they are running.
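The upgrade rule above can be modelled as a small decision function. This is a sketch under assumptions, not the scriptlet itself (the real logic lives in the spec file's shell scriptlets): the function names, the simple KEY=value parsing, and the "last assignment wins" behaviour are all hypothetical.

```python
def auto_restart_enabled(sysconfig_text: str) -> bool:
    """Return True if CEPH_AUTO_RESTART_ON_UPGRADE is set to "yes"."""
    enabled = False
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "CEPH_AUTO_RESTART_ON_UPGRADE":
            # Assumption: a later assignment overrides an earlier one,
            # as it would when the file is sourced by a shell.
            enabled = value.strip().strip("\"'") == "yes"
    return enabled


def units_to_restart(sysconfig_text, running_units):
    """On upgrade, restart only units that are already running.

    sysconfig_text is None when /etc/sysconfig/ceph does not exist.
    """
    if sysconfig_text is None or not auto_restart_enabled(sysconfig_text):
        return []
    return list(running_units)
```

With no sysconfig file, or with the parameter set to anything other than "yes", the function returns an empty list, matching the "do nothing" default.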
Nathan Cutler [Fri, 2 Oct 2015 10:15:08 +0000 (12:15 +0200)]
ceph.spec.in: fix for out-of-memory errors in OBS
Add "--param ggc-min-expand=20 --param ggc-min-heapsize=32768"
to RPM_OPT_FLAGS, ensuring gcc does not add debug symbols and is
more aggressive about garbage collection.
Thanks to Berthold Gunreben for debugging this issue.
ceph-disk: fix dmcrypt_map() usage for LUKS activate
29431944c77adbc3464a8faeb7e052b24f821780 added a call to dmcrypt_map()
during disk activation. The change is not suitable for use alongside
the recently added dmcrypt LUKS support, because:
- The callers don't correctly provide cryptsetup_parameters or luks
arguments.
- dmcrypt_map() calls LuksFormat, which should never be performed
during disk activation.
- The key file paths don't carry the luks suffix when required.
This commit addresses these issues. Corresponding tests and a udev file
update will follow.
David Disseldorp [Mon, 11 May 2015 23:45:34 +0000 (01:45 +0200)]
systemd: activate disks via systemd service instead of udev
The udev(7) man page states:
RUN
...
This can only be used for very short-running foreground tasks. Running
an event process for a long period of time may block all further
events for this or a dependent device.
Starting daemons or other long-running processes is not appropriate
for udev; the forked processes, detached or not, will be
unconditionally killed after the event handling has finished.
ceph-disk activate is far from a short-running task:
- check whether path is a block dev, for dirs call through to
activate_dir()
- call blkid to obtain the filesystem type for the block dev
- pull mount options from hard-coded ceph.conf file
- mount the OSD dev at a temporary path
- check the ceph magic for mounted filesystem
- read cluster uuid and locate corresponding /etc/ceph/{cluster}.conf
path
- read or generate (if missing) the OSD uuid
- create a file indicating init system usage (systemd)
- mount the device at a second (final) location
- umount (lazy) the temporary mount path
- enable the systemd ceph-osd@{osd_id} service
- start the systemd ceph-osd@{osd_id} service
This logic is therefore best left in a systemd service for execution, as
it is less limited in terms of execution time and also allows for
improved event handling in future (fsck, dmcrypt mapping, etc.).
This change sees 95-ceph-osd.rules.systemd trigger ceph-disk activate or
ceph-disk activate-journal via new ceph-disk-activate-journal@.service,
ceph-disk-activate@.service and ceph-disk-dmcrypt-activate@.service
systemd service files.
ceph-disk-dmcrypt-activate@.service makes use of the newly added
--dmcrypt parameter for ceph-disk activate.
Signed-off-by: David Disseldorp <ddiss@suse.de>
(cherry picked from commit 85a894697e6be1240567f10ba2415eb45e58b22c)
[ddiss@suse.de: rebase without systemd/Makefile.am]
Mike Christie [Wed, 29 Jul 2015 09:25:45 +0000 (04:25 -0500)]
osd: add new extent comparison op
This goes with kernel patch
libceph: add support for CMPEXT compare extent requests
and
rbd: add support for COMPARE_AND_WRITE/CMPEXT
This adds support for the CMPEXT request. The request takes
extent.length bytes from the supplied buffer and compares them to the
extent.length bytes at extent.offset on disk. If there is a miscompare, the
OSD returns -EILSEQ, the offset in the buffer where it occurred, and the buffer.
This op is going to be used for SCSI COMPARE_AND_WRITE support. For this
SCSI command, we are required to atomically do the CMPEXT operation and if
successful do a WRITE operation. The kernel rbd client is sending those
two ops in a multi op request.
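The compare-then-write behaviour described above can be sketched as an in-memory model. This is not the OSD or kernel code: the names cmpext and compare_and_write are illustrative, the returned buffer is omitted for brevity, and treating a read past the end of the object as a miscompare is an assumption of the sketch.

```python
import errno


def cmpext(obj: bytes, offset: int, expected: bytes):
    """Compare `expected` with obj[offset:offset + len(expected)].

    Returns (0, None) on a match, or (-EILSEQ, i) where i is the offset
    within `expected` of the first byte that differs.
    """
    on_disk = obj[offset:offset + len(expected)]
    for i, want in enumerate(expected):
        # Assumption: bytes beyond the end of the object never match.
        if i >= len(on_disk) or on_disk[i] != want:
            return -errno.EILSEQ, i
    return 0, None


def compare_and_write(obj: bytearray, offset: int, expected: bytes, new_data: bytes):
    """Model of a compound cmpext + write request: write only on a match."""
    rc, miscompare_at = cmpext(bytes(obj), offset, expected)
    if rc == 0:
        obj[offset:offset + len(new_data)] = new_data
    return rc, miscompare_at
```

In this model the write step is skipped whenever the comparison fails, so the object is left unchanged and the caller learns where the first differing byte was.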
Note: I am still working on the locking for this operation. Is there
a local lock I can take?
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: David Disseldorp <ddiss@suse.de>
(cherry picked from commit 54fbbe64754d2641c84605159122168ecd144bec)
Nathan Cutler [Thu, 13 Aug 2015 13:36:02 +0000 (15:36 +0200)]
ceph.spec.in: test %preun argument is zero for removal-only operations
The %preun section now contains logic for disabling and stopping all the
Ceph systemd units when the ceph package is removed. However, there is no
conditional around it, so the units are disabled and stopped on RPM upgrade
as well as removal.
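For context, RPM passes %preun a count of the package instances that will remain after the operation: 0 on removal, 1 or more on upgrade. A minimal sketch of the missing conditional, with a hypothetical function name, might look like this:

```python
def preun_should_stop_units(remaining_instances: int) -> bool:
    """Decide whether %preun should disable and stop the systemd units.

    remaining_instances models the scriptlet's first argument ($1):
    0 means the package is being removed, >= 1 means it is being upgraded.
    """
    return remaining_instances == 0
```

Only the removal case (argument equal to zero) triggers the disable-and-stop logic; an upgrade leaves the units alone.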
Since we BuildRequire: babeltrace-devel, autoconf will see that babeltrace
is available during the build, and make will build/install the rbd-replay-prep
utility.
Owen Synge [Wed, 24 Jun 2015 18:16:54 +0000 (20:16 +0200)]
ceph.spec.in: add a bcond_with for jemalloc
jemalloc, like tcmalloc, is a high-performance replacement for glibc malloc.
Which is better for Ceph can only be determined through benchmarks and testing.
This patch makes it easier to test this as an RPM install.
Nathan Cutler [Sat, 6 Jun 2015 11:44:20 +0000 (13:44 +0200)]
ceph.spec.in: move specific BuildRequires to where they belong
Move distro-specific BuildRequires out of "common" section and
into the appropriate %if statement in the "specific" section.
Also remove a duplicated "Requires: gdisk".
Fedora 12 has been EOL for a long time. Remove the reference in the
RPM .spec file.
Since RHEL 5 support for Ceph is a work in progress, we won't remove
this entire python_sitelib / python_sitearch conditional for now, since
those are still needed on RHEL 5.
Add the rhel_version macro to make the conditional compatible with
SUSE's OBS.
Ken Dreyer [Wed, 24 Jun 2015 22:39:30 +0000 (16:39 -0600)]
ceph.spec.in: package rbd-replay-prep on all Fedoras
This reverts the change in commit 85517d611b7bf4cb6cbffcd2c65303be0d038264. Since we BuildRequire:
libbabeltrace-devel, autoconf will see that babeltrace is available
during the build, and make will build/install the rbd-replay-prep
utility.
This change also simplifies Fedora selection logic, because Fedora 19 is
EOL, so "%{fedora}" implies "Fedora 20 and above".
Owen Synge [Mon, 8 Jun 2015 15:48:55 +0000 (17:48 +0200)]
ceph.spec.in: BuildRequires sharutils
The uudecode binary is used to build Java-related components, and
uudecode is provided by the sharutils package on all supported
RPM platforms. When building with "--without=cephfs_java",
sharutils is not needed.
Thanks to Nathan Cutler <ncutler@suse.cz> for going into the
details with me.
On OBS without this patch we get the error message:
Nathan Cutler [Wed, 13 May 2015 12:57:48 +0000 (14:57 +0200)]
ceph.spec.in: include SUSE in _with_systemd
The master specfile newly defines a _with_systemd variable that should be true
for the set of distros that are using systemd. Since this set of distros
includes SUSE/openSUSE (at least for the more recent versions where ceph is
supported), this commit sets _with_systemd to true on SUSE/openSUSE.
Ken Dreyer [Thu, 9 Apr 2015 17:10:52 +0000 (11:10 -0600)]
ceph.spec.in: set _with_systemd on RHEL 7 and Fedora
Commit 71a5090bca049a43e30a7f0cf99141950ef9c5dd added a "_with_systemd"
conditional to the RPMs, but I erred with the version comparison
operator, so this only applied to RHEL 8+, not RHEL 7+.
Adjust the conditional so that it will really apply to RHEL 7+. While
we're here, add Fedora as well.
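The off-by-one effect of the wrong comparison operator can be illustrated as follows. The commit message does not show the exact operators used in the specfile, so the ">" versus ">=" pair below is an assumption chosen to reproduce the described symptom, and both function names are hypothetical.

```python
def with_systemd_buggy(rhel_major: int) -> bool:
    # Assumed form of the mistake: strictly greater than 7 excludes RHEL 7.
    return rhel_major > 7


def with_systemd_fixed(rhel_major: int) -> bool:
    # Intended behaviour: RHEL 7 and later use systemd.
    return rhel_major >= 7
```

With the strict comparison, RHEL 7 evaluates to false and only RHEL 8 and later enable the conditional, which matches the problem reported in this commit.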
Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
Reported-by: Boris Ranto <branto@redhat.com>
(cherry picked from commit aa88364f30e2d2f254ade185a83ba263b48e2a73)
Ken Dreyer [Mon, 9 Mar 2015 20:14:57 +0000 (14:14 -0600)]
ceph.spec.in: fix handling of /var/run/ceph
Prior to this commit, we didn't install /var/run/ceph as a normal
directory. We used the %ghost directive and created the directory with
a "mkdir" command in %post.
This was lacking in several ways:
1) Simplicity: there is no need to use %ghost; other packages (e.g.
mariadb) simply use a normal %dir for their socket directory.
2) RPM does not have control over the permissions of the /var/run/ceph
directory. This does not interact well with "rpm -V". Moreover,
once Ceph itself gets unprivileged user support, RPM itself won't
be able to set the permissions of the directory for a (future)
unprivileged UID.
3) On distributions that use systemd as an init system, /var/run is a
symlink to /run, which is tmpfs. This means that /var/run/ceph does
not persist across reboots on those systems.
Remove the %ghost directive; it makes more sense for RPM to simply
install this directory like the rest of the %files.
Add a "_with_systemd" conditional so we know which distros use systemd
as their init system. Add the /etc/tmpfiles.d/ceph.conf file on those
distros. See
http://www.freedesktop.org/software/systemd/man/tmpfiles.d.html
Nathan Cutler [Fri, 26 Jun 2015 11:13:33 +0000 (13:13 +0200)]
logrotate.conf: fixes for systemd
Before this patch, the command 'logrotate -f /etc/logrotate.d/ceph'
was generating an error "Failed to reload ceph.target: Job type reload is not
applicable for unit ceph.target".
Before we issue systemctl reload, check that there is at least
one active ceph-* service. (The hyphen is significant.)
Since we use grep, make the grep package a dependency.
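The "at least one active ceph-* service" check can be sketched like this. The actual patch uses grep inside the logrotate script; the Python below only models the matching rule, and the input format (a plain list of active unit names) and the function name are assumptions.

```python
import re

# The hyphen matters: "ceph.target" must not match, "ceph-osd@0.service" must.
_CEPH_SERVICE = re.compile(r"^ceph-.*\.service$")


def should_reload(active_units) -> bool:
    """Return True if any active unit is a ceph-* service."""
    return any(_CEPH_SERVICE.match(unit) for unit in active_units)
```

Because ceph.target contains no hyphen after "ceph", it is ignored, so the reload is skipped when no Ceph daemons are running.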
Owen Synge [Mon, 26 Jan 2015 15:20:20 +0000 (16:20 +0100)]
New rich init system detection.
Uses both a database and detection of management commands to find the init system.
Logs an error if one of these two systems fails.
Raises an error if the two systems disagree.
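The rule described above (log when one detector fails, raise when the two disagree) can be sketched as follows. The detector callables, the exception name, and the return value when both detectors fail are hypothetical; the real patch's interfaces are not shown in this log.

```python
import logging

log = logging.getLogger("init_detect")


class InitSystemConflict(Exception):
    """Raised when the two detection methods report different init systems."""


def detect_init(from_database, from_commands):
    """Combine two detectors, each a zero-argument callable.

    A detector returns an init system name, or returns None / raises
    an exception when it cannot decide.
    """
    results = {}
    for name, probe in (("database", from_database), ("commands", from_commands)):
        try:
            results[name] = probe()
        except Exception as exc:
            log.error("init detection via %s failed: %s", name, exc)
            results[name] = None
        else:
            if results[name] is None:
                log.error("init detection via %s found nothing", name)
    db, cmd = results["database"], results["commands"]
    if db and cmd and db != cmd:
        raise InitSystemConflict(f"database says {db!r}, commands say {cmd!r}")
    # Assumption: fall back to whichever detector succeeded (None if neither).
    return db or cmd
```

A caller would pass one callable per detection method and treat InitSystemConflict as a hard failure.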
Testing notes:
- works on SLE12
- works on openSUSE 13.1
- works on Scientific 6.4
- works on debian 7.7 (wheezy)
- works on debian 8 (jessie)
Owen Synge [Wed, 7 Jan 2015 10:36:24 +0000 (11:36 +0100)]
radosgw systemd support
Added radosgw systemd support and an associated prestart script.
- With improved checking over the first revision.
- ceph-radosgw-prestart.sh now installed in /usr/lib/ceph-radosgw
Owen Synge [Wed, 3 Dec 2014 11:32:34 +0000 (12:32 +0100)]
Fix overflowing journal partitions.
This fixes bnc#896406. When using ceph-disk to create a journal
partition in the next available partition, ceph-disk did not provide
a clear error message if there was not enough space.
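The kind of up-front check this fix implies can be sketched as below. This is not the ceph-disk code: the function name, the exception class, and the way free space is obtained are all assumptions made for illustration.

```python
class JournalSpaceError(Exception):
    """Raised when a device cannot hold the requested journal partition."""


def check_journal_fits(device: str, free_bytes: int, journal_bytes: int) -> None:
    """Fail early with a clear message instead of overflowing the device."""
    if journal_bytes > free_bytes:
        raise JournalSpaceError(
            f"{device}: journal needs {journal_bytes} bytes "
            f"but only {free_bytes} bytes are free"
        )
```

Calling the check before creating the partition turns an obscure partitioning failure into a message that names the device and both sizes.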
Jason Dillaman [Wed, 21 Oct 2015 17:12:48 +0000 (13:12 -0400)]
librbd: potential assertion failure during cache read
It's possible for a cache read from a clone to trigger a writeback if a
previous read op determined the object doesn't exist in the clone,
followed by a cached write to the non-existent clone object, followed
by another read request to the same object. This causes the cache to
flush the pending writeback ops while not holding the owner lock.