Haomai Wang [Thu, 30 Jan 2014 03:11:12 +0000 (19:11 -0800)]
FileStore: avoid leveldb check for xattr when possible
Maintain an internal xattr called "spill_out" that indicates whether we
(may) have xattrs stored in omap. If attribute is set, it will indicate
that we should or should not look in omap. If the attribute is not
present, then we do not know and will also need to check.
For new stores, this will avoid the overhead of consulting omap in the
general case until a particular objects gets enough (or big) xattrs and
spills over.
For old stores, we will effectively fall back to the previous behavior
of always checking.
Implements #7059
Signed-off-by: Haomai Wang <haomaiwang@gmail.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: don't forget to call close_image() if remove_child() fails
close_image() among other things unregisters a watcher that's been
registered by open_image(). Even though it'll timeout in 30 or so
seconds, it's not nice now that we check for watchers before starting
the removal process.
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: check for watchers before trimming an image on 'rbd rm'
Check for watchers before trimming image data to try to avoid getting
into the following situation:
- user does 'rbd rm' on a mapped image with an fs mounted from it
- 'rbd rm' trims (removes) all image data, only header is left
- 'rbd rm' tries to remove a header and fails because krbd has a
watcher registered on the header
- at this point image cannot be unmapped because of the mounted fs
- fs cannot be unmounted because all its data and metadata is gone
Unfortunately, this fix doesn't make it impossible to happen (the
required atomicity isn't there), but it's a big improvement over the
status quo.
Ilya Dryomov [Thu, 30 Jan 2014 11:39:15 +0000 (13:39 +0200)]
pybind: work around find_library() not searching LD_LIBRARY_PATH
Commit b28b64a0b6db ("pybind: use find_library for libcephfs and
librbd") switched us to find_library(), but this function doesn't seem
to respect LD_LIBRARY_PATH. There are numerous python tickets, dating
back several years, but alas. Work around it by using the soname as
a fallback. (rados.py has been fixed by commit e46d2ca067b5 ("fix the
bug ctypes.util.find_library to search for librados failed on
Centos6.4.")
Haomai Wang [Wed, 29 Jan 2014 09:50:10 +0000 (17:50 +0800)]
Add KeyValueStore implementation
KeyValueStore is another ObjectStore implementation with FileStore. It
uses KV store wrapper(StripObjectMap) which inherited GenericObjectMap
to implement ObjectStore APIs.
Each object has a header key in KV backend, which encapsulated the metadata
of object such as size, the status of keys. A complete object data maybe spread
around multi keys. The CRUD operation of object need to access the header key
of object to know the details, then the actual data keys will be get.
Now the actual KV backend of KeyValueStore is only LevelDB, more KV backend
(RocksDB, NVM API) will be introduced in the near future.
Haomai Wang [Wed, 29 Jan 2014 09:46:00 +0000 (17:46 +0800)]
Add a new KV wrapper GenericObjectMap
Now we already have DBObjectMap which implement ObjectMap and other
interfaces, and ObjectMap.h implied that ObjectMap is used to encapsulates
the FileStore key value store. There exists limitation in current DBObjectMap
implementation, such as lacking of "coll_t" in "key", complicated prefix
hard-coded and inflexible extending.
So in order to provide a more flexible API and clear implementation to wrap KV
store, I copy the origin DBObjectMap and redesign the partial implementation.
Adding "coll_t" argument to all API and export "prefix" to callers. Prefixes
are divided into two parts "INTERN" and "USER". "INTERN" keys used by self to
manage and "USER" keys are managed by callers. Besides above, misc fixes are
imported such as more clear member function name and extendible header
structure.
Josh Durgin [Wed, 29 Jan 2014 01:26:58 +0000 (17:26 -0800)]
ceph-disk: run the right executables from udev
When run by the udev rules, PATH is not defined. Thus,
ceph-disk-activate relies on its which() function to locate the
correct executable. The which() function used os.defpath if none was
set, and this worked for anything using it.
ad6b4b4b08b6ef7ae8086f2be3a9ef521adaa88c added a new default value to
PATH, so only /usr/bin was checked by callers that did not use
which(). This resulted in the mount command not being found when
ceph-disk-activate was run by udev, and thus osds failing to start
after being prepared by ceph-deploy.
Make ceph-disk consistently use the existing helpers (command() and
command_check_call()) that use which(), so lack of PATH does not
matter. Simplify _check_output() to use command(),
another wrapper around subprocess.Popen.
Yehuda Sadeh [Mon, 27 Jan 2014 22:11:43 +0000 (14:11 -0800)]
rgw: fix multipart min part size
As part of the fix for wip-7169 it turned out that we removed
min_part_size. Looking back, the original implementation was broken
anyway and didn't do anything. This fixes it and makes it configurable.
Yehuda Sadeh [Fri, 17 Jan 2014 17:30:30 +0000 (09:30 -0800)]
rgw: fix multipart upload listing
Fixes: #7169
A separate fix has been created for dumpling.
Previously we read the entire list of parts, disregarding the actual
marker and the requested max parts. This fix refactors the way we read
the list of parts (doing it in parts, using marker).
Create new upload-id format that is used to identify uploads with sorted
omap entries. Make sure we're backward compatible and handle correctly
mixed-versions rgw uploads.
There's a window in-between receiving an MOSDPGTemp message from an OSD
and actually handling it that may lead to the pool the pg temps refer to
no longer existing. This may happen if the MOSDPGTemp message is queued
pending dispatching due to an on-going proposal (maybe even the pool
removal).
This patch fixes such behavior in two steps:
1. Check if the pool exists in the osdmap upon preprocessing
- if pool does not exist in the osdmap, then the pool must have been
removed prior to handling the message, but after the osd sent it.
- safe to ignore the pg update
2. If all pg updates in the message have been ignored, ignore the whole
message. Otherwise, let prepare handle the rest.
3. Recheck if pool exists in the osdmap upon prepare
- We may have ignored this pg back in preprocess, but other pgs in the
message may have led the message to be passed on to prepare; ignore
pg update once more.
4. Check if pool is pending removal and ignore pg update if so.
We delegate checking the pending value to prepare_pgtemp() because in this
case we should only ignore the update IFF the pending value is in fact
committed. Otherwise we should retry the message. prepare_pgtemp() is
the appropriate place to do so.
Sage Weil [Tue, 28 Jan 2014 18:26:12 +0000 (10:26 -0800)]
buffer: make 0-length splice() a no-op
This was causing a problem in the Striper, but fixing it here will avoid
corner cases all over the tree. Note that we have to bail out before
the end-of-buffer check to avoid hitting that check when the bufferlist is
also empty.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
There's a window in-between receiving an MOSDPGTemp message from an OSD
and actually handling it that may lead to the pool the pg temps refer to
no longer existing. This may happen if the MOSDPGTemp message is queued
pending dispatching due to an on-going proposal (maybe even the pool
removal).
This patch fixes such behavior in two steps:
1. Check if the pool exists in the osdmap upon preprocessing
- if pool does not exist in the osdmap, then the pool must have been
removed prior to handling the message, but after the osd sent it.
- safe to ignore the pg update
2. If all pg updates in the message have been ignored, ignore the whole
message. Otherwise, let prepare handle the rest.
3. Recheck if pool exists in the osdmap upon prepare
- We may have ignored this pg back in preprocess, but other pgs in the
message may have led the message to be passed on to prepare; ignore
pg update once more.
4. Check if pool is pending removal and ignore pg update if so.
We delegate checking the pending value to prepare_pgtemp() because in this
case we should only ignore the update IFF the pending value is in fact
committed. Otherwise we should retry the message. prepare_pgtemp() is
the appropriate place to do so.
Fixes: 7116 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Fri, 24 Jan 2014 19:04:37 +0000 (11:04 -0800)]
OSDMap: fix damaging input osdmap from remove_down_temps
The default copy constructor copies shared_ptrs do vectors that are then
modified by apply_incremental, which means that the const osdmap argument
isn't in fact const. Fix this by doing a deep(ish) copy.
Fixes: #7060 Signed-off-by: Sage Weil <sage@inktank.com>
Derek Yarnell [Mon, 27 Jan 2014 19:27:51 +0000 (12:27 -0700)]
packaging: apply udev hack rule to RHEL
In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.
Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)
Derek Yarnell [Mon, 27 Jan 2014 19:27:51 +0000 (12:27 -0700)]
packaging: apply udev hack rule to RHEL
In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.
Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)
Fixes http://tracker.ceph.com/issues/7245
Signed-off-by: Ken Dreyer <ken.dreyer@inktank.com>
Loic Dachary [Wed, 8 Jan 2014 11:41:06 +0000 (12:41 +0100)]
mon: shell test helpers to run MONs from sources
The intent is to make it more convenient to reproduce a specific mon
behavior and observe the result by grepping the logs. It can be handy
for bug diagnostic. The test could be included as a unit test to
be run on make check.
The setup function will prepare a directory and kill leftover from a
previous run. The teardown function cleans up on success. The run
function is expected to be provided by the calling script and can make
use of the run_mon function to mkfs + run a monitor.
Loic Dachary [Sun, 26 Jan 2014 12:32:57 +0000 (13:32 +0100)]
unittests: fail early when low on disk
Scripts from qa that are run as unittests via test/vstart_wrapper.sh may
fail because the partition on which it runs is low on space ( 95% full
). When it happens the cause of the problem may be unclear because it
is likely to show only as a client not being able to reach the mon.
A test is added in test/vstart_wrapper.sh to verify the disk space usage
using the same method as the mon would and fail with a detailed error if
it is the case.
Yehuda Sadeh [Tue, 14 Jan 2014 22:48:16 +0000 (14:48 -0800)]
rgw: quota thread for full user stats sync
Get user stats up to date periodically. Add configurables for different
periods, whether we update idle users.
Make sure radosgw-admin does not start the quota threads.
Greg Farnum [Fri, 24 Jan 2014 19:07:13 +0000 (11:07 -0800)]
mon: do not use CEPH_FEATURES_ALL for things that touch the disk
We want to encode with our quorum_features instead. Remaining uses of
CEPH_FEATURES_ALL are:
1) when the Elector is sharing its supported features
2) in a MonMap function which is used by monmaptool
3) In the Monitor for winning a standalone election, for ephemeral data,
and for doing mkfs (when we necessarily don't have quorum_features).
4) When doing ceph-mon --inject-monmap (we don't persist the quorum_features
to disk, so we can't use them here).
5) in MMonElection, for doing the default monmap encoding (which is
re-encoded later if the final features don't match CEPH_FEATURES_ALL).
6) As the default encode features for the OSDMap (the monitor always
supplies quorum_features instead).
Greg Farnum [Thu, 23 Jan 2014 22:52:40 +0000 (14:52 -0800)]
Elector: send an OP_NAK MMonElection to old peers who support it
Only new monitors support receiving OP_NAK from a peer without crashing, but
when we add new required features in the future, our monitors can accept
an OP_NAK message which tells them what features they're missing. Then they
will print out an error message and shut down.
(Unfortunately, doing a clean shutdown from here would require a lot of
infrastructure, so we just call exit(0).)
Greg Farnum [Thu, 23 Jan 2014 21:08:26 +0000 (13:08 -0800)]
Elector: ignore messages from mons without required feature capabilities
We maintain a list of required_features which the other monitor's features
must supply. This starts out at 0 and is initialized from the monitor's
list of features whenever we start electing.
Despite the scary sound of "just ignore it", this is safe: the monitor
will only record features as required once a quorum has formed in which
every monitor supports them. After that happens, monitors which do not
support those features will be unable to read the whole mon store/understand
the pg reports/whatever else, so letting them into the quorum would be buggy
behavior.
So if we ignore a monitor, it will not be able to start nor join
an election round with anybody who was in our quorum -- that is, the
ignored monitor cannot form a separate quorum. By ignoring it here, we
also prevent it from endlessly calling elections against the real
quorum.
Unfortunately there is no way to communicate to old monitors that they
cannot join the quorum -- there are no existing messages for that purpose,
and eg adding a new op to the MMonElection message will just cause it
to crash, which we don't want to do either.
Greg Farnum [Thu, 23 Jan 2014 02:01:11 +0000 (18:01 -0800)]
Monitor: introduce a function that translates quorum features into disk features
After an election, we call apply_quorum_to_compatset_features(). It translates
from the quorum features into monitor disk state features we care about and
adds them to the disk store. This prevents an older daemon from starting up
on a store it doesn't understand.
While strictly speaking we don't need to add the EC feature until we create
an EC pool, all monitors which speak the new OSDMap encoding also support EC
pools, and having more than one feature makes the pattern clearer.
Yehuda Sadeh [Mon, 13 Jan 2014 22:19:27 +0000 (14:19 -0800)]
rgw, cls_user: fix bucket creation
There's a single op to create and update the user bucket info, however,
the cases differ a bit, as we only need to guard against ENOENT if we're
updating the info.