Sage Weil [Tue, 4 Feb 2014 20:14:14 +0000 (12:14 -0800)]
crush: fix off-by-one errors in total_tries refactor
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.
Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.
This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.
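The retry-versus-try distinction can be modeled in a few lines. This is an illustrative sketch, not the CRUSH code: the function name attempts_made and the always-failing loop are assumptions, and only the ftotal/tries names and the <= versus < comparison come from the message above.

```python
def attempts_made(tries, inclusive):
    """Count how many attempts a retry loop makes when every attempt fails.

    Hypothetical model: ftotal counts failures so far, and the loop
    retries while the comparison against `tries` still holds.
    """
    ftotal = 0
    attempts = 0
    while True:
        attempts += 1  # make one attempt; in this model it always fails
        ftotal += 1
        retry = ftotal <= tries if inclusive else ftotal < tries
        if not retry:
            return attempts


# With <=, passing 1 still allows one retry (two attempts in total).
print(attempts_made(1, inclusive=True))
# With <, passing 1 means exactly one attempt and no retries.
print(attempts_made(1, inclusive=False))
```

Under this model a retry-counting tunable T compared with <= gives the same number of attempts as T + 1 compared with <, which is the compensation described for crush_do_rule.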
Yehuda Sadeh [Tue, 4 Feb 2014 18:34:02 +0000 (10:34 -0800)]
rgw: fix rgw_read_user_buckets() use of max param
Fixes: #7336
The rgw_read_user_buckets() treated the max param as the max number of
entries to request in a single op, but always fetched the entire list
of buckets. This is wrong, as it should have treated it as the total
number of entries requested. All the callers assume the latter.
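A minimal sketch of the intended semantics, where max bounds the total number of entries returned rather than the size of a single request. The names read_user_buckets, list_chunk, chunk_size and make_backend are hypothetical stand-ins and do not reflect the rgw API.

```python
def read_user_buckets(list_chunk, max_entries, chunk_size=1000):
    """Return at most max_entries names, fetched chunk_size at a time.

    list_chunk(marker, count) is a hypothetical backend call that returns
    (entries, truncated) for the sorted entries that follow `marker`.
    """
    result = []
    marker = ""
    truncated = True
    while truncated and len(result) < max_entries:
        # Never ask for more than is still needed to reach the total.
        want = min(chunk_size, max_entries - len(result))
        entries, truncated = list_chunk(marker, want)
        if not entries:
            break
        result.extend(entries)
        marker = entries[-1]
    return result


def make_backend(names):
    """Build an in-memory stand-in for the listing backend (demo only)."""
    names = sorted(names)

    def list_chunk(marker, count):
        rest = [n for n in names if n > marker]
        return rest[:count], len(rest) > count

    return list_chunk
```

For example, with 25 buckets in the backend, read_user_buckets(make_backend(names), 10, chunk_size=4) stops after 10 entries instead of walking the whole list.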
Sage Weil [Tue, 4 Feb 2014 05:12:41 +0000 (21:12 -0800)]
client: fix warnings
client/Client.cc: In member function 'int Client::_read(Fh*, int64_t, uint64_t, ceph::bufferlist*)':
warning: client/Client.cc:5893:27: comparison between signed and unsigned integer expressions [-Wsign-compare]
client/Client.cc: In member function 'int Client::_write(Fh*, int64_t, uint64_t, const char*)':
warning: client/Client.cc:6235:30: comparison between signed and unsigned integer expressions [-Wsign-compare]
Sage Weil [Mon, 3 Feb 2014 21:19:14 +0000 (13:19 -0800)]
mon: fix 'mds set allow_new_snaps'
We had already added this as a flag (set/unset) when I generalized the
'mds set_max_mds' to be 'ceph mds set <var> <val>'. Switch the snaps
flag to be a key/value with true/false (similar to the hashpspool
pool flag) since there are fewer users and the var/val approach is more
general.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Mon, 3 Feb 2014 16:54:14 +0000 (08:54 -0800)]
client: use 64-bit value in sync read eof logic
The file size can jump to a value that is very much larger than our current
position (for example, it could be a disk image file that gets a sparse
write at a large offset). Use a 64-bit value so that 'some' doesn't
overflow.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: John Spray <john.spray@inktank.com>
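The overflow can be illustrated by truncating the same difference to 32 and to 64 bits. This is only a sketch: it assumes 'some' holds the distance between the file size and the current position, and the helper names are made up for the example. ctypes integer types do no overflow checking, so they stand in for fixed-width C integers here.

```python
import ctypes


def remaining_32(size, pos):
    """Bytes left before EOF, stored in a signed 32-bit integer."""
    return ctypes.c_int32(size - pos).value


def remaining_64(size, pos):
    """Bytes left before EOF, stored in a signed 64-bit integer."""
    return ctypes.c_int64(size - pos).value


size = 5 * 2**30  # file size jumps to 5 GiB after a sparse write
pos = 4096        # current read position

print(remaining_32(size, pos) == size - pos)  # False: the value wrapped
print(remaining_64(size, pos) == size - pos)  # True
```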
- fix a couple of typos for repo configuration and service restart
- the ceph package must be installed on RPM distros since the init
script relies on ceph-conf
- Note on radosgw service name for RPM distros
David Zafman [Wed, 29 Jan 2014 03:18:32 +0000 (19:18 -0800)]
osd: Move the rest of scrubbing routines to the backend
Move enum scrub_error_type to osd_types.h
Move PG::_compare_scrub_objects to ReplicatedBackend::be_compare_scrub_objects
Move PG::_select_auth_object to ReplicatedBackend::be_select_auth_object
Move PG::_compare_scrubmaps to ReplicatedBackend::be_compare_scrubmaps
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Fri, 31 Jan 2014 15:19:10 +0000 (07:19 -0800)]
os/KeyValueStore: fix warning
./os/KeyValueStore.h: In member function ‘std::string KeyValueStore::strip_object_key(uint64_t)’:
warning: ./os/KeyValueStore.h:173:31: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat=]
Sage Weil [Thu, 30 Jan 2014 23:13:05 +0000 (15:13 -0800)]
mon/OSDMonitor: encode full OSDMap with same feature bits as the Incremental
Each monitor is independently encoding the full OSDMap and storing it in
its local store. Sometimes this happens when we do not have a valid value
for quorum_features (for example, it can happen during sync).
Instead, use the feature bits the Incremental was encoded with for the full
OSDMap so that they always match.
Note that this is conveniently the *only* place in the mon where we encode
the full OSDMap, so we're capturing all paths. Yay!
Sage Weil [Thu, 30 Jan 2014 23:09:58 +0000 (15:09 -0800)]
OSDMap: note encoding features in Incremental encoding
The monitor will need to know what features the incremental was encoded
with so that it can encode the OSDMap using the same bits. Introduce a
member that is set during decode. During encode, encode the value passed
in by the caller.
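The idea can be sketched with a toy encoder. Everything here is hypothetical: the class layout, the JSON wire format and the field name encode_features are illustrative assumptions and do not reflect the real OSDMap encoding.

```python
import json


class Incremental:
    """Toy stand-in for an incremental map update (not the Ceph class)."""

    def __init__(self, epoch):
        self.epoch = epoch
        self.encode_features = 0  # only meaningful after decode()

    def encode(self, features):
        # Record the caller-supplied feature bits alongside the payload.
        return json.dumps({"encode_features": features, "epoch": self.epoch})

    @classmethod
    def decode(cls, blob):
        fields = json.loads(blob)
        inc = cls(fields["epoch"])
        # Remember which features the encoder used.
        inc.encode_features = fields["encode_features"]
        return inc


blob = Incremental(epoch=7).encode(features=0x5)
decoded = Incremental.decode(blob)
print(hex(decoded.encode_features))
```

A consumer that later encodes the full map could then pass decoded.encode_features instead of a locally computed feature set.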
Loic Dachary [Wed, 29 Jan 2014 10:00:08 +0000 (11:00 +0100)]
ceph-disk: support and test the absence of PATH
Although this is not exactly the context in which ceph-disk is run when
invoked by udev, it makes sure there is at least one sensible way of
using it when PATH is undefined.
Also make src/ceph.in not fail if PATH is not defined.
Haomai Wang [Thu, 30 Jan 2014 03:11:12 +0000 (19:11 -0800)]
FileStore: avoid leveldb check for xattr when possible
Maintain an internal xattr called "spill_out" that indicates whether we
(may) have xattrs stored in omap. If the attribute is set, its value
indicates whether or not we need to look in omap. If the attribute is not
present, then we do not know and will also need to check.
For new stores, this will avoid the overhead of consulting omap in the
general case until a particular object gets enough (or big enough) xattrs and
spills over.
For old stores, we will effectively fall back to the previous behavior
of always checking.
Implements #7059
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
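The three-way decision described above (set to "spilled", set to "not spilled", or absent) can be sketched as follows. The dict-based model, the helper names and the "0"/"1" encodings are assumptions made for illustration; only the attribute name "spill_out" comes from the message.

```python
SPILL_OUT_ATTR = "spill_out"  # attribute name taken from the commit message
NO_SPILL, SPILL = "0", "1"    # assumed encodings, for illustration only


def must_check_omap(inline_xattrs):
    """Decide whether omap has to be consulted for this object's xattrs."""
    flag = inline_xattrs.get(SPILL_OUT_ATTR)
    if flag is None:
        return True  # old store: unknown, so fall back to checking
    return flag == SPILL


def get_xattr(inline_xattrs, omap, name):
    """Look up an xattr inline first, touching omap only when required."""
    if name in inline_xattrs:
        return inline_xattrs[name]
    if must_check_omap(inline_xattrs):
        return omap.get(name)
    return None
```

An object whose flag says nothing has spilled never pays for the omap lookup, while an object without the flag behaves as before.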
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: don't forget to call close_image() if remove_child() fails
close_image() among other things unregisters a watcher that's been
registered by open_image(). Even though it'll time out in 30 or so
seconds, it's not nice now that we check for watchers before starting
the removal process.
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: check for watchers before trimming an image on 'rbd rm'
Check for watchers before trimming image data to try to avoid getting
into the following situation:
- user does 'rbd rm' on a mapped image with an fs mounted from it
- 'rbd rm' trims (removes) all image data, only header is left
- 'rbd rm' tries to remove a header and fails because krbd has a
watcher registered on the header
- at this point image cannot be unmapped because of the mounted fs
- fs cannot be unmounted because all its data and metadata is gone
Unfortunately, this fix doesn't make it impossible to happen (the
required atomicity isn't there), but it's a big improvement over the
status quo.
Ilya Dryomov [Thu, 30 Jan 2014 11:39:15 +0000 (13:39 +0200)]
pybind: work around find_library() not searching LD_LIBRARY_PATH
Commit b28b64a0b6db ("pybind: use find_library for libcephfs and
librbd") switched us to find_library(), but this function doesn't seem
to respect LD_LIBRARY_PATH. There are numerous python tickets, dating
back several years, but alas. Work around it by using the soname as
a fallback. (rados.py has been fixed by commit e46d2ca067b5 ("fix the
bug ctypes.util.find_library to search for librados failed on
Centos6.4."))
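A minimal sketch of the fallback, assuming the soname is passed in by the caller. The helper name library_name and the injectable finder parameter are hypothetical and exist only to make the logic easy to exercise; the soname shown in the usage note is likewise an assumed example.

```python
from ctypes.util import find_library


def library_name(name, soname, finder=find_library):
    """Return what to hand to ctypes.CDLL for the given library.

    Prefer whatever find_library() reports; if it finds nothing, fall
    back to the bare soname so the dynamic loader can resolve it itself.
    """
    found = finder(name)
    return found if found is not None else soname
```

Usage would look like ctypes.CDLL(library_name("rbd", "librbd.so.1")). When the bare soname is used, the dynamic loader performs the lookup and honors LD_LIBRARY_PATH on its own.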
Haomai Wang [Wed, 29 Jan 2014 09:50:10 +0000 (17:50 +0800)]
Add KeyValueStore implementation
KeyValueStore is another ObjectStore implementation alongside FileStore. It
uses a KV store wrapper (StripObjectMap), which inherits from GenericObjectMap,
to implement the ObjectStore APIs.
Each object has a header key in the KV backend, which encapsulates the object's
metadata, such as its size and the status of its keys. A complete object's data
may be spread across multiple keys. CRUD operations on an object need to access
its header key to learn these details before the actual data keys are fetched.
For now the only KV backend for KeyValueStore is LevelDB; more KV backends
(RocksDB, NVM API) will be introduced in the near future.
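The header-plus-data-keys layout can be sketched with a plain dict standing in for the KV backend. The key shapes, the header fields and the tiny strip size are assumptions chosen for illustration and do not reflect the actual on-disk format.

```python
STRIP_SIZE = 4  # deliberately tiny so the example spans several keys


def write_object(kv, name, data):
    """Store data as a header key plus fixed-size strip keys."""
    strips = [data[i:i + STRIP_SIZE] for i in range(0, len(data), STRIP_SIZE)]
    kv[(name, "header")] = {"size": len(data), "strips": len(strips)}
    for index, strip in enumerate(strips):
        kv[(name, index)] = strip


def read_object(kv, name):
    """Read the header first, then fetch the data keys it describes."""
    header = kv[(name, "header")]
    data = b"".join(kv[(name, i)] for i in range(header["strips"]))
    return data[:header["size"]]


kv = {}
write_object(kv, "obj", b"hello world")
print(read_object(kv, "obj"))
```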
Haomai Wang [Wed, 29 Jan 2014 09:46:00 +0000 (17:46 +0800)]
Add a new KV wrapper GenericObjectMap
We already have DBObjectMap, which implements ObjectMap and other
interfaces, and ObjectMap.h implies that ObjectMap is used to encapsulate
the FileStore key/value store. The current DBObjectMap implementation has
limitations, such as the lack of a "coll_t" in the "key", complicated
hard-coded prefixes, and inflexible extension.
So, in order to provide a more flexible API and a clearer implementation to
wrap the KV store, I copied the original DBObjectMap and redesigned part of the
implementation, adding a "coll_t" argument to all APIs and exporting "prefix"
to callers. Prefixes are divided into two parts, "INTERN" and "USER". "INTERN"
keys are managed internally by GenericObjectMap and "USER" keys are managed by
callers. In addition, miscellaneous fixes are included, such as clearer member
function names and an extensible header structure.