Sage Weil [Wed, 4 Dec 2013 05:51:26 +0000 (21:51 -0800)]
osd/OSDMonitor: accept 'osd pool set ...' value as string
Newer monitors take this as a CephString. Accept that so that if we are
mid-upgrade and get a forwarded message using the alternate schema from
a future mon we will handle it properly.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
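A minimal Python sketch of the idea in the commit above: accept the value whether it arrives as an integer (older schema) or as a string (newer schema). The function name normalize_value() is hypothetical and is not the monitor's actual C++ code.

```python
def normalize_value(raw):
    """Accept a value sent as an int or as a string (hypothetical helper).

    Numeric strings are converted to int so both schemas yield the same
    result; non-numeric strings are returned unchanged.
    """
    if isinstance(raw, int) and not isinstance(raw, bool):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        try:
            return int(text)
        except ValueError:
            return text
    raise TypeError("unsupported value type: %r" % type(raw))


if __name__ == "__main__":
    print(normalize_value(3))
    print(normalize_value("3"))
    print(normalize_value("true"))
```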
Josh Durgin [Mon, 25 Nov 2013 21:43:43 +0000 (13:43 -0800)]
init, upstart: prevent daemons being started by both
There can be only one init system starting a daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit,
but if an upstart file is present in the daemon's directory, upstart
will try to start it, regardless of host entries in ceph.conf.
If there's an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.
Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit nor upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.
Backport: emperor, dumpling
Reported-by: Tim Spriggs <tims@uahirise.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5e34beb61b3f5a1ed4afd8ee2fe976de40f95ace)
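The marker-file check described above can be sketched in Python as follows. This is only an illustration: the real logic lives in the init scripts, and the marker file names "sysvinit" and "upstart" as well as the helper should_start() are assumptions.

```python
import os
import tempfile

# Assumed marker file names; each init system checks for the other's marker.
OTHER_MARKER = {"sysvinit": "upstart", "upstart": "sysvinit"}


def should_start(daemon_dir, init_system):
    """Return True if init_system may start the daemon in daemon_dir.

    The daemon is skipped when the marker file of the other init system
    is present in the daemon's directory.
    """
    other = OTHER_MARKER[init_system]
    return not os.path.exists(os.path.join(daemon_dir, other))


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    # No marker files at all: an older install, still allowed to start.
    print(should_start(d, "sysvinit"))
    # An upstart marker is present: sysvinit must leave the daemon alone.
    open(os.path.join(d, "upstart"), "w").close()
    print(should_start(d, "sysvinit"), should_start(d, "upstart"))
```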
Samuel Just [Mon, 4 Nov 2013 05:02:36 +0000 (21:02 -0800)]
OSD: allow project_pg_history to handle a missing map
If we get a peering message for an old map we don't have, we
can throw it out: the sending OSD will learn about the newer
maps and update itself accordingly, and we don't have the
information to know if the message is valid. This situation
can only happen if the sender was down for a long enough time
to create a map gap and its PGs have not yet advanced from
their boot-up maps to the current ones, so we can rely on it
Fixes: #6712
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit cd0d612e1abdf5c87082eeeccd4ca09dd14fd737)
Yehuda Sadeh [Mon, 19 Aug 2013 23:56:27 +0000 (16:56 -0700)]
rgw: bucket meta remove don't overwrite entry point first
Fixes: #6056
When removing a bucket metadata entry we first unlink the bucket
and then we remove the bucket entrypoint object. Originally
when unlinking the bucket we first overwrote the bucket entrypoint
entry marking it as 'unlinked'. However, this is not really needed
as we're just about to remove it. The original version triggered
a bug, as we needed to propagate the new header version first (which
we didn't do, so the subsequent bucket removal failed).
Josh Durgin [Mon, 18 Nov 2013 22:39:12 +0000 (14:39 -0800)]
osd: fix bench block size
The command was declared to take 'size' in dumpling, but was trying to
read 'bsize' instead, so it always used the default of 4MiB. Change
the bench command to read 'size', so it matches what existing clients
are sending.
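As a rough Python illustration of the mismatch described above (the actual code is C++ in the OSD): looking up a key that clients never send always yields the default. The dictionary-based command map and the helper name are assumptions.

```python
DEFAULT_BENCH_BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, the default named above


def bench_block_size(cmdmap):
    """Read the block size from the 'size' key that clients actually send."""
    return int(cmdmap.get("size", DEFAULT_BENCH_BLOCK_SIZE))


if __name__ == "__main__":
    cmd = {"prefix": "bench", "size": 65536}
    # Reading the wrong key silently falls back to the default.
    print(cmd.get("bsize", DEFAULT_BENCH_BLOCK_SIZE))
    # Reading the declared key honours the client's request.
    print(bench_block_size(cmd))
```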
David Zafman [Wed, 25 Sep 2013 16:19:16 +0000 (09:19 -0700)]
os, osd, tools: Add backportable compatibility checking for sharded objects
OSD
  New CEPH_OSD_FEATURE_INCOMPAT_SHARDS
FileStore
  NEW CEPH_FS_FEATURE_INCOMPAT_SHARDS
  Add FSSuperblock with feature CompatSet in it
  Store sharded_objects state using CompatSet
  Add set_allow_sharded_objects() and get_allow_sharded_objects() to FileStore/ObjectStore
  Add read_superblock()/write_superblock() internal filestore functions
ceph_filestore_dump
  Add OSDsuperblock to export format
  Use CompatSet from OSD code itself in filestore-dump tool
  Always check compatibility of OSD features with on-disk features
  On import verify compatibility of on-disk features with export data
  Bump super_ver due to export format change
Backport: dumpling, cuttlefish
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c6b83180f9f769de27ca7890f5f8ec507ee743ca)
Excluded from cherry-pick:
  Didn't add set_allow_sharded_objects() and get_allow_sharded_objects() to FileStore/ObjectStore
  Didn't add code to check for incomplete transition to sharded objects in ceph-filestore-dump
rgw: when failing read from client, return correct error
Fixes: #6214
When getting a failed read from client when putting an object
we returned the wrong value (always 0), which in the chunked-
upload case ended up in assuming that the write was done
successfully.
Yehuda Sadeh [Mon, 26 Aug 2013 18:16:08 +0000 (11:16 -0700)]
rgw: quiet down warning message
Fixes: #6123
We don't want to warn about failing to read the region map info
when it is simply not found, only when the read failed with some
other error. In any case it's just a warning.
Josh Durgin [Thu, 24 Oct 2013 15:42:48 +0000 (08:42 -0700)]
rgw: escape bucket and object names in StreamReadRequests
This fixes copy operations for objects that contain unsafe characters,
like a newline, which would return a 403 otherwise, since the GET to
the source rgw would be unable to verify the signature on a partially
valid bucket name.
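To illustrate why escaping matters here, a small Python sketch using the standard urllib.parse.quote(). The helper build_source_path() is hypothetical, and keeping "/" unescaped in object names is an assumption of this sketch, not something the commit states.

```python
from urllib.parse import quote


def build_source_path(bucket, obj):
    """Build a request path with url-unsafe characters percent-encoded.

    Assumption: '/' inside object names is left as-is, while everything
    unsafe (newlines, spaces, ...) is escaped.
    """
    return "/" + quote(bucket, safe="") + "/" + quote(obj, safe="/")


if __name__ == "__main__":
    # A newline in the object name becomes %0A instead of breaking the request.
    print(build_source_path("bkt", "a\nb"))
```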
Josh Durgin [Thu, 24 Oct 2013 15:37:25 +0000 (08:37 -0700)]
rgw: move url escaping to a common place
This is useful outside of the s3 interface. Rename url_escape() to
url_encode() for consistency with the existing common url_decode()
function. This is in preparation for the next commit, which needs
to escape url-unsafe characters in another place.
Josh Durgin [Thu, 24 Oct 2013 15:34:24 +0000 (08:34 -0700)]
rgw: update metadata log list to match data log list
Send the last marker and whether the log is truncated in the same format
as data log list, so clients don't have more needless complexity
handling the difference. Keep bucket index logs the same, since they
contain the marker already, and are not used in exactly the same way
metadata and data logs are.
Josh Durgin [Thu, 24 Oct 2013 15:26:19 +0000 (08:26 -0700)]
rgw: include marker and truncated flag in data log list api
Consumers of this api need to know their position in the log. It's
readily available when fetching the log, so return it. Without the
marker in this call, a client could not easily or efficiently figure
out its position in the log, since it would require getting the global
last marker in the log, and then reading all the log entries.
This would be slow for large logs, and would be subject to races that
would cause potentially very expensive duplicate work.
Returning this atomically while fetching the log entries simplifies
all of this.
Josh Durgin [Thu, 24 Oct 2013 15:18:19 +0000 (08:18 -0700)]
cls_log: always return final marker from log_list
There's no reason to restrict returning the marker to the case where
less than the whole log is returned, since there's already a truncated
flag to tell the client what happened.
Giving the client the last marker makes it easy to consume when the
log entries do not contain their own marker. If the last marker is not
returned, the client cannot get the last marker without racing with
updates to the log.
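The behaviour argued for in the log-listing commits above can be sketched in Python. This is a hypothetical model, not the cls_log implementation: entries are (marker, payload) pairs sorted by marker, and from_marker is treated as exclusive.

```python
def log_list(entries, from_marker, max_entries):
    """Return (payloads, last_marker, truncated) in a single call.

    The last marker is always returned, whether or not the listing was
    truncated, so the caller learns its position atomically with the data.
    """
    later = [e for e in entries if e[0] > from_marker]
    page = later[:max_entries]
    truncated = len(later) > len(page)
    # With no new entries the caller's position is unchanged.
    last_marker = page[-1][0] if page else from_marker
    return [payload for _, payload in page], last_marker, truncated


if __name__ == "__main__":
    log = [("1", "a"), ("2", "b"), ("3", "c")]
    marker, truncated = "", True
    while truncated:
        items, marker, truncated = log_list(log, marker, 2)
        print(items, marker, truncated)
```

A consumer only has to store the returned marker and pass it back on the next call; it never needs a separate request for the global last marker.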
Yan, Zheng [Thu, 10 Oct 2013 02:35:48 +0000 (10:35 +0800)]
mds: fix infinite loop of MDCache::populate_mydir().
Make MDCache::populate_mydir() fetch only bare-bones stray dirs.
After all stray dirs are populated, call MDCache::scan_stray_dir(),
which fetches the incomplete stray dirs.
Fixes: #4405
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 007f06ec174d4ee5cfb578c8b3f1c96b2bb0c238)
Yehuda Sadeh [Tue, 15 Oct 2013 17:20:48 +0000 (10:20 -0700)]
rgw: fix authenticated users acl group check
Fixes: #6553
Backport: bobtail, cuttlefish, dumpling
Authenticated users group acl bit was not working correctly. Check to
test whether user is anonymous was wrong.
osdc/ObjectCacher: finish contexts after dropping object reference
The context to finish can be class C_Client_PutInode, which may drop
inode's last reference. So we should first drop object's reference,
then finish contexts.
Sandon Van Ness [Tue, 8 Oct 2013 19:08:08 +0000 (12:08 -0700)]
Go back to $PWD in fsstress.sh if compiling from source.
Although fsstress was being called with a static path the directory
it was writing to was in the current directory so doing a cd to the
source directory that is made in /tmp and then removing it later
caused it to be unable to write the files in a non-existent dir.
This change gets the current path first and cd's back into it after
it is done compiling fsstress.
Issue #6479.
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Reviewed-by: Alfredo Deza <alfredo.deza@inktank.com>
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 139a714e13aa3c7f42091270b55dde8a17b3c4b8)
mon: OSDMonitor: do not write full_latest during trim
On commit 81983bab we patched OSDMonitor::update_from_paxos() such that we
write the latest full map version to 'full_latest' each time the latest
full map was built from the incremental versions.
This change however clashed with OSDMonitor::encode_trim_extra(), which
also wrote to 'full_latest' on each trim, writing instead the version of
the *oldest* full map. This duality of behaviors could lead the store
to an inconsistent state across the monitors (although there's no sign of
it actually causing any issues besides rebuilding already existing full
maps on some monitors).
We now stop OSDMonitor::encode_trim_extra() from writing to 'full_latest'.
This function will still write out the oldest full map it has in the store,
but it will no longer write to full_latest, instead leaving it up to
OSDMonitor::update_from_paxos() to figure it out -- and it already does.
Fixes: #6378
Backport: dumpling
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bd0f29a2c28cca496ec830eac932477ebf3182ba)
We now put CEPH_ARGS in the actual args we parse in python, which are passed
to rados piecemeal later. This lets you put things like --id ... in there
that need to be parsed before librados is initialized.
(cherry picked from commit 97f462be4829f0167ed3d65e6694dfc16f1f3243)
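A hedged Python sketch of the CEPH_ARGS handling described above: fold the environment variable into the argument list before parsing, so that options such as --id are seen early. The helper effective_args(), the use of shlex, and placing the environment arguments first are assumptions of this sketch, not necessarily what the ceph tool does.

```python
import argparse
import os
import shlex


def effective_args(argv, environ=None):
    """Prepend the words of CEPH_ARGS (if set) to the command-line args."""
    environ = os.environ if environ is None else environ
    extra = shlex.split(environ.get("CEPH_ARGS", ""))
    return extra + list(argv)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--id")
    args = effective_args(["status"], {"CEPH_ARGS": "--id admin"})
    # Options needed before initialization are parsed here; the rest is
    # left over to be passed along later.
    known, rest = parser.parse_known_args(args)
    print(known.id, rest)
```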
David Zafman [Mon, 9 Sep 2013 20:01:12 +0000 (13:01 -0700)]
crushtool: do not dump core with non-unique bucket IDs
Return -EEXIST on duplicate ID
BUG FIX: crush_add_bucket() mixes error returns and IDs
Add optional argument to return generated ID
Fixes: #6246
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8c76f3a0f9cf100ea2c941dc2b61c470aa5033d7)
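Because CRUSH bucket IDs are negative, a function that returns either an ID or a negative error code is ambiguous. A Python sketch of the separated interface follows; add_bucket() and its dict-based bucket table are hypothetical stand-ins for the C function named above.

```python
import errno


def add_bucket(buckets, bucket, bucket_id=0):
    """Add a bucket; return (error, id) as two separate values.

    bucket_id == 0 asks for a generated ID. A duplicate explicit ID
    yields -EEXIST instead of a crash or an ambiguous return value.
    """
    if bucket_id == 0:
        bucket_id = -1
        while bucket_id in buckets:
            bucket_id -= 1
    elif bucket_id in buckets:
        return -errno.EEXIST, None
    buckets[bucket_id] = bucket
    return 0, bucket_id


if __name__ == "__main__":
    table = {}
    print(add_bucket(table, "root"))      # generated ID
    print(add_bucket(table, "dup", -1))   # duplicate ID is rejected
```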
perfglue/heap_profiler.cc: expect args as first element on cmd vector
We used to pass 'heap' as the first element of the cmd vector when
handling commands. We haven't been doing so for a while now, so we
needed to fix this.
Not expecting 'heap' also makes sense, considering that what we need to
know when we reach this function is what command we should handle, and
we should not care what the caller calls us when handling his business.
Fixes: #6176
Backport: dumpling
We take different code paths in copy_obj, so make sure we close the handle
when we exit the function. Move the call to finish_get_obj() out of
copy_obj_data() as we don't create the handle there, which should
make the code less confusing and less prone to errors.
Also, note that RGWRados::get_obj() also calls finish_get_obj(). For
everything to work in concert we need to pass a pointer to the handle
and not the handle itself. Therefore we needed to also change the call
to copy_obj_data().
mon: OSDMonitor: update latest_full while rebuilding full maps
Not doing so will make the monitor rebuild the osdmap full versions, even
though they may have been rebuilt before, every time the monitor starts.
This mostly happens when the cluster is left in an unhealthy state for
a long period of time and incremental versions build up. Even though we
build the full maps on update_from_paxos(), not updating 'full_latest'
leads to the situation initially described.
mon: OSDMonitor: smaller transactions when rebuilding full versions
Otherwise, for considerably sized rebuilds, the monitor will not only
consume vast amounts of memory, but it will also have trouble committing
the transaction. Anyway, it's also a good idea to adjust transactions to
the granularity we want, and to be fair we care that each rebuilt full map
gets to disk, even if subsequent full maps don't (those can be rebuilt
later).
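The two monitor commits above (updating the latest-full pointer while rebuilding, and committing in smaller transactions) can be pictured with this Python sketch. FakeStore, the key names and rebuild_full_maps() are all hypothetical; the real code operates on the monitor's store in C++.

```python
class FakeStore:
    """Toy key/value store that counts commits (illustration only)."""

    def __init__(self):
        self.data = {}
        self.commits = 0

    def commit(self, txn):
        self.data.update(txn)
        self.commits += 1


def rebuild_full_maps(store, first, last, build_full):
    """Rebuild full maps one epoch per transaction.

    Each transaction stays small, and every rebuilt map reaches the store
    together with an updated 'full_latest', so finished work is not redone.
    """
    for epoch in range(first, last + 1):
        txn = {"full_%d" % epoch: build_full(epoch), "full_latest": epoch}
        store.commit(txn)


if __name__ == "__main__":
    store = FakeStore()
    rebuild_full_maps(store, 5, 7, lambda e: "map%d" % e)
    print(store.commits, store.data["full_latest"])
```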
Fixes: #6175
Backport: dumpling
We get a buffer off the remote gateway which might
not be NULL terminated. The JSON parser needs the
buffer to be NULL terminated even though we provide
a buffer length as it calls strlen().
rgw: drain pending requests before completing write
Fixes: #6268
When doing aio write of objects (either regular or multipart parts) we
need to drain pending aio requests. Otherwise if gateway goes down then
object might end up corrupted.
Yehuda Sadeh [Thu, 22 Aug 2013 00:22:46 +0000 (17:22 -0700)]
rgw: OPTIONS request doesn't need to read object info
This is a bucket-only operation, so we shouldn't look at the
object. The object may not exist, in which case we might respond with a
Not Exists response, which is not what we want.
Samuel Just [Thu, 22 Aug 2013 18:19:37 +0000 (11:19 -0700)]
FileStore: add config option to disable the wbthrottle
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3528100a53724e7ae20766344e467bf762a34163)
Samuel Just [Thu, 22 Aug 2013 18:19:52 +0000 (11:19 -0700)]
WBThrottle: use fdatasync instead of fsync
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d571825080f0bff1ed3666e95e19b78a738ecfe8)
Samuel Just [Tue, 27 Aug 2013 06:19:45 +0000 (23:19 -0700)]
PGLog: move the log size check after the early return
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
Yehuda Sadeh [Fri, 23 Aug 2013 22:39:20 +0000 (15:39 -0700)]
rgw: flush pending data when completing multipart part upload
Fixes: #6111
Backport: dumpling
When completing the part upload we need to flush any data that we
aggregated and didn't flush yet. Earlier code didn't have to deal
with this, since for multipart uploads we didn't have any pending data.
What we do now is we call the regular atomic data completion
function that takes care of it.
When posting an object it is possible to provide a key
name that refers to the original filename, however we
need to verify that in the end we don't end up with an
empty object name.
Sage Weil [Fri, 23 Aug 2013 00:46:45 +0000 (17:46 -0700)]
mon/MonClient: release pending outgoing messages on shutdown
This fixes a small memory leak when we have messages queued for the mon
when we shut down. It is harmless except for the valgrind leak check
noise that obscures real leaks.
Yehuda Sadeh [Thu, 29 Aug 2013 20:06:33 +0000 (13:06 -0700)]
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Moving back the watch initialization after the zone init,
as the zone info holds the control pool name. Since zone
init might need to create a new system object (that needs
to distribute cache), don't try to distribute cache if
watch is not yet initialized.
Sage Weil [Wed, 28 Aug 2013 16:50:11 +0000 (09:50 -0700)]
mon: discover mon addrs, names during election state too
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form a quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
(cherry picked from commit c24028570015cacf1d9e154ffad80bec06a61e7c)
Dan Mick [Fri, 23 Aug 2013 00:30:24 +0000 (17:30 -0700)]
ceph_rest_api.py: create own default for log_file
common/config thinks the default log_file for non-daemons should be "".
Override that so that the default is
/var/log/ceph/{cluster}-{name}.{pid}.log
since ceph-rest-api is more of a daemon than a client.
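A minimal Python sketch of such an override, assuming a hypothetical helper default_log_file(); only the path template comes from the commit message above.

```python
import os

LOG_TEMPLATE = "/var/log/ceph/{cluster}-{name}.{pid}.log"


def default_log_file(cluster, name, pid=None, configured=""):
    """Return the configured log file, or a daemon-style default.

    An empty 'configured' value (the non-daemon default described above)
    is replaced by the per-process path.
    """
    if configured:
        return configured
    pid = os.getpid() if pid is None else pid
    return LOG_TEMPLATE.format(cluster=cluster, name=name, pid=pid)


if __name__ == "__main__":
    print(default_log_file("ceph", "client.restapi", pid=1234))
```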
Sage Weil [Sat, 17 Aug 2013 00:59:11 +0000 (17:59 -0700)]
ceph-post-file: single command to upload a file to cephdrop
Use sftp to upload to a directory that only this user and ceph devs can
access.
Distribute an ssh key to connect to the account. This will let us revoke
the key in the future if we feel the need. Also distribute a known_hosts
file so that users have some confidence that they are connecting to the
real ceph drop account and not some third party.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Sat, 24 Aug 2013 21:04:09 +0000 (14:04 -0700)]
osd: install admin socket commands after signals
This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly. See #5924.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c5b5ce120a8ce9116be52874dbbcc39adec48b5c)
Sage Weil [Wed, 21 Aug 2013 05:39:09 +0000 (22:39 -0700)]
ceph-disk: partprobe after creating journal partition
At least one user reports that a partprobe is needed after creating the
journal partition. It is not clear why sgdisk is not doing it, but this
fixes ceph-disk for them, and should be harmless for other users.