Josh Durgin [Thu, 24 Oct 2013 15:42:48 +0000 (08:42 -0700)]
rgw: escape bucket and object names in StreamReadRequests
This fixes copy operations for objects that contain unsafe characters,
like a newline, which would return a 403 otherwise, since the GET to
the source rgw would be unable to verify the signature on a partially
valid bucket name.
Josh Durgin [Thu, 24 Oct 2013 15:37:25 +0000 (08:37 -0700)]
rgw: move url escaping to a common place
This is useful outside of the s3 interface. Rename url_escape() to
url_encode() for consistency with the existing common url_decode()
function. This is in preparation for the next commit, which needs
to escape url-unsafe characters in another place.
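A rough sketch of what the shared helper might look like (the name and the exact set of unescaped characters are assumptions, not the actual rgw code):
    #include <cctype>
    #include <cstdio>
    #include <string>

    // Percent-encode anything that is not URL-safe; '/' is left alone so
    // object names keep their path structure.
    std::string url_encode(const std::string& src)
    {
      std::string out;
      for (unsigned char c : src) {
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~' || c == '/') {
          out.push_back(c);
        } else {
          char buf[4];
          snprintf(buf, sizeof(buf), "%%%02X", c);
          out.append(buf);
        }
      }
      return out;
    }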
Josh Durgin [Thu, 24 Oct 2013 15:34:24 +0000 (08:34 -0700)]
rgw: update metadata log list to match data log list
Send the last marker and whether the log is truncated in the same format
as the data log list, so clients don't have more needless complexity
handling the difference. Keep bucket index logs the same, since they
contain the marker already, and are not used in exactly the same way
metadata and data logs are.
Josh Durgin [Thu, 24 Oct 2013 15:26:19 +0000 (08:26 -0700)]
rgw: include marker and truncated flag in data log list api
Consumers of this api need to know their position in the log. It's
readily available when fetching the log, so return it. Without the
marker in this call, a client could not easily or efficiently figure
out its position in the log, since it would require getting the global
last marker in the log, and then reading all the log entries.
This would be slow for large logs, and would be subject to races that
would cause potentially very expensive duplicate work.
Returning this atomically while fetching the log entries simplifies
all of this.
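To illustrate why returning the marker with each batch matters, here is a hypothetical consumer loop; the result struct, fetch call and handler are placeholders, not the radosgw-agent code:
    #include <string>
    #include <vector>

    struct log_list_result {
      std::vector<std::string> entries;
      std::string marker;   // position of the last entry returned
      bool truncated;       // true if more entries remain
    };

    // Stand-ins for the actual REST call and per-entry handling.
    log_list_result fetch_data_log(const std::string& from_marker);
    void process(const std::string& entry);

    void consume_all()
    {
      std::string pos;      // empty marker == start of the log
      bool more = true;
      while (more) {
        log_list_result r = fetch_data_log(pos);
        for (const auto& e : r.entries)
          process(e);
        pos = r.marker;     // resume point returned atomically with the batch
        more = r.truncated;
      }
    }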
Josh Durgin [Thu, 24 Oct 2013 15:18:19 +0000 (08:18 -0700)]
cls_log: always return final marker from log_list
There's no reason to restrict returning the marker to the case where
less than the whole log is returned, since there's already a truncated
flag to tell the client what happened.
Giving the client the last marker makes it easy to consume when the
log entries do not contain their own marker. If the last marker is not
returned, the client cannot get the last marker without racing with
updates to the log.
Yan, Zheng [Thu, 10 Oct 2013 02:35:48 +0000 (10:35 +0800)]
mds: fix infinite loop of MDCache::populate_mydir().
Make MDCache::populate_mydir() only fetch bare-bone stray dirs.
After all stray dirs are populated, call MDCache::scan_stray_dir(),
which fetches the incomplete stray dirs.
Fixes: #4405 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 007f06ec174d4ee5cfb578c8b3f1c96b2bb0c238)
Yehuda Sadeh [Tue, 15 Oct 2013 17:20:48 +0000 (10:20 -0700)]
rgw: fix authenticated users acl group check
Fixes: #6553
Backport: bobtail, cuttlefish, dumpling
The authenticated users group ACL bit was not working correctly: the
check to test whether a user is anonymous was wrong.
osdc/ObjectCacher: finish contexts after dropping object reference
The context to finish can be of class C_Client_PutInode, which may drop
the inode's last reference. So we should first drop the object's
reference, then finish the contexts.
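A minimal sketch of the ordering with simplified types (not the actual ObjectCacher code):
    #include <functional>
    #include <memory>
    #include <vector>

    struct Object { /* stand-in for the cached object */ };
    using ObjectRef = std::shared_ptr<Object>;

    // Running a completion may drop an inode's last reference, so the
    // object's own reference is released before the contexts are finished.
    void finish_waiters(ObjectRef obj,
                        std::vector<std::function<void(int)>> waiters, int r)
    {
      obj.reset();               // drop the object reference first
      for (auto& fin : waiters)  // then run the completion contexts
        fin(r);
    }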
Sandon Van Ness [Tue, 8 Oct 2013 19:08:08 +0000 (12:08 -0700)]
Go back to $PWD in fsstress.sh if compiling from source.
Although fsstress was being called with a static path, the directory
it was writing to was the current directory, so cd'ing to the source
directory made in /tmp and then removing it later left it unable to
write its files in a non-existent dir.
This change gets the current path first and cd's back into it after
it is done compiling fsstress.
Issue #6479.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Alfredo Deza <alfredo.deza@inktank.com>
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 139a714e13aa3c7f42091270b55dde8a17b3c4b8)
mon: OSDMonitor: do not write full_latest during trim
On commit 81983bab we patched OSDMonitor::update_from_paxos() such that we
write the latest full map version to 'full_latest' each time the latest
full map was built from the incremental versions.
This change, however, clashed with OSDMonitor::encode_trim_extra(), which
also wrote to 'full_latest' on each trim, writing instead the version of
the *oldest* full map. This duality of behaviors could lead the store
to an inconsistent state across the monitors (although there's no sign of
it actually causing any issues besides rebuilding already existing full
maps on some monitors).
We now stop OSDMonitor::encode_trim_extra() from writing to 'full_latest'.
This function will still write out the oldest full map it has in the store,
but it will no longer write to full_latest, instead leaving it up to
OSDMonitor::update_from_paxos() to figure it out -- and it already does.
Fixes: #6378
Backport: dumpling
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bd0f29a2c28cca496ec830eac932477ebf3182ba)
We now put CEPH_ARGS in the actual args we parse in python, which are passed
to rados piecemeal later. This lets you put things like --id ... in there
that need to be parsed before librados is initialized.
(cherry picked from commit 97f462be4829f0167ed3d65e6694dfc16f1f3243)
David Zafman [Mon, 9 Sep 2013 20:01:12 +0000 (13:01 -0700)]
crushtool: do not dump core with non-unique bucket IDs
Return -EEXIST on duplicate ID
BUG FIX: crush_add_bucket() mixes error returns and IDs
Add optional argument to return generated ID
Fixes: #6246 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8c76f3a0f9cf100ea2c941dc2b61c470aa5033d7)
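A hedged sketch of the interface shape described above, with illustrative names rather than the exact CRUSH API:
    #include <cerrno>
    #include <map>

    // Errors go through the return value; the generated id comes back through
    // an optional output parameter, instead of overloading one int for both.
    int add_bucket(std::map<int, int>& buckets, int requested_id, int* assigned_id)
    {
      int id = requested_id;
      if (id == 0) {
        id = -1;                 // auto-assign the next free (negative) id
        while (buckets.count(id))
          --id;
      } else if (buckets.count(id)) {
        return -EEXIST;          // duplicate id is an error, never an id
      }
      buckets[id] = 0;           // placeholder payload
      if (assigned_id)
        *assigned_id = id;
      return 0;
    }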
perfglue/heap_profiler.cc: expect args as first element on cmd vector
We used to pass 'heap' as the first element of the cmd vector when
handling commands. We haven't been doing so for a while now, so we
needed to fix this.
Not expecting 'heap' also makes sense, considering that what we need to
know when we reach this function is which command we should handle; we
should not care what name the caller used to invoke us.
Fixes: #6176
Backport: dumpling
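A minimal sketch of the resulting dispatch; the sub-command names are illustrative:
    #include <ostream>
    #include <string>
    #include <vector>

    // The command vector now starts directly with the sub-command, so we
    // dispatch on cmd[0] instead of assuming cmd[0] == "heap".
    void handle_heap_command(const std::vector<std::string>& cmd, std::ostream& out)
    {
      if (cmd.empty()) {
        out << "missing heap command\n";
        return;
      }
      const std::string& sub = cmd[0];   // previously cmd[1]
      if (sub == "dump")
        out << "dumping heap profile\n";
      else if (sub == "start_profiler")
        out << "starting profiler\n";
      else
        out << "unknown heap command: " << sub << "\n";
    }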
We take different code paths in copy_obj; make sure we close the handle
when we exit the function. Move the call to finish_get_obj() out of
copy_obj_data(), as we don't create the handle there; that should
make the code less confusing and less prone to errors.
Also, note that RGWRados::get_obj() also calls finish_get_obj(). For
everything to work in concert we need to pass a pointer to the handle
and not the handle itself. Therefore we needed to also change the call
to copy_obj_data().
mon: OSDMonitor: update latest_full while rebuilding full maps
Not doing so will make the monitor rebuild the osdmap full versions, even
though they may have been rebuilt before, every time the monitor starts.
This mostly happens when the cluster is left in an unhealthy state for
a long period of time and incremental versions build up. Even though we
build the full maps on update_from_paxos(), not updating 'full_latest'
leads to the situation initially described.
mon: OSDMonitor: smaller transactions when rebuilding full versions
Otherwise, for considerably sized rebuilds, the monitor will not only
consume vast amounts of memory, but it will also have trouble committing
the transaction. Anyway, it's also a good idea to adjust transactions to
the granularity we want, and to be fair we care that each rebuilt full map
gets to disk, even if subsequent full maps don't (those can be rebuilt
later).
Fixes: #6175
Backport: dumpling
We get a buffer off the remote gateway which might not be
NULL-terminated. The JSON parser needs the buffer to be NULL-terminated,
even though we provide a buffer length, as it calls strlen().
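A sketch of the workaround; parse_json() here is a hypothetical stand-in for the parser entry point:
    #include <cstddef>
    #include <string>

    bool parse_json(const char* text);   // hypothetical parser entry point

    // The parser walks the buffer with strlen(), so a buffer received off the
    // wire must be NUL-terminated before it is handed over, even when its
    // length is already known.
    bool parse_reply(const char* data, size_t len)
    {
      std::string json(data, len);       // the copy guarantees a trailing '\0'
      return parse_json(json.c_str());
    }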
rgw: drain pending requests before completing write
Fixes: #6268
When doing aio write of objects (either regular or multipart parts) we
need to drain pending aio requests. Otherwise, if the gateway goes down,
the object might end up corrupted.
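A hedged sketch of the ordering, with an illustrative interface rather than the actual rgw aio throttle:
    // Wait for all in-flight aio writes to land before marking the object (or
    // multipart part) complete, so a crash cannot leave a "complete" object
    // with unwritten tails.
    struct PendingAio {
      virtual int drain() = 0;   // blocks until every queued aio has finished
      virtual ~PendingAio() = default;
    };

    int complete_write(PendingAio& pending)
    {
      int r = pending.drain();   // drain first
      if (r < 0)
        return r;                // propagate the aio error instead of completing
      // ... only now write the head / completion marker ...
      return 0;
    }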
Yehuda Sadeh [Thu, 22 Aug 2013 00:22:46 +0000 (17:22 -0700)]
rgw: OPTIONS request doesn't need to read object info
This is a bucket-only operation, so we shouldn't look at the
object. The object may not exist, and we might respond with a Not
Exists response, which is not what we want.
Samuel Just [Thu, 22 Aug 2013 18:19:37 +0000 (11:19 -0700)]
FileStore: add config option to disable the wbthrottle
Backport: dumpling Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3528100a53724e7ae20766344e467bf762a34163)
Samuel Just [Thu, 22 Aug 2013 18:19:52 +0000 (11:19 -0700)]
WBThrottle: use fdatasync instead of fsync
Backport: dumpling Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d571825080f0bff1ed3666e95e19b78a738ecfe8)
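The effect of the change, as a sketch:
    #include <unistd.h>

    // fdatasync(2) flushes the file data and only the metadata needed to
    // retrieve it, skipping the extra inode-metadata flush that fsync(2)
    // forces; that is sufficient for the write-back throttler.
    int flush_fd(int fd)
    {
      return ::fdatasync(fd);   // previously ::fsync(fd)
    }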
Samuel Just [Tue, 27 Aug 2013 06:19:45 +0000 (23:19 -0700)]
PGLog: move the log size check after the early return
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
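The shape of the reordering, simplified (names are illustrative, not the PGLog code):
    #include <cassert>
    #include <list>

    // On some STL implementations std::list::size() is O(n), so the
    // consistency assert is only evaluated once the cheap early-return path
    // has been ruled out.
    void check_and_trim(std::list<int>& log, size_t tracked_size, bool nothing_to_do)
    {
      if (nothing_to_do)
        return;                           // early return no longer pays for size()
      assert(log.size() == tracked_size); // potentially linear-time check
      // ... trimming would happen here ...
    }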
Yehuda Sadeh [Fri, 23 Aug 2013 22:39:20 +0000 (15:39 -0700)]
rgw: flush pending data when completing multipart part upload
Fixes: #6111
Backport: dumpling
When completing the part upload we need to flush any data that we
aggregated and didn't flush yet. With the earlier code we didn't have
to deal with it, as for multipart upload we didn't have any pending data.
What we do now is we call the regular atomic data completion
function that takes care of it.
When posting an object it is possible to provide a key
name that refers to the original filename; however, we
need to verify that in the end we don't end up with an
empty object name.
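A sketch of the check, assuming the usual ${filename} substitution; the names are illustrative:
    #include <string>

    // After substituting the uploaded file's name into the key template,
    // reject the request if the resulting object name is empty.
    bool valid_post_object_name(std::string key, const std::string& filename)
    {
      const std::string token = "${filename}";
      std::string::size_type pos = key.find(token);
      if (pos != std::string::npos)
        key.replace(pos, token.size(), filename);
      return !key.empty();
    }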
Sage Weil [Fri, 23 Aug 2013 00:46:45 +0000 (17:46 -0700)]
mon/MonClient: release pending outgoing messages on shutdown
This fixes a small memory leak when we have messages queued for the mon
when we shut down. It is harmless except for the valgrind leak check
noise that obscures real leaks.
Yehuda Sadeh [Thu, 29 Aug 2013 20:06:33 +0000 (13:06 -0700)]
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Moving back the watch initialization after the zone init,
as the zone info holds the control pool name. Since zone
init might need to create a new system object (that needs
to distribute cache), don't try to distribute cache if
watch is not yet initialized.
Sage Weil [Wed, 28 Aug 2013 16:50:11 +0000 (09:50 -0700)]
mon: discover mon addrs, names during election state too
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form a quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
(cherry picked from commit c24028570015cacf1d9e154ffad80bec06a61e7c)
Dan Mick [Fri, 23 Aug 2013 00:30:24 +0000 (17:30 -0700)]
ceph_rest_api.py: create own default for log_file
common/config thinks the default log_file for non-daemons should be "".
Override that so that the default is
/var/log/ceph/{cluster}-{name}.{pid}.log
since ceph-rest-api is more of a daemon than a client.
Sage Weil [Sat, 17 Aug 2013 00:59:11 +0000 (17:59 -0700)]
ceph-post-file: single command to upload a file to cephdrop
Use sftp to upload to a directory that only this user and ceph devs can
access.
Distribute an ssh key to connect to the account. This will let us revoke
the key in the future if we feel the need. Also distribute a known_hosts
file so that users have some confidence that they are connecting to the
real ceph drop account and not some third party.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Sat, 24 Aug 2013 21:04:09 +0000 (14:04 -0700)]
osd: install admin socket commands after signals
This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly. See #5924.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c5b5ce120a8ce9116be52874dbbcc39adec48b5c)
Sage Weil [Wed, 21 Aug 2013 05:39:09 +0000 (22:39 -0700)]
ceph-disk: partprobe after creating journal partition
At least one user reports that a partprobe is needed after creating the
journal partition. It is not clear why sgdisk is not doing it, but this
fixes ceph-disk for them, and should be harmless for other users.
Sage Weil [Tue, 20 Aug 2013 18:27:23 +0000 (11:27 -0700)]
mon/Paxos: always refresh after any store_state
If we store any new state, we need to refresh the services, even if we
are still in the midst of Paxos recovery. This is because the
subscription path will share any committed state even when paxos is
still recovering. This prevents a race like:
- we have maps 10..20
- we drop out of quorum
- we are elected leader, paxos recovery starts
- we get one LAST with committed states that trim maps 10..15
- we get a subscribe for map 10..20
- we crash because 10 is no longer on disk because the PaxosService
is out of sync with the on-disk state.
Fixes: #6045
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 981eda9f7787c83dc457f061452685f499e7dd27)
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 7e0848d8f88f156a05eef47a9f730b772b64fbf2)
Sage Weil [Tue, 20 Aug 2013 18:26:57 +0000 (11:26 -0700)]
mon/Paxos: cleanup: use do_refresh from handle_commit
This avoids duplicated code by using the helper created exactly for this
purpose.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit b9dee2285d9fe8533fa98c940d5af7b0b81f3d33)
Sage Weil [Fri, 16 Aug 2013 04:48:06 +0000 (21:48 -0700)]
osdc/ObjectCacher: do not merge rx buffers
We do not try to merge rx buffers currently. Make that explicit and
documented in the code that it is not supported. (Otherwise the
last_read_tid values will get lost and read results won't get applied
to the cache properly.)
Sage Weil [Fri, 16 Aug 2013 04:47:18 +0000 (21:47 -0700)]
osdc/ObjectCacher: match reads with their original rx buffers
Consider a sequence like:
1- start read on 100~200
100~200 state rx
2- truncate to 200
100~100 state rx
3- start read on 200~200
100~100 state rx
200~200 state rx
4- get 100~200 read result
This currently crashes on an assert when processing the second 200~200
bufferhead (it is too big). The
larger issue, though, is that we should not be looking at this data at
all; it has been truncated away.
Fix this by marking each rx buffer with the read request that is sent to
fill it, and only fill it from that read request. Then the first reply
will fill the first 100~100 extent but not touch the other extent; the
second read will do that.
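In simplified form (this is not the real BufferHead, just the idea):
    #include <cstdint>

    // Each rx buffer remembers which read request was sent to fill it, and a
    // reply is only applied to buffers carrying that request's tid.
    struct RxBuffer {
      uint64_t offset;
      uint64_t length;
      uint64_t last_read_tid;   // read request this buffer is waiting on
    };

    bool reply_fills(const RxBuffer& bh, uint64_t reply_tid)
    {
      return bh.last_read_tid == reply_tid;  // ignore data from older reads
    }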
Sage Weil [Thu, 22 Aug 2013 22:54:48 +0000 (15:54 -0700)]
mon/Paxos: fix another uncommitted value corner case
It is possible that we begin the paxos recovery with an uncommitted
value for, say, commit 100. During last/collect we discover 100 has been
committed already. But also, another node provides an uncommitted value
for 101 with the same pn. Currently, we refuse to learn it, because the
pn is not strictly > than our current uncommitted pn... even though it is
the next last_committed+1 value that we need.
There are two possible fixes here:
- make this a >= as we can accept newer values from the same pn.
- discard our uncommitted value metadata when we commit the value.
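The first option amounts to relaxing the comparison; roughly (illustrative, not the actual Paxos code):
    #include <cstdint>

    // Accept a peer's uncommitted value if it is the next value we need and
    // its proposal number is at least as new as the one we already hold
    // (>=, not strictly >).
    bool accept_uncommitted(uint64_t last_committed, uint64_t peer_version,
                            uint64_t peer_pn, uint64_t our_uncommitted_pn)
    {
      return peer_version == last_committed + 1 &&
             peer_pn >= our_uncommitted_pn;   // was: peer_pn > our_uncommitted_pn
    }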
Yehuda Sadeh [Mon, 19 Aug 2013 15:40:16 +0000 (08:40 -0700)]
rgw: change cache / watch-notify init sequence
Fixes: #6046
We were initializing the watch-notify (through the cache
init) before reading the zone info which was much too
early, as we didn't have the control pool name yet. Now
simplifying init/cleanup a bit, cache doesn't call watch/notify
init and cleanup directly, but rather states its need
through a virtual callback.
Alexandre Oliva [Thu, 22 Aug 2013 06:40:22 +0000 (03:40 -0300)]
enable mds rejoin with active inodes' old parent xattrs
When the parent xattrs of active inodes that the mds attempts to open
during rejoin lack pool info (struct_v < 5), this field will be filled
in with -1, causing the mds to retry fetching a backtrace with a pool
number that matches the expected value, which fails and causes the
err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
with pool -1, and so keeps on bouncing between the two retry cases
forever.
This patch arranges for the mds to go along with pool -1 instead of
insisting that it be refetched, enabling it to complete recovery
instead of eating cpu, network bandwidth and metadata osd's resources
like there's no tomorrow, in what AFAICT is an infinite and very busy
loop.
This is not a new problem: I've had it even before upgrading from
Cuttlefish to Dumpling, I'd just never managed to track it down, and
force-unmounting the filesystem and then restarting the mds was an
easier (if inconvenient) work-around, particularly because it always
hit when the filesystem was under active, heavy-ish use (or there
wouldn't be much reason for caps recovery ;-)
There are two issues not addressed in this patch, however. One is
that nothing seems to proactively update the parent xattr when it is
found to be outdated, so it remains out of date forever. Not even
renaming top-level directories causes the xattrs to be recursively
rewritten. AFAICT that's a bug.
The other is that inodes that don't have a parent xattr (created by
even older versions of ceph) are reported as non-existing in the mds
rejoin message, because the absence of the parent xattr is signaled as
a missing inode ("failed to reconnect caps for missing inodes"). I
suppose this may cause more serious recovery problems.
I suppose a global pass over the filesystem tree updating parent
xattrs that are out-of-date would be desirable, if we find any parent
xattrs still lacking current information; it might make sense to
activate it as a background thread from the backtrace decoding
function, when it finds a parent xattr that's too out-of-date, or as a
separate client (ceph-fsck?).
Backport: dumpling, cuttlefish Signed-off-by: Alexandre Oliva <oliva@gnu.org> Reviewed-by: Zheng, Yan <zheng.z.yan@intel.com>
(cherry picked from commit 617dc36d477fd83b2d45034fe6311413aa1866df)
David Disseldorp [Mon, 29 Jul 2013 15:05:44 +0000 (17:05 +0200)]
mds: remove waiting lock before merging with neighbours
CephFS currently deadlocks under CTDB's ping_pong POSIX locking test
when run concurrently on multiple nodes.
The deadlock is caused by failed removal of a waiting_locks entry when
the waiting lock is merged with an existing lock. For example, a waiting
4116@1:1 lock entry is merged with an existing 4116@0:1 held lock to
become a 4116@0:2 held lock; however, the now-handled 4116@1:1
waiting_locks entry remains.
When handling a lock request, the MDS calls adjust_locks() to merge
the new lock with available neighbours. If the new lock is merged,
then the waiting_locks entry is not located in the subsequent
remove_waiting() call because adjust_locks changed the new lock to
include the old locks.
This fix ensures that the waiting_locks entry is removed prior to
modification during merge.
Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 476e4902907dfadb3709ba820453299ececf990b)
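A hedged sketch of that ordering, with illustrative types rather than the actual MDS flock code:
    #include <cstdint>
    #include <map>

    struct LockRange { uint64_t start; uint64_t length; };

    void merge_with_neighbours(LockRange& l);   // may rewrite l.start / l.length

    // The waiting_locks entry is erased while its key still matches the
    // request, before merging can change the lock's range.
    void grant_lock(std::multimap<uint64_t, LockRange>& waiting_locks,
                    uint64_t wait_key, LockRange l)
    {
      waiting_locks.erase(wait_key);   // remove the waiting entry first...
      merge_with_neighbours(l);        // ...then merge with held neighbours
    }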