mon: OSDMonitor: do not write full_latest during trim
In commit 81983bab we patched OSDMonitor::update_from_paxos() so that we
write the latest full map version to 'full_latest' each time a full map
is built from the incremental versions.
This change, however, clashed with OSDMonitor::encode_trim_extra(), which
also wrote to 'full_latest' on each trim, but with the version of the
*oldest* full map instead. These two conflicting behaviors could leave the
store in an inconsistent state across the monitors (although there is no
sign of it actually causing any issues besides rebuilding already existing
full maps on some monitors).
We now stop OSDMonitor::encode_trim_extra() from writing to 'full_latest'.
This function will still write out the oldest full map it has in the store,
but it will no longer write to full_latest, instead leaving it up to
OSDMonitor::update_from_paxos() to figure it out -- and it already does.
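As a rough sketch of the resulting split of responsibilities (Transaction and
the store helpers below are stand-ins, not the actual monitor store API):

    #include <cstdint>
    #include <string>

    struct Transaction;                                    // opaque store transaction
    std::string get_version_full(uint64_t ver);            // read a stored full map
    void put_version_full(Transaction *t, uint64_t ver,
                          const std::string &full_map);    // write osdmap full:<ver>
    void put_latest_full(Transaction *t, uint64_t ver);    // no longer called from trim

    void encode_trim_extra(Transaction *t, uint64_t first)
    {
      // Still keep the oldest surviving full map around after the trim...
      put_version_full(t, first, get_version_full(first));
      // ...but do not touch 'full_latest' here anymore; update_from_paxos() is
      // now the only writer of that key and sets it when it builds full maps.
    }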
Fixes: #6378
Backport: dumpling
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Fixes: #6176
Backport: dumpling
We take different code paths in copy_obj, so make sure we close the handle
when we exit the function. Move the call to finish_get_obj() out of
copy_obj_data(), as we don't create the handle there; that should make the
code less confusing and less prone to errors.
Also, note that RGWRados::get_obj() also calls finish_get_obj(). For
everything to work in concert we need to pass a pointer to the handle
and not the handle itself. Therefore we needed to also change the call
to copy_obj_data().
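A minimal sketch of the ownership rule described above, with stand-in types
rather than the real rgw signatures:

    struct GetObjHandle;                        // stand-in for the rgw get_obj handle
    int get_obj(GetObjHandle **handle);         // creates/advances *handle
    void finish_get_obj(GetObjHandle **handle); // releases and clears *handle
    int copy_obj_data(GetObjHandle **handle);   // uses the handle, never finishes it

    int copy_obj()
    {
      GetObjHandle *handle = nullptr;
      int r = get_obj(&handle);
      if (r >= 0)
        r = copy_obj_data(&handle);   // takes a pointer to the handle, as above
      finish_get_obj(&handle);        // the creator closes it on every exit path
      return r;
    }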
perfglue/heap_profiler.cc: expect args as first element on cmd vector
We used to pass 'heap' as the first element of the cmd vector when
handling commands. We haven't been doing so for a while now, so we
needed to fix this.
Not expecting 'heap' also makes sense: all this function needs to know
when it is reached is which command it should handle, and it should not
care what name the caller used to invoke it.
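Roughly, assuming a plain vector of command tokens (the sub-command names
here are only illustrative):

    #include <string>
    #include <vector>

    void handle_heap_command(const std::vector<std::string> &cmd)
    {
      if (cmd.empty())
        return;
      // cmd[0] is the sub-command itself; there is no leading "heap" token anymore.
      const std::string &what = cmd[0];
      if (what == "dump") {
        // dump the heap profile
      } else if (what == "stats") {
        // print allocator stats
      }
    }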
Fixes: #6361
Backport: dumpling
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
mon: OSDMonitor: update latest_full while rebuilding full maps
Not doing so makes the monitor rebuild the full osdmap versions every time
it starts, even though they may already have been rebuilt. This mostly
happens when the cluster is left in an unhealthy state for a long period
of time and incremental versions build up. Even though we build the full
maps in update_from_paxos(), not updating 'full_latest' leads to the
situation described above.
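A hedged sketch of the idea, with made-up helper names: record how far the
rebuild got so the next start does not redo it.

    #include <cstdint>

    void write_full_map(uint64_t ver);      // hypothetical: store full map <ver>
    void write_latest_full(uint64_t ver);   // hypothetical: store 'full_latest'

    void rebuild_full_maps(uint64_t from, uint64_t to)
    {
      for (uint64_t v = from; v <= to; ++v)
        write_full_map(v);
      write_latest_full(to);   // without this marker the next start rebuilds again
    }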
Fixes: #6322
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: OSDMonitor: smaller transactions when rebuilding full versions
Otherwise, for sizeable rebuilds, the monitor will not only consume vast
amounts of memory, but will also have trouble committing the transaction.
It is also a good idea to adjust transactions to the granularity we want:
we care that each rebuilt full map gets to disk, even if subsequent full
maps don't (those can be rebuilt later).
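Sketched with stand-in store helpers, the granularity change amounts to
committing one transaction per rebuilt map rather than one for the whole
rebuild:

    #include <cstdint>

    struct Transaction;                                 // stand-in store transaction
    Transaction *new_transaction();
    void encode_full_map(Transaction *t, uint64_t ver);
    void commit(Transaction *t);

    void rebuild_in_small_transactions(uint64_t from, uint64_t to)
    {
      for (uint64_t v = from; v <= to; ++v) {
        Transaction *t = new_transaction();
        encode_full_map(t, v);   // just this epoch's rebuilt full map
        commit(t);               // bounded memory; maps already committed stay on disk
      }
    }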
Fixes: #6323
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Fixes: #6175
Backport: dumpling
We get a buffer off the remote gateway which might not be NULL
terminated. The JSON parser needs the buffer to be NULL terminated,
because it calls strlen(), even though we provide a buffer length.
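For illustration only, one simple way to satisfy a strlen()-based parser is
to copy the length-delimited buffer into storage that guarantees a trailing
NUL:

    #include <cstddef>
    #include <string>

    // 'buf' is the response body as received, 'len' its explicit length; the
    // returned string's c_str() is guaranteed to be NUL terminated.
    std::string make_parser_safe(const char *buf, size_t len)
    {
      return std::string(buf, len);
    }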
rgw: drain pending requests before completing write
Fixes: #6268
When doing aio writes of objects (either regular or multipart parts) we
need to drain pending aio requests; otherwise, if the gateway goes down,
the object might end up corrupted.
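A sketch of the drain step; PendingAio and its completion call are
hypothetical stand-ins for the gateway's aio bookkeeping:

    #include <list>

    struct PendingAio {
      int wait_for_completion() { /* block until this aio write lands */ return 0; }
    };

    int drain_pending(std::list<PendingAio> &pending)
    {
      int ret = 0;
      while (!pending.empty()) {
        int r = pending.front().wait_for_completion();
        if (r < 0 && ret == 0)
          ret = r;                 // remember the first failure
        pending.pop_front();
      }
      return ret;                  // only declare the write complete once this is 0
    }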
rgw: when failing read from client, return correct error
Fixes: #6214
When a read from the client failed while putting an object, we returned
the wrong value (always 0), which in the chunked-upload case led to the
write being treated as successful.
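A minimal illustration of the return-value fix (function names invented,
not the actual rgw code):

    #include <cstddef>
    #include <sys/types.h>

    ssize_t read_client_chunk(char *buf, size_t len);   // < 0 on failure
    int write_chunk(const char *buf, size_t len);       // hypothetical data sink

    int put_object_chunk(char *buf, size_t len)
    {
      ssize_t got = read_client_chunk(buf, len);
      if (got < 0)
        return (int)got;   // used to be 'return 0', which let a chunked upload
                           // look like it had completed successfully
      return write_chunk(buf, (size_t)got);
    }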
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Fri, 23 Aug 2013 22:39:20 +0000 (15:39 -0700)]
rgw: flush pending data when completing multipart part upload
Fixes: #6111
Backport: dumpling
When completing the part upload we need to flush any data that we
aggregated but didn't flush yet. Earlier code didn't have to deal with
this, as multipart uploads never had any pending data. We now call the
regular atomic data completion function, which takes care of it.
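As a sketch with invented types, the part completion now mirrors the
regular atomic write completion:

    #include <string>

    struct PartWriter {
      std::string pending;                 // aggregated data not yet written out

      int flush_pending() {
        // write out 'pending' ... (omitted)
        pending.clear();
        return 0;
      }

      int complete_part() {
        int r = flush_pending();           // previously skipped for multipart parts
        if (r < 0)
          return r;
        // finalize the part (manifest, etc.)
        return 0;
      }
    };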
When posting an object it is possible to provide a key name that refers
to the original filename; however, we need to verify that we don't end up
with an empty object name in the end.
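A sketch of the check, assuming the usual S3 POST ${filename} substitution
(the helper is invented for illustration):

    #include <string>

    // Expand "${filename}" in the posted key and reject an empty final name.
    bool build_object_name(std::string key, const std::string &filename,
                           std::string *out)
    {
      const std::string token = "${filename}";
      std::string::size_type pos = key.find(token);
      if (pos != std::string::npos)
        key.replace(pos, token.size(), filename);
      if (key.empty())
        return false;   // e.g. the key was just "${filename}" and no name came in
      *out = key;
      return true;
    }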
Yehuda Sadeh [Thu, 22 Aug 2013 00:22:46 +0000 (17:22 -0700)]
rgw: OPTIONS request doesn't need to read object info
This is a bucket-only operation, so we shouldn't look at the object. The
object may not exist, and we might respond with a Not Exists response,
which is not what we want.
Sage Weil [Wed, 28 Aug 2013 22:04:16 +0000 (15:04 -0700)]
osd: initial COPY_FROM (not viable for large objects)
Initial pass at a COPY_FROM implementation. This uses COPY_GET to read an
object from another OSD and write it locally. It chunks the read but
accumulates everything in memory and commits it at once, so it is only
suitable for smaller objects.
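Schematically, with stand-ins for the OSD-to-OSD plumbing, the flow is:

    #include <string>

    int copy_get_next_chunk(std::string *chunk);   // >0 bytes, 0 at end, <0 on error
    int commit_whole_object(const std::string &data);

    int do_copy_from()
    {
      std::string data;                  // the whole object is buffered in memory,
                                         // hence "not viable for large objects"
      for (;;) {
        std::string chunk;
        int r = copy_get_next_chunk(&chunk);
        if (r < 0)
          return r;
        if (r == 0)
          break;
        data.append(chunk);
      }
      return commit_whole_object(data);  // single local commit at the end
    }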
Sage Weil [Mon, 26 Aug 2013 23:24:16 +0000 (16:24 -0700)]
objecter, librados: add COPY_FROM operation
This operation will copy an entire object (data, attrs, omap)
atomically. If the src_version does not match the source object, or
the source object is updated while the copy is in progress, we will
fail with a suitable error code. By atomic we mean that it will either
copy the object in its entirety or it will fail (and require no
cleanup).
Add to C++ librados API only for now.
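The version guard, sketched with stand-in arguments; the error code below is
only a placeholder for "a suitable error code":

    #include <cerrno>
    #include <cstdint>

    // If the source moved past the version the caller pinned, fail cleanly
    // rather than produce a partial copy.
    int check_copy_source(uint64_t pinned_src_version, uint64_t current_src_version)
    {
      if (pinned_src_version && pinned_src_version != current_src_version)
        return -ERANGE;   // placeholder error code
      return 0;
    }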
Signed-off-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Thu, 29 Aug 2013 20:06:33 +0000 (13:06 -0700)]
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Move the watch initialization back to after the zone init, as the zone
info holds the control pool name. Since zone init might need to create a
new system object (which needs to distribute the cache), don't try to
distribute the cache if the watch is not yet initialized.
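Sketched with invented names, the ordering and the guard look like this:

    struct GatewayInit {
      bool watch_initialized = false;

      int init_zone()  { /* read zone info; may create a system object */ return 0; }
      int init_watch() { /* needs the control pool name from the zone info */ return 0; }

      int init() {
        int r = init_zone();              // zone first: it holds the control pool name
        if (r < 0)
          return r;
        r = init_watch();                 // watch moved back to after zone init
        if (r == 0)
          watch_initialized = true;
        return r;
      }

      void distribute_cache() {
        if (!watch_initialized)
          return;                         // can't notify peers yet; skip quietly
        // ... send the cache notification over the control pool ...
      }
    };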
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
osd: provide better version bounds for cls_current_version and ENOENT replies
Following the changes to when we set or increase the user_version, we
want to continue to return the best lower bound we can on the version
of any newly-created object. For ENOENT replies that means returning
info.last_user_version instead of the (potentially-zero) ctx->user_at_version.
Similarly, for cls_current_version we want to return the last version on
the PG rather than the last update to the object in order to provide
sensible version ordering across object deletes and creates.
Update the versions doc so it continues to be precise.
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 31 Aug 2013 00:15:56 +0000 (17:15 -0700)]
osd/PG: only raise PG's last_user_version if entry is >
We may have pg entries that do not increase the user_version at all (i.e.,
they may be 0). Do not update the last_user_version in that case as we
need it to remain an upper bound.
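The guard is essentially a strict comparison (field and helper names assumed):

    #include <cstdint>

    void note_log_entry_version(uint64_t entry_user_version,
                                uint64_t *last_user_version)
    {
      // Entries that did not bump the user version carry 0 here; never let
      // them pull last_user_version backwards.
      if (entry_user_version > *last_user_version)
        *last_user_version = entry_user_version;
    }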
- Added config option to allow S3 to use Keystone auth
- Implemented JSONDecoder for KeystoneToken
- RGW_Auth_S3::authorize now uses rgw_store_user_info on keystone auth
- Minor fix in get_canon_resource; dout is now after the assignment
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Sage Weil [Tue, 27 Aug 2013 22:25:50 +0000 (15:25 -0700)]
osd: COPY_GET operation
Add new rados operation to copy all user-visible content for an object
in a simple, safe way. Use a new object_copy_cursor_t to keep track of
our position.
Sage Weil [Sun, 25 Aug 2013 04:58:11 +0000 (21:58 -0700)]
osd/ReplicatedPG: factor {execute,reply}_ctx() out of do_op()
Separate the processing of an OpContext from the preamble and
allocation, so that we can delay the execution for some ops (like the
COPY_FROM operation we're about to add).
Sage Weil [Sat, 17 Aug 2013 06:33:06 +0000 (23:33 -0700)]
osd: feed OSDMaps to the Objecter
Feed every map message we see (that isn't discarded for some other
reason) to the Objecter. It has the same continuity requirements that
the OSD has, so it should be satisfied with what we get. It can also
request maps via our MonClient.
Sage Weil [Mon, 26 Aug 2013 20:58:47 +0000 (13:58 -0700)]
osd: discriminate based on connection messenger, not peer type
Replace the ->get_source().is_osd() checks with a check on whether the
connection's messenger is the cluster_messenger, so that we do not confuse
ourselves when we get legitimate requests from other OSDs on our public
interface.
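Sketch only, with Connection and Messenger reduced to stand-ins: the test
keys off which messenger the request arrived on rather than the peer's type:

    struct Messenger {};
    struct Connection { Messenger *msgr = nullptr; };

    bool is_cluster_peer(const Connection *con, const Messenger *cluster_messenger)
    {
      // An OSD talking to us over the public interface is not treated as a
      // cluster peer; only traffic arriving on the cluster messenger is.
      return con->msgr == cluster_messenger;
    }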