osd: implement basic caching policies in ReplicatedPG
Right now these are very basic and aren't as sophisticated as we
want them to end up, but we have a skeleton for where to put the
decision-making logic.
If we get back a redirect reply, we clean up the Op's external references
and re-send using the target_oloc and target_oid. To facilitate this,
recalc_op_target() now only fills them in and overrides them with pool
cache semantics if they're empty.
Keep in mind that this is a pretty simple redirect formula -- the
Objecter will keep following redirects forever if that's what the OSDs
send back. The client is not providing any synchronization right now.
Objecter: write a helper function to clean up ops that need to be retried
We have a little block to clean them up if we get back EAGAIN, but it's
actually leaking map references; we will also use this for redirects
from the OSDs.
When present, clients must send the request to the location specified
by the redirect (by using the combine_with_locator() function on
request_redirect_t).
A separate mechanism must be used to ensure that clients see and respect
the redirect, as we do not bump up the minimum required version to
decode.
Analagous to the oloc->base_oloc rename we did in e2fcad09d94d965867147627b73e99da9454436f, we may specify a different
target name for a redirect. Rename the existing oid to base_oid to
avoid any confusion.
Greg Farnum [Fri, 30 Aug 2013 23:33:31 +0000 (16:33 -0700)]
ReplicatedPG: do not meaninglessly fill in the reqid on make_writeable() cloning
This reqid is used to fill in a map that is used for giving clients
the correct version on replayed ops, unless the reqid is blank (in
which case it doesn't go into the map). Indirect clones are not ops
that the client wants to observe and the version is incorrect right now,
so don't fill it in.
Note that this should not have actually caused any buggy behavior, because
the map would have simply been overwritten by the real requested event
a short time later (while still protected by locks and things). But it's
very confusing.
Sage Weil [Wed, 28 Aug 2013 22:04:16 +0000 (15:04 -0700)]
osd: initial COPY_FROM (not viable for large objects)
Initial pass at COPY_FROM implementation. This uses COPY_GET to read an
object from another OSD and write it locally. It chunks the read but
accumulates it all in-memory and commits it at once, so it is only suitable
for smaller objects.
Sage Weil [Mon, 26 Aug 2013 23:24:16 +0000 (16:24 -0700)]
objecter, librados: add COPY_FROM operation
This operation will copy an entire object (data, attrs, omap)
atomically. If the src_version does not match the source object, or
the source object is updated while the copy is in progress, we will
fail with a suitable error code. By atomic we mean that it will either
successfully copy the entire object in its entirety or it will fail (and
require no cleanup).
Add to C++ librados API only for now.
Signed-off-by: Sage Weil <sage@inktank.com>
Conflicts:
Yehuda Sadeh [Thu, 29 Aug 2013 20:06:33 +0000 (13:06 -0700)]
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Moving back the watch initialization after the zone init,
as the zone info holds the control pool name. Since zone
init might need to create a new system object (that needs
to distribute cache), don't try to distribute cache if
watch is not yet initialized.
Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
osd: provide better version bounds for cls_current_version and ENOENT replies
Following the changes to when we set or increase the user_version, we
want to continue to return the best lower bound we can on the version
of any newly-created object. For ENOENT replies that means returning
info.last_user_version instead of the (potentially-zero) ctx->user_at_version.
Similarly, for cls_current_version we want to return the last version on
the PG rather than the last update to the object in order to provide
sensible version ordering across object deletes and creates.
Update the versions doc so it continues to be precise.
Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 31 Aug 2013 00:15:56 +0000 (17:15 -0700)]
osd/PG: only raise PG's last_user_version if entry is >
We may have pg entries that do not increase the user_version at all (i.e.,
they may be 0). Do not update the last_user_version in that case as we
need it to remain an upper bound.
- Added config option to allow S3 to use Keystone auth
- Implemented JSONDecoder for KeystoneToken
- RGW_Auth_S3::authorize now uses rgw_store_user_info on keystone auth
- Minor fix in get_canon_resource; dout is now after the assignment
Reviewed-by: Yehuda Sadeh<yehuda@inktank.com> Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Sage Weil [Tue, 27 Aug 2013 22:25:50 +0000 (15:25 -0700)]
osd: COPY_GET operation
Add new rados operation to copy all user-visible content for an object
in a simple, safe way. Use a new object_copy_cursor_t to keep track of
our position.
Sage Weil [Sun, 25 Aug 2013 04:58:11 +0000 (21:58 -0700)]
osd/ReplicatedPG: factor {execute,reply}_ctx() out of do_op()
Separate the processing of an OpContext from the preamble and
allocation, so that we can delay the execution for some ops (like the
COPYFROM operation we're about to add).
Sage Weil [Sat, 17 Aug 2013 06:33:06 +0000 (23:33 -0700)]
osd: feed OSDMaps to the Objecter
Feed every map message we see (that isn't discarded for some other
reason) to the Objecter. It has the same continuity requirements that
the OSD has, so it should be satisfied with what we get. It can also
request maps via our MonClient.
Sage Weil [Mon, 26 Aug 2013 20:58:47 +0000 (13:58 -0700)]
osd: discriminate based on connection messenger, not peer type
Replace ->get_source().is_osd() checks and instead see if it is the
cluster_messenger so that we do not confuse ourselves when we get
legit requests from other OSDs on our public interface.
Greg Farnum [Thu, 29 Aug 2013 22:26:08 +0000 (15:26 -0700)]
workunits: add a test for caching redirects
This may need to change since it exploits some of the loose
consistency we currently have with caching pools, but for now
it checks that the Objecter does what we want.
Greg Farnum [Thu, 29 Aug 2013 20:52:35 +0000 (13:52 -0700)]
Objecter: be careful about precalculated pgids
The only current user of the precalc_pgid field is list_objects. That's
fine, but we don't want new users to inadvertently appear and somehow
break the caching/tiering stuff by forcing us to go to the base pool
when we should be talking to somebody else. Add an assert to catch
these cases.
Greg Farnum [Thu, 29 Aug 2013 20:08:03 +0000 (13:08 -0700)]
Objecter: rename Op::oloc -> Op::base_oloc
We want to be able to target other pools for caching and tiering, so
we need to take an oloc from the client and translate it into an
actual target. Rename oloc to base_oloc to make clear which one it is.
Samuel Just [Thu, 29 Aug 2013 22:08:58 +0000 (15:08 -0700)]
PGLog: initialize writeout_from in PGLog constructor
Fixes: 6151
Backport: dumpling Signed-off-by: Samuel Just <sam.just@inktank.com>
Introduced: f808c205c503f7d32518c91619f249466f84c4cf Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 26 Aug 2013 22:59:54 +0000 (15:59 -0700)]
osd_types: add pg_pool_t cache-related fields
We add fields sufficient to specify
* many pools have a tiering relationship with pool foo
* pool foo is a tier pool for pool bar
* the tiering relationship between foo and bar is specified
by cache_mode
* client reads and writes for pool foo should be directed to
pools bar and baz, respectively (where probably, but not
necessarily, baz == bar or baz == foo).
This lets us specify very sophisticated caching policies on
the server side that all clients going forward can handle
simply by directing the messages as the read_tier and write_tier
flags, and the (not-yet-implemented) redirect replies
from OSDs, specify.
Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 29 Aug 2013 21:27:46 +0000 (14:27 -0700)]
osd/ReplicatedPG: remove debug lines from snapset_context get/put
The dout() prefix does get_osdmap(), which requires (and asserts) that we
hold the pg lock, but in some cases we do not, notably
ReplicatedPG::object_context_destructor_callback.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Wed, 28 Aug 2013 23:29:16 +0000 (16:29 -0700)]
osd/ReplicatedPG: set version, user_version correctly on reads
Set the user version to the *current* object version, not the version
we would use if we were to modify it. We move the assignments inside
the reply (read or error) block to make it more obvious which paths
are possible.
Sage Weil [Wed, 28 Aug 2013 16:50:11 +0000 (09:50 -0700)]
mon: discover mon addrs, names during election state too
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form an quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
Samuel Just [Tue, 27 Aug 2013 06:19:45 +0000 (23:19 -0700)]
PGLog: move the log size check after the early return
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
Fixes: #6040
Backport: Dumpling Signed-off-by: Samuel Just <sam.just@inktank.com>