Analagous to the oloc->base_oloc rename we did in e2fcad09d94d965867147627b73e99da9454436f, we may specify a different
target name for a redirect. Rename the existing oid to base_oid to
avoid any confusion.
Greg Farnum [Fri, 30 Aug 2013 23:33:31 +0000 (16:33 -0700)]
ReplicatedPG: do not meaninglessly fill in the reqid on make_writeable() cloning
This reqid is used to fill in a map that is used for giving clients
the correct version on replayed ops, unless the reqid is blank (in
which case it doesn't go into the map). Indirect clones are not ops
that the client wants to observe and the version is incorrect right now,
so don't fill it in.
Note that this should not have actually caused any buggy behavior, because
the map would have simply been overwritten by the real requested event
a short time later (while still protected by locks and things). But it's
very confusing.
Sage Weil [Wed, 28 Aug 2013 22:04:16 +0000 (15:04 -0700)]
osd: initial COPY_FROM (not viable for large objects)
Initial pass at COPY_FROM implementation. This uses COPY_GET to read an
object from another OSD and write it locally. It chunks the read but
accumulates it all in-memory and commits it at once, so it is only suitable
for smaller objects.
Sage Weil [Mon, 26 Aug 2013 23:24:16 +0000 (16:24 -0700)]
objecter, librados: add COPY_FROM operation
This operation will copy an entire object (data, attrs, omap)
atomically. If the src_version does not match the source object, or
the source object is updated while the copy is in progress, we will
fail with a suitable error code. By atomic we mean that it will either
successfully copy the entire object in its entirety or it will fail (and
require no cleanup).
Add to C++ librados API only for now.
Signed-off-by: Sage Weil <sage@inktank.com>
Conflicts:
Yehuda Sadeh [Thu, 29 Aug 2013 20:06:33 +0000 (13:06 -0700)]
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Moving back the watch initialization after the zone init,
as the zone info holds the control pool name. Since zone
init might need to create a new system object (that needs
to distribute cache), don't try to distribute cache if
watch is not yet initialized.
Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
osd: provide better version bounds for cls_current_version and ENOENT replies
Following the changes to when we set or increase the user_version, we
want to continue to return the best lower bound we can on the version
of any newly-created object. For ENOENT replies that means returning
info.last_user_version instead of the (potentially-zero) ctx->user_at_version.
Similarly, for cls_current_version we want to return the last version on
the PG rather than the last update to the object in order to provide
sensible version ordering across object deletes and creates.
Update the versions doc so it continues to be precise.
Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 31 Aug 2013 00:15:56 +0000 (17:15 -0700)]
osd/PG: only raise PG's last_user_version if entry is >
We may have pg entries that do not increase the user_version at all (i.e.,
they may be 0). Do not update the last_user_version in that case as we
need it to remain an upper bound.
- Added config option to allow S3 to use Keystone auth
- Implemented JSONDecoder for KeystoneToken
- RGW_Auth_S3::authorize now uses rgw_store_user_info on keystone auth
- Minor fix in get_canon_resource; dout is now after the assignment
Reviewed-by: Yehuda Sadeh<yehuda@inktank.com> Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Sage Weil [Tue, 27 Aug 2013 22:25:50 +0000 (15:25 -0700)]
osd: COPY_GET operation
Add new rados operation to copy all user-visible content for an object
in a simple, safe way. Use a new object_copy_cursor_t to keep track of
our position.
Sage Weil [Sun, 25 Aug 2013 04:58:11 +0000 (21:58 -0700)]
osd/ReplicatedPG: factor {execute,reply}_ctx() out of do_op()
Separate the processing of an OpContext from the preamble and
allocation, so that we can delay the execution for some ops (like the
COPYFROM operation we're about to add).
Sage Weil [Sat, 17 Aug 2013 06:33:06 +0000 (23:33 -0700)]
osd: feed OSDMaps to the Objecter
Feed every map message we see (that isn't discarded for some other
reason) to the Objecter. It has the same continuity requirements that
the OSD has, so it should be satisfied with what we get. It can also
request maps via our MonClient.
Sage Weil [Mon, 26 Aug 2013 20:58:47 +0000 (13:58 -0700)]
osd: discriminate based on connection messenger, not peer type
Replace ->get_source().is_osd() checks and instead see if it is the
cluster_messenger so that we do not confuse ourselves when we get
legit requests from other OSDs on our public interface.
Greg Farnum [Thu, 29 Aug 2013 22:26:08 +0000 (15:26 -0700)]
workunits: add a test for caching redirects
This may need to change since it exploits some of the loose
consistency we currently have with caching pools, but for now
it checks that the Objecter does what we want.
Greg Farnum [Thu, 29 Aug 2013 20:52:35 +0000 (13:52 -0700)]
Objecter: be careful about precalculated pgids
The only current user of the precalc_pgid field is list_objects. That's
fine, but we don't want new users to inadvertently appear and somehow
break the caching/tiering stuff by forcing us to go to the base pool
when we should be talking to somebody else. Add an assert to catch
these cases.
Greg Farnum [Thu, 29 Aug 2013 20:08:03 +0000 (13:08 -0700)]
Objecter: rename Op::oloc -> Op::base_oloc
We want to be able to target other pools for caching and tiering, so
we need to take an oloc from the client and translate it into an
actual target. Rename oloc to base_oloc to make clear which one it is.
Samuel Just [Thu, 29 Aug 2013 22:08:58 +0000 (15:08 -0700)]
PGLog: initialize writeout_from in PGLog constructor
Fixes: 6151
Backport: dumpling Signed-off-by: Samuel Just <sam.just@inktank.com>
Introduced: f808c205c503f7d32518c91619f249466f84c4cf Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 26 Aug 2013 22:59:54 +0000 (15:59 -0700)]
osd_types: add pg_pool_t cache-related fields
We add fields sufficient to specify
* many pools have a tiering relationship with pool foo
* pool foo is a tier pool for pool bar
* the tiering relationship between foo and bar is specified
by cache_mode
* client reads and writes for pool foo should be directed to
pools bar and baz, respectively (where probably, but not
necessarily, baz == bar or baz == foo).
This lets us specify very sophisticated caching policies on
the server side that all clients going forward can handle
simply by directing the messages as the read_tier and write_tier
flags, and the (not-yet-implemented) redirect replies
from OSDs, specify.
Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 29 Aug 2013 21:27:46 +0000 (14:27 -0700)]
osd/ReplicatedPG: remove debug lines from snapset_context get/put
The dout() prefix does get_osdmap(), which requires (and asserts) that we
hold the pg lock, but in some cases we do not, notably
ReplicatedPG::object_context_destructor_callback.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Wed, 28 Aug 2013 23:29:16 +0000 (16:29 -0700)]
osd/ReplicatedPG: set version, user_version correctly on reads
Set the user version to the *current* object version, not the version
we would use if we were to modify it. We move the assignments inside
the reply (read or error) block to make it more obvious which paths
are possible.
Sage Weil [Wed, 28 Aug 2013 16:50:11 +0000 (09:50 -0700)]
mon: discover mon addrs, names during election state too
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form an quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
Samuel Just [Tue, 27 Aug 2013 06:19:45 +0000 (23:19 -0700)]
PGLog: move the log size check after the early return
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
Fixes: #6040
Backport: Dumpling Signed-off-by: Samuel Just <sam.just@inktank.com>
Greg Farnum [Wed, 28 Aug 2013 00:24:24 +0000 (17:24 -0700)]
ReplicatedPG: add OpContext::user_at_version
Set this up with the existing at_version member, but only increase
it for user_modify ops. Use this when logging the PG's user_version. In
order to maintain compatibility with old clients on classic pools, we
force user_version to follow at_version whenever it's updated.
Now that we have and are maintaining this PG user version, use it
for the user version on ops that get ENOENT back, when short-circuiting
replies as part of reply_op_error()[1], or when replying to repops
in eval_repop; further use it for the cls_current_version() function. This
is a small semantic change for that function, as previously it would
generally return the same value as the user would get sent back via
MOSDOpReply -- but I don't think it was something you could count on.
We now define it as being the user version of the PG at the start of the
op, and as a bonus it is defined even for read ops (the at_version is
only filled in on write operations).
[1]: We tweak PGLog to make it easier to retrieve both user and PG versions.
Greg Farnum [Tue, 27 Aug 2013 18:02:44 +0000 (11:02 -0700)]
MOSDOpReply: add enough fields to be backwards compatible.
The system we've been building up works out very nicely for new clients,
but they could not have interoperated with old clients that were only
referring to our replay_version. In order to deal with this, we add
a bad_replay_version to MOSDOpReply which is encoded where we used
to encode replay_version. bad_replay_version will follow the same semantics
as reassert_version used to (except that it is filled in on reads), but
is not accessible to new clients, who can see only our properly-controlled
replay_version and user_version. This will let old and new clients
interoperate correctly when communicating about watches, etc.