Greg Farnum [Wed, 28 Aug 2013 00:24:24 +0000 (17:24 -0700)]
ReplicatedPG: add OpContext::user_at_version
Set this up with the existing at_version member, but only increase
it for user_modify ops. Use this when logging the PG's user_version. In
order to maintain compatibility with old clients on classic pools, we
force user_version to follow at_version whenever it's updated.
Now that we have and are maintaining this PG user version, use it
for the user version on ops that get ENOENT back, when short-circuiting
replies as part of reply_op_error()[1], or when replying to repops
in eval_repop; further use it for the cls_current_version() function. This
is a small semantic change for that function, as previously it would
generally return the same value as the user would get sent back via
MOSDOpReply -- but I don't think it was something you could count on.
We now define it as being the user version of the PG at the start of the
op, and as a bonus it is defined even for read ops (the at_version is
only filled in on write operations).
[1]: We tweak PGLog to make it easier to retrieve both user and PG versions.
Greg Farnum [Tue, 27 Aug 2013 18:02:44 +0000 (11:02 -0700)]
MOSDOpReply: add enough fields to be backwards compatible.
The system we've been building up works out very nicely for new clients,
but they could not have interoperated with old clients that were only
referring to our replay_version. In order to deal with this, we add
a bad_replay_version to MOSDOpReply which is encoded where we used
to encode replay_version. bad_replay_version will follow the same semantics
as reassert_version used to (except that it is filled in on reads), but
is not accessible to new clients, who can see only our properly-controlled
replay_version and user_version. This will let old and new clients
interoperate correctly when communicating about watches, etc.
Greg Farnum [Wed, 28 Aug 2013 00:14:56 +0000 (17:14 -0700)]
osd: actually fill in user_version in pg_log_entry_t
We now require it when creating a pg_log_entry_t. The user_version
is the version which info.last_user_version should be set to
after the transaction is applied, which for everything except for
a user-modify op is going to be the version it was already at.
For now we are filling in the user-modify op's changing user_version
to be ctx->at_version.version
Greg Farnum [Wed, 21 Aug 2013 18:26:28 +0000 (11:26 -0700)]
osd: add last_user_version to pg_info_t
We add a corresponding user_version to pg_log_entry_t, and the logic
to assign from one to the other and to recover last_user_version from
a master's log. We aren't yet setting it to anything, though.
ctx->new_obs.oi.user_version is initialized to ctx->obs.oi.user_version,
and for read ops it won't be changed. That means
reply_user_version == ctx->new_obs.oi.user_version in all cases, which
means we don't want it.
Greg Farnum [Wed, 21 Aug 2013 00:13:53 +0000 (17:13 -0700)]
osd: switch object_info_t::user_version to be a version_t
We never expose the full eversion_t data to users, and do not want to.
However, we pull some tricks in the encode/decode functions to avoid
having to change the object_info_t disk format for this change.
When we can break compatibility, we should simplify this.
Greg Farnum [Tue, 20 Aug 2013 23:22:27 +0000 (16:22 -0700)]
ReplicatedPG: Fill in the MOSDOpReply's user_version
As part of this, rename OpContext::reply_version->reply_user_version.
The semantics that necessitate the reply_version are only for user versions,
so rename it for clarity. Then use the reply_user_version in
set_user_version() (if the op succeeded).
For now we use the PG version for ENOENT (preserving the previous
semantics), but that will get changed to the pg's user_version soon
as well.
Greg Farnum [Tue, 20 Aug 2013 21:21:04 +0000 (14:21 -0700)]
Objecter: librados: mass switch from eversion_t to version_t
There are a lot of pointers throughout our request infrastructure used solely
for exporting the version to users. The interfaces we actually expose only
provide a uint64_t (leaving off eversion_t's epoch), and that's all we're
going to maintain in our new user_version scheme, so don't pretend we'll
have more in our internal interfaces.
I audited this pretty carefully; in particular:
Op::objver is only used for passing data back to users via the calling
functions IoCtxImpl::last_objver, etc
IoCtxImpl::last_objver is used only for the set_sync_op_version() call, which
provides data only for the uint64_t get_last_version() and
rados_get_last_version() calls.
AioCompletionImpl::objver is used only for the uint64_t get_version() call.
LingerOp::pobjver is used only for referencing things that are now version_t.
Greg Farnum [Thu, 22 Aug 2013 21:54:19 +0000 (14:54 -0700)]
ReplicatedPG: do not do a redundant set of ctx->new_obs.oi.version
We set this in the if below for writes, and for reads it doesn't need to
be updated (and isn't). Remove the confusing double-set so future code
inspectors don't get concerned there's a bug like I did.
Greg Farnum [Mon, 26 Aug 2013 21:38:30 +0000 (14:38 -0700)]
ReplicatedPG: remove long-dead branch
This was confusing the heck out of me when trying to figure out
why I was hitting an assert. So replace the if-else block with
a more appropriate assert and don't include any misleading calls
to prepare_transaction() from sub_op_modify().
We have been returning the object's "user version" and using that
for replay, but that is in fact incorrect. In preparation for fixing
up the user version semantics, rename get_version to get_replay_version
and set_version to set_replay_version.
Loic Dachary [Tue, 20 Aug 2013 14:17:10 +0000 (16:17 +0200)]
erasure code : plugin, interface and glossary documentation updates
* replace the erasure code plugin abstract interface with a doxygen link
that will be populated when the header shows in master
* update the plugin documentation to reflect the current draft implementation
* fix broken link to PGBackend-h
* add a glossary to define chunk, stripe, shard and strip with a drawing
Sage Weil [Mon, 19 Aug 2013 05:34:24 +0000 (22:34 -0700)]
osd/ReplicatedPG: remove broken AccessMode logic
The original intent here was to handle reads in two modes. For
workloads with read/modify/write ops, the RMW mode would:
- queue writes for local store and replicas immediately
- block reads until the write commits to all replicas
For mixed read/write workloads without read/modify/write ops, the
DELAYED mode would:
- queue writes for replicas
- allow local reads
- once replicas commit, queue write locally
- block local reads until local write completes
In reality, we never use the DELAYED mode. It's untested and possibly
broken, and it is unlikely we will see a workload where it is important
in the near to mid term.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Greg Farnum [Mon, 19 Aug 2013 17:29:49 +0000 (10:29 -0700)]
librados: synchronous commands should return on commit instead of ack
This is unlikely to be noticed by anybody, but it is a big change. Document
in the PendingReleaseNotes and bump up the librados minor version number
to 68.
Greg Farnum [Mon, 19 Aug 2013 17:21:16 +0000 (10:21 -0700)]
mon: make MonMap error message about unspecified monitors less specific.
The error message helpfully references the -m and -c CLI options for
specifying monitors, but this code can be invoked from non-core librados
client applications so that's unfortunately not kosher. Remove the
reference.
Sage Weil [Sat, 17 Aug 2013 21:30:37 +0000 (14:30 -0700)]
objclass: move cls_log into class_api.cc
Not sure why but this seems to resolve a linking problem when loading
classes:
2013-08-17 13:28:19.015776 7fb2bcffa700 0 _load_class could not open class /usr/lib/rados-classes/libcls_hello.so (dlopen failed): /usr/lib/rados-classes/libcls_hello.so: undefined symbol: cls_log
2013-08-17 13:28:19.015786 7fb2bcffa700 -1 osd.4 12 class hello open got (5) Input/output error
Sage Weil [Sat, 17 Aug 2013 05:08:00 +0000 (22:08 -0700)]
mds: create only one ESubtreeMap during fs creation
Previously we would create an empty ESubtreeMap when we opened the log
segment and then immediately journal a second one that created the root
and mdsdir. More importantly, for the second ESubtreeMap, we would not
wait for it to commit before requesting the ACTIVE state, leading to
#4894.
Instead, break start_new_segment() into two steps: one that creates the
in-memory LogSegment tracking structure, and one that journals the
ESubtreeMap. Open things early and write the (one) ESubtreeMap at the
end of boot_create().. and then wait for it.
Fixes: #4894 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Sage Weil [Sat, 17 Aug 2013 00:59:11 +0000 (17:59 -0700)]
ceph-post-file: single command to upload a file to cephdrop
Use sftp to upload to a directory that only this user and ceph devs can
access.
Distribute an ssh key to connect to the account. This will let us revoke
the key in the future if we feel the need. Also distribute a known_hosts
file so that users have some confidence that they are connecting to the
real ceph drop account and not some third party.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Sage Weil [Thu, 15 Aug 2013 23:19:21 +0000 (16:19 -0700)]
osd: enforce RD, WR flags for class methods
Class methods are marked with RD and WR to help the OSD decide when we need
to flush objects or require certain permissions. Ensure that methods do
not step outside their advertised capabilities by keeping a counter of rd
and wr ops we perform in do_osd_ops() and making sure that class methods,
and any ops the indirectly call, do not break the rules.
Sage Weil [Thu, 15 Aug 2013 22:22:41 +0000 (15:22 -0700)]
cls_rbd: remove old assign_bid method
This method is problematic because it both writes/mutates and returns data,
which means that an untimely client disconnect or peering event will result
in a success to the client with no payload.
It has not been used since v0.52 (18054ba46fe2779d8df8b1a0d69ec93ca6a66c34)
which is pre-bobtail; so this change breaks compatibility with pre-bobtail
librbd clients (at least for image creation).
Sage Weil [Thu, 15 Aug 2013 22:06:38 +0000 (15:06 -0700)]
osd: do not return data payload for successful writes
We were somewhat inadvertantly returning a data payload for write
operations. This was a side-effect of the OpContext::ops field being a
reference to MOSDOp::ops: the return data would end up there, and then
the MOSDOpReply ctor would copy it.
Fix this by breaking the ref, and making the do_op() logic also claim
return result data for error values (so that errors can return data to the
caller).
Sage Weil [Thu, 15 Aug 2013 21:35:28 +0000 (14:35 -0700)]
common/Preforker: shut up warning
common/Preforker.h: In member function 'void Preforker::daemonize()':
common/Preforker.h:97:40: warning: ignoring return value of 'ssize_t write(int, const void*, size_t)', declared with attribute warn_unused_result [-Wunused-result]
Sage Weil [Thu, 15 Aug 2013 21:36:57 +0000 (14:36 -0700)]
config: fix stringification of config values
The std::copy construct leaves a trailing separator character, which breaks
parsing for booleans (among other things) and probably mangles everything
else too.
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>