Sage Weil [Thu, 23 Jan 2014 17:16:54 +0000 (09:16 -0800)]
osd/OSDMap: do not create erasure rule by default
If we do, we will require the v2 feature bit from clients.
We could only include feature bits for rules that are actually referenced
by pools, but for now making the user create the rule is simpler. There is
no need to create this rule ahead of time.
Sage Weil [Mon, 13 Jan 2014 23:09:27 +0000 (15:09 -0800)]
osd/ReplicatedPG: use get_object_context in trim_object
find_object_context() has all the logic to choose a particular clone given
a logical snap. In the trim case, we want none of that: we just need to
pull the obc for a specific clone instance. Note that this changes
none of the failure cases (previously we asserted r == 0).
Sage Weil [Fri, 10 Jan 2014 00:04:21 +0000 (16:04 -0800)]
osd/ReplicatedPG: update ObjectContext's object_info_t for new hit_set objects
We were fabricating an object_info_t correctly and writing it to disk, but
it was not reflected by the in-memory ObjectContext. If something came
along quickly (like backfill) and tried to use it, the info would be
invalid.
Fix this by fabricating it in the obc and copying it to the new_obs for
the update.
Fixes: #7122
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 9 Jan 2014 22:49:52 +0000 (14:49 -0800)]
osd/ReplicatedPG: always return ENOENT on deleted snap
Previously, if a snap was deleted but the clone was there and we hadn't
trimmed it yet, we would still return the data. Instead, return ENOENT
unconditionally (even if it's not removed yet). This makes the behavior from
the client perspective more predictable and consistent.
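
A minimal sketch of the rule described above, not the actual ReplicatedPG code. The `snapid_t` alias, the `removed_snaps` set, and `check_snap_readable()` are hypothetical stand-ins introduced only for illustration:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <set>

// Hypothetical stand-in for a snapshot id; not Ceph's real type.
using snapid_t = std::uint64_t;

// Returns 0 if a read of `snap` may proceed, or -ENOENT if the snap has been
// deleted. Whether the clone is still on disk (not yet trimmed) is ignored on
// purpose, so the result does not depend on the trimmer's progress.
int check_snap_readable(const std::set<snapid_t>& removed_snaps,
                        snapid_t snap,
                        bool clone_still_on_disk) {
  (void)clone_still_on_disk;  // deliberately unused
  if (removed_snaps.count(snap) > 0)
    return -ENOENT;
  return 0;
}
```

The point of the sketch is that the untrimmed-clone flag never influences the return value, which is what makes the client-visible behavior predictable.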
Sage Weil [Thu, 9 Jan 2014 10:01:48 +0000 (02:01 -0800)]
ceph_test_rados_api_tier: partial test for promote vs snap trim race
This reliably returns ENODEV due to the test at the finish of flush. Not
because we are actually racing with trim, though: the trimmer doesn't run
at all. I believe it captures the important property, though. Namely:
we should not write a promoted object that is "behind" the snap trimmer's
progress. The fact that we are in front of it (the trimmer hasn't started
yet) should not matter since the object is logically deleted anyway.
We probably want to make the OSD return ENODEV on read in the normal case
when you try to access a clone that is pending trimming.
Sage Weil [Mon, 6 Jan 2014 01:44:49 +0000 (17:44 -0800)]
osd/ReplicatedPG: cleanly abort flush if the object no longer exists
If the object no longer exists (for example, because the snap trimmer just
killed it) clean up the flush state without trying to mark the object
clean.
Previously the caller was generating a temp object name and passing it
down in several different ways. Instead, generate one when we realize
that we need it, and store it in *one* place (CopyResults), where
the completions can get at the information.
Sage Weil [Mon, 30 Dec 2013 22:56:54 +0000 (14:56 -0800)]
osd/ReplicatedPG: handle promotion of rollback, src_oids, etc.
Make other find_object_context() callers handle the case where the object
in question needs to be promoted. We add a flag here that forces a promote
for these secondary objects so that the entire operation happens in the
same pool. Forwarding is not allowed in this case.
Sage Weil [Mon, 30 Dec 2013 20:54:03 +0000 (12:54 -0800)]
osd/ReplicatedPG: preserve clean/dirty state on clone
If we have a clean object and clone it in make_writeable(), the clone
should also be clean (it does not need to be written back to the base
pool). If the object was dirty, the clone should be dirty.
Sage Weil [Mon, 30 Dec 2013 20:57:28 +0000 (12:57 -0800)]
osd/ReplicatedPG: infer snaps from head when promoting oldest clean clone
Consider:
- base and cache have same object foo; marked clean in cache pool
- modify + clone foo in cache pool. foo clone is clean.
- foo clone is evicted
- foo clone is read, and promoted
- we read foo@something from base pool, and get the head's content
copy-get does not provide us with a snaps list. Instead, we use the
snap_seq from the head to infer what the snaps vector was in the cache
pool and will be in the base pool when we flush the updates to the object.
Sage Weil [Mon, 30 Dec 2013 19:47:33 +0000 (11:47 -0800)]
osd: include snap_seq in copy-get results
This is needed by the cache layer when reading a logical snap from a head
object on the backend in order to correctly recreate the clone in the
cache layer.
Sage Weil [Fri, 27 Dec 2013 21:42:54 +0000 (13:42 -0800)]
osd/ReplicatedPG: refuse to flush when older dirty clones are present
If the next oldest clone is dirty, we cannot flush. That is, we must
always flush starting with the oldest dirty clone.
Note that we can never have a sequence like dirty -> clean -> dirty,
because clones are only dirty on creation, are created in order, and cannot
be flushed (cleaned) out of order. Thus checking the previous clone is
sufficient (and thankfully cheap).
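
The ordering argument above can be sketched as follows. This is an illustration only: `CloneState` and `can_flush()` are hypothetical names, and the clones are assumed to be held in a vector ordered oldest to newest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-clone state; not a Ceph type.
struct CloneState {
  unsigned long id;
  bool dirty;
};

// `clones` is ordered oldest to newest. Returns true if clones[i] may be
// flushed, i.e. the next-oldest clone (if any) is not dirty. Because a
// dirty -> clean -> dirty sequence cannot occur, looking at the immediate
// predecessor is enough.
bool can_flush(const std::vector<CloneState>& clones, std::size_t i) {
  if (i >= clones.size())
    return false;  // no such clone
  if (i == 0)
    return true;   // oldest clone: nothing older can block it
  return !clones[i - 1].dirty;
}
```

Only one neighbor is inspected per flush request, which is why the check is cheap.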
Sage Weil [Sat, 28 Dec 2013 00:11:27 +0000 (16:11 -0800)]
osd/ReplicatedPG: allow cache-evict on snaps
We do three things here:
- make cache-evict a CACHE instead of WR op, allowing us to submit it
on snaps (not just head)
- allow eviction of a snap
- verify that all snaps are missing before evicting a head
Sage Weil [Fri, 27 Dec 2013 19:15:19 +0000 (11:15 -0800)]
osd: add rados CACHE mode (different from RD and WR)
It is useful to distinguish cache operations from read and modify
operations. Specifically, we will allow cache ops to be sent for
snaps and also allow those ops to result in a write.
Sage Weil [Fri, 27 Dec 2013 01:32:43 +0000 (17:32 -0800)]
osd/ReplicatedPG: update snap_mapper for promoted clones
A clone that comes into existence via promotion takes an entirely
different path than a typical clone (which comes into existence via a
CLONE op in make_writeable()). Make sure snap_mapper is updated
accordingly.
Sage Weil [Thu, 26 Dec 2013 17:19:08 +0000 (09:19 -0800)]
osd/ReplicatedPG: always encode snaps in finish_ctx
On promote we use finish_ctx to build the final log entries, and need to
encode the snaps vector in that case. (Normally this is done by
make_writeable or explicitly by the snap trimmer.)
Sage Weil [Fri, 27 Dec 2013 02:05:22 +0000 (18:05 -0800)]
osd/ReplicatedPG: mirror SnapSet info when promoting head
When we promote the head for an object, get the list of snaps from the
backend pool and construct an appropriate SnapSet. Note that this is
always placed on the head in the cache pool, since we will have a
whiteout object in this case.
Also note that the SnapSet's list of snapids will not include any snaps
for which there were no clones. This is fine, since it is only used for
creating clones, and we've already done that.
Sage Weil [Tue, 24 Dec 2013 16:50:38 +0000 (08:50 -0800)]
osd/ReplicatedPG: include snaps in copy-get results
When promoting a snapped object, we need to also get the set of snaps over
which the clone is defined. This is not strictly available except via the
list-snaps rados call, but that is only used on the snapdir object much
earlier when the head (whiteout) is promoted, and is not conveniently
available now. Adding it to the internal copy-get does not expose it via
librados (copy-get is not exposed at all), so I don't think this is a
problem.
Sage Weil [Tue, 24 Dec 2013 01:25:07 +0000 (17:25 -0800)]
osd/ReplicatedPG: make find_object_context() pass missing_oid
Previously we would return a snapid that we are blocked on if it is
missing. This is necessary because the missing clone does not always
match the logical snap we are trying to read.
Extend this to return a full hobject_t that is the missing object we want.
For the missing clone case, this cleans things up slightly. More
importantly, it lets find_object_context also tell us which on-disk
object is missing that, if it could be promoted, would help.
Sage Weil [Mon, 13 Jan 2014 23:50:29 +0000 (15:50 -0800)]
buffer: do not append trailing newline when appending empty istream
If we call
bl.append(some_istream);
do not include a \n if the istream is empty (which apparently is not
the same thing as eof). This was causing 'ceph pg getmap' to include a
trailing newline.
Probably we don't want this newline at all! But all callers need to be
fixed for that change.
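
A minimal sketch of the behavior described above, assuming a line-by-line append loop. It uses a plain std::string as a stand-in for the bufferlist, and `append_stream()` is a hypothetical name, not the real buffer API:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Append the contents of `in` to `out` line by line. A "\n" is added only
// after a non-empty line, so an empty stream contributes nothing at all.
void append_stream(std::string& out, std::istream& in) {
  while (!in.eof()) {
    std::string line;
    std::getline(in, line);  // on an empty stream this reads nothing and sets eof
    out.append(line);
    if (!line.empty())
      out.append("\n");
  }
}
```

Note that in this sketch non-empty input still ends with a newline, and blank lines in the middle of the input are dropped as a side effect of the same check.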
David Zafman [Thu, 21 Nov 2013 23:21:53 +0000 (15:21 -0800)]
osd: Interim backfill changes
Make peer_backfill_info a map which holds a
BackfillInterval for all backfill targets.
Initially see if recover_backfill() can just backfill
the first one and mark them all finished.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Loic Dachary [Sun, 12 Jan 2014 16:34:52 +0000 (17:34 +0100)]
doc: update the crushtool manual page
* add information about CEPH_ARGS
* rework the --build documentation and example
* add an Author section
* replace vi with emacs for no good reason
* cleanup whitespace
Loic Dachary [Sun, 12 Jan 2014 16:24:39 +0000 (17:24 +0100)]
crush: crushtool --build informative messages
* dump the crush tree created by --build at debug level 1.
* display a warning at debug level 1 if there is more than one root. In
most cases it is not what the user wants and it may be confusing
because the ruleset will only apply to the first root and have fewer
devices under it than expected.
Loic Dachary [Sat, 11 Jan 2014 10:19:51 +0000 (11:19 +0100)]
crush: display args on crushtool failure
When the number of args provided to --build is not a multiple of 3,
display the arguments which do not comply.
For instance the --debug_crush 0 option is not consumed by global_init
in crushtool because, unlike most ceph tools, the arguments are not
passed to global_init. As a result --debug_crush 0 becomes part of the
arguments and triggers the failure.
crushtool --debug_crush 0 --build --num_osds 320 node straw 4
remaining args: [--debug_crush,0,node,straw,4]
layers must be specified with 3-tuples of (name, buckettype, size)
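
The check can be sketched roughly as below. This is not the crushtool source: `check_layer_args()` is a hypothetical helper, and `args` is assumed to hold whatever is left after recognized options such as --build and --num_osds have been consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns an empty string if `args` can be split into (name, buckettype, size)
// 3-tuples, otherwise an error message that lists the offending arguments.
std::string check_layer_args(const std::vector<std::string>& args) {
  if (args.size() % 3 == 0)
    return "";
  std::ostringstream err;
  err << "remaining args: [";
  for (std::size_t i = 0; i < args.size(); ++i) {
    if (i > 0)
      err << ",";
    err << args[i];
  }
  err << "]\n"
      << "layers must be specified with 3-tuples of (name, buckettype, size)";
  return err.str();
}
```

With the five leftover arguments from the example above, the helper reports all of them, which makes the stray --debug_crush 0 easy to spot.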
Loic Dachary [Sat, 11 Jan 2014 10:46:57 +0000 (11:46 +0100)]
crush: parse CEPH_ARGS in crushtool
The arguments are not given to global_init because the -c option would
conflict. Reading arguments from CEPH_ARGS the way other ceph tools do
is the only way to control verbosity (via --debug_crush 0, for instance).
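
As an illustration of reading extra arguments from the environment, here is a sketch under simple assumptions: arguments are separated by whitespace and no quoting is handled. `split_args()` and `args_from_env()` are hypothetical helpers, not the functions Ceph uses.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a whitespace-separated string into individual arguments.
std::vector<std::string> split_args(const std::string& s) {
  std::vector<std::string> out;
  std::istringstream in(s);
  std::string word;
  while (in >> word)
    out.push_back(word);
  return out;
}

// Read the named environment variable (for example "CEPH_ARGS") and split it.
// Returns an empty vector if the variable is not set.
std::vector<std::string> args_from_env(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr)
    return std::vector<std::string>();
  return split_args(value);
}
```

A tool could then scan the returned vector for options such as --debug_crush before handling its own command line.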
Loic Dachary [Sun, 12 Jan 2014 12:47:01 +0000 (13:47 +0100)]
osd: ostream is enough for build_simple*
There is no need to specialize the argument into stringstream. It is
replaced by an ostream which is convenient to display errors directly to
cerr if appropriate.
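
A small sketch of why the more general parameter type is convenient. `build_example()` is a hypothetical function, not one of the build_simple* methods; it only shows that a std::ostream& parameter accepts both a string stream and cerr.

```cpp
#include <cassert>
#include <cerrno>
#include <iostream>
#include <ostream>
#include <sstream>

// Hypothetical builder that reports errors through any std::ostream.
int build_example(int num_osds, std::ostream& err) {
  if (num_osds <= 0) {
    err << "num_osds must be positive, got " << num_osds;
    return -EINVAL;
  }
  return 0;
}
```

A caller that wants to capture the message passes a std::ostringstream, while a command-line tool can pass std::cerr directly.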
Sage Weil [Sat, 28 Dec 2013 20:23:22 +0000 (12:23 -0800)]
mds: require CEPH_FEATURE_OSD_TMAP2OMAP
Require that all OSDs support TMAP2OMAP before starting the MDS. This
avoids doing some work and then crashing with EOPNOTSUPP, and gives us
a more informative message in the logs.
Yan, Zheng [Tue, 24 Dec 2013 00:56:55 +0000 (08:56 +0800)]
mds: use OMAP to store dirfrags
MDS can fetch dirfrags from both TMAP and OMAP. When committing a
dirfrag that is stored in TMAP, MDS first uses OSD_OP_TMAP2OMAP
to convert the corresponding TMAP to OMAP, then updates the resulting
OMAP.
Samuel Just [Fri, 10 Jan 2014 21:23:32 +0000 (13:23 -0800)]
os/DBObjectMap, FileStore: omap_clear should not remove xattrs
Previously, FileStore::_omap_clear() used ObjectMap::clear(), which
incorrectly also blasts any stored xattrs. Instead, add
ObjectMap::clear_keys_header() to handle this case efficiently.
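
The difference between the two operations can be pictured with a toy model. `ObjectMapEntry` is a hypothetical struct made up for this sketch; only the method names clear() and clear_keys_header() come from the description above, and the real DBObjectMap is organized differently.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy per-object record: an omap header, omap keys, and xattrs.
struct ObjectMapEntry {
  std::string header;
  std::map<std::string, std::string> keys;
  std::map<std::string, std::string> xattrs;

  // Removes everything stored for the object, xattrs included.
  void clear() {
    header.clear();
    keys.clear();
    xattrs.clear();
  }

  // Removes only the omap header and keys; xattrs are left untouched.
  void clear_keys_header() {
    header.clear();
    keys.clear();
  }
};
```

In this model an omap clear maps onto clear_keys_header(), so xattrs stored alongside the omap data survive.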
Fixes: #7065
Fixes: #7135
Signed-off-by: Samuel Just <sam.just@inktank.com>
Loic Dachary [Fri, 10 Jan 2014 16:49:21 +0000 (17:49 +0100)]
organizationmap: match authors with organizations
Using the same format as .mailmap, match author names with the
organization sponsoring them, if any. It can be used from the command
line to display git log statistics with results aggregated by company
names.
The git-check-mailmap command that was introduced in git 1.8.4 can be
used to apply .mailmap first and then .organizationmap, using the
normalized author names. For instance: