Sage Weil [Fri, 27 Dec 2013 19:15:19 +0000 (11:15 -0800)]
osd: add rados CACHE mode (different from RD and WR)
It is useful to distinguish cache operations from read and modify
operations. Specifically, we will allow cache ops to be sent for
snaps and also allow those ops to result in a write.
Sage Weil [Fri, 27 Dec 2013 01:32:43 +0000 (17:32 -0800)]
osd/ReplicatedPG: update snap_mapper for promoted clones
A clone that comes into existence via promotion takes an entirely
different path than a typical clone (which comes into existence via a
CLONE op in make_writeable()). Make sure snap_mapper is updated
accordingly.
Sage Weil [Thu, 26 Dec 2013 17:19:08 +0000 (09:19 -0800)]
osd/ReplicatedPG: always encode snaps in finish_ctx
On promote we use finish_ctx to build the final log entries, and need to
encode the snaps vector in that case. (Normally this is done by
make_writeable or explicitly by the snap trimmer.)
Sage Weil [Fri, 27 Dec 2013 02:05:22 +0000 (18:05 -0800)]
osd/ReplicatedPG: mirror SnapSet info when promoting head
When we promote the head for an object, get the list of snaps from the
backend pool and construct an appropriate SnapSet. Note that this is
always placed on the head in the cache pool, since we will have a
whiteout object in this case.
Also note that the SnapSet's list of snapids will not include any snaps
for which there were no clones. This is fine, since it is only used for
creating clones, and we've already done that.
Sage Weil [Tue, 24 Dec 2013 16:50:38 +0000 (08:50 -0800)]
osd/ReplicatedPG: include snaps in copy-get results
When promoting a snapped object, we need to also get the set of snaps over
which the clone is defined. This is not strictly available except via the
list-snaps rados call, but that is only used on the snapdir object much
earlier when the head (whiteout) is promoted, and is not conveniently
available now. Adding it to the internal copy-get is not exposed via
librados (copy-get is not exposed at all) so I don't think this is a
problem.
Sage Weil [Tue, 24 Dec 2013 01:25:07 +0000 (17:25 -0800)]
osd/ReplicatedPG: make find_object_context() pass missing_oid
Previously we would return a snapid that we are blocked on if it is
missing. This is necessary because the missing clone does not always
match the logical snap we are trying to read.
Extend this to return a full hobject_t that is the missing object we want.
For the missing clone case, this cleans things up slightly. More
importantly, it lets find_object_context also tell us which on-disk
object is missing that, if it could be promoted, would help.
David Zafman [Thu, 21 Nov 2013 23:21:53 +0000 (15:21 -0800)]
osd: Interim backfill changes
Make peer_backfill_info a map which holds a
BackfillInterval for all backfill targets.
Initially see if recover_backfill() can just backfill
the first one and mark them all finished.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Loic Dachary [Sun, 12 Jan 2014 16:34:52 +0000 (17:34 +0100)]
doc: update the crushtool manual page
* add information about CEPH_ARGS
* rework the --build documentation and example
* add an Author section
* replace vi with emacs for no good reason
* cleanup whitespace
Loic Dachary [Sun, 12 Jan 2014 16:24:39 +0000 (17:24 +0100)]
crush: crushtool --build informative messages
* dump the crush tree created by --build at debug level 1.
* display a warning at debug level 1 if there is more than one root. In
most cases it is not what the user wants and it may be confusing
because the ruleset will only apply to the first root and have fewer
devices under it than expected.
Loic Dachary [Sat, 11 Jan 2014 10:19:51 +0000 (11:19 +0100)]
crush: display args on crushtool failure
When the number of args provided to --build is not a multiple of 3,
display the arguments which do not comply.
For instance the --debug_crush 0 option is not consumed by global_init
in crushtool because, unlike most ceph tools, the arguments are not
passed to global_init. As a result --debug_crush 0 becomes part of the
arguments and triggers the failure.
crushtool --debug_crush 0 --build --num_osds 320 node straw 4
remaining args: [--debug_crush,0,node,straw,4]
layers must be specified with 3-tuples of (name, buckettype, size)
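A minimal sketch of the 3-tuple check described above, not the actual crushtool code. The Layer struct and the parse_layers() helper are hypothetical names used for illustration; only the (name, buckettype, size) grouping and the bracketed, comma-separated "remaining args" format come from the message above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical layer description; field names are illustrative only.
struct Layer {
  std::string name;
  std::string bucket_type;
  int size;
};

// Groups args into (name, buckettype, size) triples. When the argument count
// is not a multiple of 3, returns false and stores the offending arguments in
// *remaining, formatted like the "remaining args" line above.
// Note: std::stoi throws if a size field is not a number.
bool parse_layers(const std::vector<std::string>& args,
                  std::vector<Layer>* layers,
                  std::string* remaining) {
  if (args.size() % 3 != 0) {
    *remaining = "[";
    for (std::size_t i = 0; i < args.size(); ++i) {
      if (i > 0) *remaining += ",";
      *remaining += args[i];
    }
    *remaining += "]";
    return false;
  }
  for (std::size_t i = 0; i < args.size(); i += 3) {
    layers->push_back(Layer{args[i], args[i + 1], std::stoi(args[i + 2])});
  }
  return true;
}
```

With the five leftover arguments from the example above, parse_layers() returns false and the remaining string reads [--debug_crush,0,node,straw,4].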
Loic Dachary [Sat, 11 Jan 2014 10:46:57 +0000 (11:46 +0100)]
crush: parse CEPH_ARGS in crushtool
The arguments are not given to global_init because the -c option would
conflict. Reading arguments from CEPH_ARGS the way other ceph tools do
is the only way to control verbosity (via --debug_crush 0 for instance).
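A rough sketch of reading extra arguments from the CEPH_ARGS environment variable. split_args() and args_from_env() are hypothetical helpers, not the functions crushtool uses, and the sketch assumes plain whitespace-separated arguments with no quoting.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits a CEPH_ARGS-style string on whitespace. Quoting is not handled.
std::vector<std::string> split_args(const std::string& s) {
  std::istringstream in(s);
  std::vector<std::string> out;
  std::string tok;
  while (in >> tok) out.push_back(tok);
  return out;
}

// Returns the arguments found in CEPH_ARGS, or an empty vector if it is unset.
std::vector<std::string> args_from_env() {
  const char* value = std::getenv("CEPH_ARGS");
  return value ? split_args(value) : std::vector<std::string>();
}
```

The returned vector can then be scanned for --debug_crush before the regular command line is processed.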
Loic Dachary [Sun, 12 Jan 2014 12:47:01 +0000 (13:47 +0100)]
osd: ostream is enough for build_simple*
There is no need to specialize the argument into stringstream. It is
replaced by an ostream, which makes it convenient to display errors
directly on cerr if appropriate.
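To illustrate why the more general type is convenient: a function taking std::ostream& accepts both a stringstream and cerr. build_example() below is a hypothetical stand-in, not one of the build_simple* functions, and its error message is invented for the sketch.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a build_simple*-style function that reports
// errors on whatever output stream the caller provides.
bool build_example(int num_osd, std::ostream& errs) {
  if (num_osd <= 0) {
    errs << "num_osd must be positive, got " << num_osd << "\n";
    return false;
  }
  return true;
}
```

A caller that wants to capture the message passes a std::ostringstream; a command line tool can pass std::cerr directly.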
Sage Weil [Sat, 28 Dec 2013 20:23:22 +0000 (12:23 -0800)]
mds: require CEPH_FEATURE_OSD_TMAP2OMAP
Require that all OSDs support TMAP2OMAP before starting the MDS. This
avoids doing some work and then crashing with EOPNOTSUPP, and gives us
a more informative message in the logs.
Yan, Zheng [Tue, 24 Dec 2013 00:56:55 +0000 (08:56 +0800)]
mds: use OMAP to store dirfrags
MDS can fetch dirfrags from both TMAP and OMAP. When committing a
dirfrag that is stored in TMAP, MDS first uses OSD_OP_TMAP2OMAP
to convert corresponding TMAP to OMAP, then updates the resulting
OMAP.
Samuel Just [Fri, 10 Jan 2014 21:23:32 +0000 (13:23 -0800)]
os/DBObjectMap, FileStore: omap_clear should not remove xattrs
Previously, FileStore::_omap_clear() used ObjectMap::clear(), which
incorrectly also blasts any stored xattrs. Instead, add
ObjectMap::clear_keys_header() to handle this case efficiently.
Fixes: #7065
Fixes: #7135
Signed-off-by: Samuel Just <sam.just@inktank.com>
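The difference between the two clearing behaviours can be modelled with a small in-memory sketch. ObjectMeta and clear_everything() are hypothetical; clear_keys_header() borrows only its name from the commit, and its signature here is an assumption rather than the real ObjectMap interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory model of the metadata kept for one object.
struct ObjectMeta {
  std::map<std::string, std::string> omap_keys;
  std::string omap_header;
  std::map<std::string, std::string> xattrs;
};

// Models the old behaviour: everything is dropped, xattrs included.
void clear_everything(ObjectMeta& o) {
  o.omap_keys.clear();
  o.omap_header.clear();
  o.xattrs.clear();
}

// Models the fixed behaviour: only omap keys and header are dropped.
void clear_keys_header(ObjectMeta& o) {
  o.omap_keys.clear();
  o.omap_header.clear();
}
```

After clear_keys_header() the xattrs map is untouched, which is what an omap clear is expected to do.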
Loic Dachary [Fri, 10 Jan 2014 16:49:21 +0000 (17:49 +0100)]
organizationmap: match authors with organizations
Using the same format as .mailmap, match author names with the
organization sponsoring them, if any. It can be used from the command
line to display git log statistics with results aggregated by company
names.
The git-check-mailmap command that was introduced in git 1.8.4 can be
used to use .mailmap first and then .organizationmap using the
normalized author names. For instance:
Greg Farnum [Thu, 9 Jan 2014 22:03:12 +0000 (14:03 -0800)]
FileStore: detect XFS properly
We were only setting m_fs_type = FS_TYPE_XFS if
m_filestore_replica_fadvise was also set -- presumably
the bug fix accidentally blocked off too much of the code type. This
resulted in our xattr counts always being set too low: the store
is mounted (and thus does _detectfs) twice: once as part of the
not-as-conditional-as-it-looks convertfs in ceph_osd.cc, and once
as part of OSD::init().
Sage Weil [Thu, 9 Jan 2014 22:44:49 +0000 (14:44 -0800)]
mon: set next commit in mon command replies
The mon command acks include a version that is used by the client to
determine which version of the map they need to get or wait for in order
to see the effects of their command. Currently we are returning
get_last_committed() everywhere, but we are about to commit something (and
waiting for it), which will increase that value by one. As a result,
clients are always getting epoch/version-1 instead of epoch.
This manifested as a LibRadosTier.Promote test that failed because the
OSD had the OSDMap updates adding the tier and overlay but not the final
map change that set the cache-mode to writeback. I suspect this is also
the cause of spurious errors in the past where we've seen misdirected
request errors that made no sense.
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao@inktank.com>
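The off-by-one can be modelled in a few lines. This is a simplified sketch under assumptions, not the monitor code: MonState, reply_version_old(), reply_version_new() and client_can_proceed() are all hypothetical names, and the client is assumed to proceed once its map version reaches the version in the reply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the monitor state at the time a command is acked.
struct MonState {
  std::uint64_t last_committed;
};

// Old behaviour: reply with the last committed version.
std::uint64_t reply_version_old(const MonState& m) {
  return m.last_committed;
}

// New behaviour: reply with the version the pending commit will produce.
std::uint64_t reply_version_new(const MonState& m) {
  return m.last_committed + 1;
}

// Client side: the map at version `have` is sufficient once it has reached
// the version named in the reply.
bool client_can_proceed(std::uint64_t have, std::uint64_t reply_version) {
  return have >= reply_version;
}
```

With last_committed at 10, the old reply lets a client holding map 10 proceed before the command's effect (map 11) is visible; the new reply makes it wait for 11.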
Yehuda Sadeh [Tue, 7 Jan 2014 02:32:42 +0000 (18:32 -0800)]
rgw: convert bucket info if needed
Fixes: #7110
In dumpling, the bucket info was separated into bucket entry point and
bucket instance objects. When setting bucket attrs we only ended up
updating the bucket instance object. However, pre-dumpling buckets still
keep everything at the entry-point object, so acl changes didn't affect
anything (because we never updated the entry point). This change just
converts the bucket info into the new format.
Ken Dreyer [Thu, 9 Jan 2014 15:55:28 +0000 (08:55 -0700)]
remove spurious executable permissions on files
Fedora's rpmlint complains that some of the source code files in the
tree happen to be executable. Remove the execute bits from these files
to resolve the rpmlint warning.
Signed-off-by: Ken Dreyer <ken.dreyer@inktank.com>
David Zafman [Wed, 11 Dec 2013 01:29:48 +0000 (17:29 -0800)]
osd: Fix problems in ReplicatedPG::do_op() logic
Fix assert(is_degraded_object(soid)) in ReplicatedPG::wait_for_degraded_object()
Use last_backfill_started as the backfill line
Handle uncommon case of multi op source after backfill line and target before
backfill line and !is_degraded_object().
Include backfill line itself for before_backfill (<= instead of <)
Signed-off-by: David Zafman <david.zafman@inktank.com>
David Zafman [Thu, 19 Dec 2013 18:35:39 +0000 (10:35 -0800)]
osd: Remove redundant incompat feature
We can remove this CompatSet bit without worry because the only
way it could have been set is if an erasure coded pool was created,
which isn't supported yet.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Loic Dachary [Wed, 8 Jan 2014 19:13:37 +0000 (20:13 +0100)]
erasure-code: ensure that coding chunks are page aligned
When coding chunks are allocated for jerasure, their address must be
aligned to page boundaries. The requirement is actually to be aligned on
a long long boundary, but bufferlist does not allow for fine tuning of
the alignment.
If padding is necessary because the total size of the data to be encoded
is not a multiple of the alignment requirements as returned by
get_alignment(), the buffer is not only padded but also rebuilt using
rebuild_page_aligned() to preserve the page alignment that is expected
of the input buffer.
The overhead of rebuilding the whole input buffer when padding is
necessary could be reduced by only reallocating one buffer for the last
data chunk, therefore reducing the amount of data being copied. However,
this optimization is not going to be used if the caller takes care of
the padding, which is likely to be the case most of the time.
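The padding arithmetic can be sketched as follows. padded_length() and is_aligned() are hypothetical helpers written for illustration; they do not use bufferlist or the plugin's get_alignment(), and the alignment is assumed to be non-zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds length up to the next multiple of alignment (alignment must be > 0).
std::size_t padded_length(std::size_t length, std::size_t alignment) {
  const std::size_t rem = length % alignment;
  return rem == 0 ? length : length + (alignment - rem);
}

// Reports whether a pointer's address is a multiple of alignment.
bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

When padded_length() returns its input unchanged, no padding is needed and the rebuild described above can be skipped.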
Andreas Peters [Wed, 18 Dec 2013 13:47:58 +0000 (14:47 +0100)]
EC-JERASURE: rewrite region-xor function using vector operations to get ~1.5x speedups for erasure code and guarantee proper 64-bit/128-bit buffer alignment
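For illustration only, here is a portable word-at-a-time XOR. It is not the vectorized function from this commit: it swaps in plain 64-bit words and memcpy in place of vector operations, so it does not depend on buffer alignment. region_xor() and its signature are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// XORs len bytes of src into dst, eight bytes at a time where possible.
void region_xor(const unsigned char* src, unsigned char* dst, std::size_t len) {
  std::size_t i = 0;
  for (; i + sizeof(std::uint64_t) <= len; i += sizeof(std::uint64_t)) {
    std::uint64_t a, b;
    std::memcpy(&a, src + i, sizeof a);  // memcpy avoids unaligned access
    std::memcpy(&b, dst + i, sizeof b);
    b ^= a;
    std::memcpy(dst + i, &b, sizeof b);
  }
  // Handle the tail that does not fill a whole 64-bit word.
  for (; i < len; ++i) dst[i] ^= src[i];
}
```

Processing wider words per iteration is the same idea the commit pushes further with 128-bit vector operations, which is where the alignment guarantee becomes necessary.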
Loic Dachary [Tue, 7 Jan 2014 15:49:44 +0000 (16:49 +0100)]
common: fix large output in unittest_daemon_config
All tests in daemon_config use the global g_ceph_context
object. Introducing an expansion loop in it will impact all tests and
generate a very large output.
Remove the SubstitutionLoop test case which is also covered in
test/common/test_config.cc.