Loic Dachary [Tue, 27 May 2014 16:40:45 +0000 (18:40 +0200)]
erasure-code: implement alignment on chunk sizes
jerasure expects chunk sizes that are aligned on the largest possible
vector size that could be used by SSE instructions, when available
(LARGEST_VECTOR_WORDSIZE == 16 bytes).
For techniques derived from Cauchy, encoding and decoding are done by
subdividing the chunk into packets of packetsize bytes. The operations
are done w * packetsize bytes at a time. It follows that each chunk must
have a size that is a multiple of w * packetsize bytes.
For techniques derived from Vandermonde, it is enough for a chunk to be
a multiple of w * LARGEST_VECTOR_WORDSIZE.
ErasureCodeJerasure::get_alignment returns an alignment constraint on
the object size: the object size must be a multiple of it. The resulting
object size then has to satisfy the chunk constraints described above,
although they have no relationship with K. For Cauchy, this leads to
excessive padding, which makes it impossible to set sensible parameters
when the object size is small.
When the per_chunk_alignement data member is true, the semantics of
ErasureCodeJerasure::get_alignment change: it returns an alignment
constraint on the chunk size, i.e. the chunk size must be a multiple of
it. The ErasureCodeJerasure::get_chunk_size method is modified to use
the new semantics when appropriate.
The jerasure-per-chunk-alignement parameter is parsed to set
per_chunk_alignement for the Vandermonde and Cauchy techniques.
The memory address of a chunk is implicitly aligned to a page boundary
because it is allocated with buffer::create_page_aligned.
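As an illustration of the chunk-level constraints described in this
message, the following Python sketch computes a per-chunk alignment for
each family of techniques and rounds a chunk size up to it. The function
names are hypothetical and this is not the ErasureCodeJerasure code; it
only assumes the rules stated above (multiples of w * packetsize for
Cauchy, of w * LARGEST_VECTOR_WORDSIZE for Vandermonde, with
LARGEST_VECTOR_WORDSIZE == 16).

```python
LARGEST_VECTOR_WORDSIZE = 16  # bytes, as stated in the commit message


def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    remainder = value % multiple
    return value if remainder == 0 else value + multiple - remainder


def cauchy_chunk_alignment(w, packetsize):
    # Operations are done w * packetsize bytes at a time. Keeping the
    # result a multiple of the vector word size is an assumption of
    # this sketch.
    return round_up(w * packetsize, LARGEST_VECTOR_WORDSIZE)


def vandermonde_chunk_alignment(w):
    # A chunk only needs to be a multiple of w * LARGEST_VECTOR_WORDSIZE.
    return w * LARGEST_VECTOR_WORDSIZE


def chunk_size(object_size, k, alignment):
    # Split the object into k data chunks, then pad each chunk up to
    # the per-chunk alignment.
    per_chunk = -(-object_size // k)  # ceiling division
    return round_up(per_chunk, alignment)
```

With k=4 and w=8, for instance, a 4097-byte object under the Vandermonde
rule gives chunk_size(4097, 4, vandermonde_chunk_alignment(8)) == 1152.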
Loic Dachary [Tue, 27 May 2014 16:36:09 +0000 (18:36 +0200)]
erasure-code: cauchy techniques allow w 8,16,32
Enforce the restriction at initialization time, the same way it is done
for Reed Solomon. Choosing a w value other than 8, 16 or 32 leads to
memory corruption that is hard to trace back to its cause.
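A minimal Python sketch of such an initialization-time check follows.
The names ALLOWED_W and check_w are hypothetical, and raising an
exception is only one possible way to reject the value; how the plugin
itself reports the error is not described in this message.

```python
ALLOWED_W = (8, 16, 32)  # word sizes named in the commit message


def check_w(w):
    """Return w if it is supported, otherwise raise ValueError."""
    if w not in ALLOWED_W:
        raise ValueError("w=%d must be one of %s" % (w, ALLOWED_W))
    return w
```

Rejecting the value up front turns a hard-to-diagnose memory corruption
into an immediate, explicit error.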
Zhiqiang Wang [Fri, 1 Aug 2014 08:09:50 +0000 (16:09 +0800)]
osd: add local_mtime to struct object_info_t
This fixes a bug that occurs when the clocks of the OSDs and the clients
are not synchronized (especially when the client is ahead of the OSD):
once the cache tier dirty ratio reaches the threshold, the agent skips
the flush work because it thinks the object is too young.
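The effect of the clock skew can be sketched in Python as follows. The
function too_young, the min_flush_age value and the timestamps are all
hypothetical; the sketch only assumes that the age of an object is
computed by subtracting a stored timestamp from the OSD clock.

```python
def too_young(stamp, osd_now, min_age):
    """Return True when the object was modified less than min_age seconds ago."""
    return osd_now - stamp < min_age


osd_now = 1100.0        # current time according to the OSD
client_mtime = 1300.0   # mtime set by a client whose clock is ahead
local_mtime = 990.0     # OSD clock at the time of the same write
min_flush_age = 60.0

# With the client timestamp the computed age is negative, so the object
# always looks too young and the flush is skipped.
skipped_with_mtime = too_young(client_mtime, osd_now, min_flush_age)

# With a timestamp taken from the OSD clock the age is 110 seconds.
skipped_with_local_mtime = too_young(local_mtime, osd_now, min_flush_age)
```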
Sage Weil [Fri, 1 Aug 2014 03:59:49 +0000 (20:59 -0700)]
osd: do not leak Session* ref in _send_boot()
The get_priv() call returns a ref; make sure we drop it if it exists.
This doesn't happen on every run because usually it is NULL and we take
the other path; it's only after the OSD has been marked down that we reach
the second path.
Ma Jianpeng [Thu, 31 Jul 2014 02:19:32 +0000 (10:19 +0800)]
ReplicatedPG: For async-read, set the real result after completing read.
When reading an object from a replicated pool, ceph uses sync mode,
so it can set the results in execute_ctx correctly.
However, for the async read in an EC pool, the current code did not set
the real results after the read in complete_read_ctx.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Sage Weil [Thu, 31 Jul 2014 18:02:55 +0000 (11:02 -0700)]
mon/OSDMonitor: warn when cache pools do not have hit_sets configured
Give users a clue when cache pools are enabled but the hit_set is not
configured. Note that technically this will work, but not well, so for
now let's just steer them away.
Sage Weil [Thu, 31 Jul 2014 16:13:11 +0000 (09:13 -0700)]
osd/ReplicatedPG: check agent_mode if agent is enabled but hit_sets aren't
It is probably not a good idea to try to run the tiering agent without a
hit_set to inform its actions, but it is technically possible. For
example, one could simply blindly evict when we reach the full point.
However, this doesn't work because the agent mode is guarded by a hit_set
check, even though agent_setup() is not. Fix that.
Sage Weil [Wed, 30 Jul 2014 20:40:33 +0000 (13:40 -0700)]
test/cli-integration/rbd: fix trailing space
Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.
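The normalization can be sketched in Python instead of sed. The helper
name add_trailing_space is hypothetical and the regular expression is an
assumption about what the substitution needs to do, not a copy of the
sed expression used in the test.

```python
import re


def add_trailing_space(text):
    """Append a space after any comma that ends a line."""
    return re.sub(r",$", ", ", text, flags=re.MULTILINE)


# Output in the newer style (no space after the comma) is rewritten,
# while output that already has the trailing space is left unchanged.
new_style = '{\n    "name": "foo",\n    "size": 1\n}'
old_style = '{\n    "name": "foo", \n    "size": 1\n}'
normalized = add_trailing_space(new_style)
```

Because the pattern only matches a comma at the very end of a line, the
substitution is a no-op on output from older versions.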
Backport: firefly
Fixes: #8920
Signed-off-by: Sage Weil <sage@redhat.com>
Loic Dachary [Wed, 18 Jun 2014 22:49:13 +0000 (00:49 +0200)]
erasure-code: create default profile if necessary
After an upgrade to firefly, existing Ceph clusters do not have the
default erasure code profile. Although it may be created with
    ceph osd erasure-code-profile set default
this was not mentioned in the release notes and is confusing for the
administrator.
The *osd pool create* and *osd crush rule create-erasure* commands are
modified to implicitly create the default erasure code profile if it is
not found.
In order to avoid code duplication, the code that creates the default
erasure code profile when a new firefly ceph cluster is created is
encapsulated in the OSDMap::get_erasure_code_profile_default method.
Conversely, handling the pending change in OSDMonitor is not
encapsulated in a function but duplicated instead. If it were a
function, the caller would need a switch to distinguish between three
cases: goto wait, goto reply, or proceed because nothing needs to be
done. It is unclear whether having a function would lead to smaller or
more maintainable code.
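The implicit creation described in this message follows a
get-or-create pattern, sketched here in Python. The dictionary standing
in for the set of profiles, the function name ensure_default_profile and
the placeholder profile content are hypothetical; the real default
values are not given in this message.

```python
def ensure_default_profile(profiles, make_default):
    """Return the 'default' profile, creating it first when it is missing."""
    if "default" not in profiles:
        # Only reached on clusters that lack the profile, e.g. after
        # an upgrade.
        profiles["default"] = make_default()
    return profiles["default"]


def placeholder_default():
    # Placeholder content for illustration only.
    return {"example-key": "example-value"}


upgraded_cluster = {}
profile = ensure_default_profile(upgraded_cluster, placeholder_default)
```

An existing default profile is returned untouched, so calling the helper
from several commands does not overwrite an administrator's settings.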
John Spray [Tue, 22 Jul 2014 01:08:08 +0000 (02:08 +0100)]
mds: handle replaying old format journals
To get back to the reformatting procedure that otherwise
occurs during MDLog::open, introduce an MDLog::reopen call
that MDS can use in the standbyreplay->standby transition
for the special case where the journal is old.
Fixes: #8869
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray [Mon, 21 Jul 2014 17:50:07 +0000 (18:50 +0100)]
mds: refactor MDS boot
* Make boot_start private.
* Define boot stages in enum, replace int with type.
* Merge steps 0 and 1, since step 0 always fell through to step 1.
* starting_done was only ever reached by falling through
  from the previous step, so call it directly from there.
John Spray [Thu, 17 Jul 2014 23:44:38 +0000 (00:44 +0100)]
mds: separate inode recovery queue from MDCache
Refactor to:
* have somewhere to put some logic for doing
  background recovery in future.
* trim a few lines from the oversized MDCache.cc
  wherever we can.