OSD: introduce require_up_osd_peer() function for gating replica ops
This checks both that a Message originates from an OSD, and that the OSD
is up in the given map epoch.
We use it in handle_replica_op so that we don't inadvertently add operations
from down peers, who might or might not know it.
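A minimal sketch of the check described above, written in Python rather than Ceph's C++. The `Message` and `OSDMap` classes and their fields are hypothetical stand-ins, not Ceph's actual types:

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    source_type: str  # hypothetical: "osd", "client", "mon", ...
    source_id: int


@dataclass
class OSDMap:
    epoch: int
    up_osds: set = field(default_factory=set)  # ids of OSDs up in this epoch


def require_up_osd_peer(msg, osdmap):
    """Return True only if msg comes from an OSD that is up in osdmap."""
    if msg.source_type != "osd":
        return False
    return msg.source_id in osdmap.up_osds
```

A replica-op handler would drop the message when this returns False instead of queueing it.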
Samuel Just [Mon, 4 Aug 2014 22:30:41 +0000 (15:30 -0700)]
OSD: move waiting_for_pg into the session structures
Each message belongs to a session. Further, no ordering is implied
between messages which arrived on different sessions. Breaking the
global waiting_for_pg structure into a per-session structure lets
us avoid the problem of taking a write lock on a global structure
(pg_map_lock) in get_pg_or_queue_for_pg at the cost of some
complexity in updating each session's waiting_for_pg structure when
we receive a new map (due to pg splits) or when we locally create
a pg.
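The per-session layout can be sketched as follows. This is an illustrative Python model, not the OSD code; the `Session` class, its methods, and `wake_all` are assumptions made for the example:

```python
from collections import defaultdict, deque


class Session:
    def __init__(self):
        # pgid -> ops from this session waiting for that pg, in arrival order
        self.waiting_for_pg = defaultdict(deque)

    def queue_for_pg(self, pgid, op):
        # Touches only this session's state; no global structure is locked.
        self.waiting_for_pg[pgid].append(op)

    def wake_pg_waiters(self, pgid):
        """Return and forget the ops waiting on pgid, preserving their order."""
        return list(self.waiting_for_pg.pop(pgid, ()))


def wake_all(sessions, pgid):
    """When pgid appears locally, drain every session's queue for it."""
    return [op for s in sessions for op in s.wake_pg_waiters(pgid)]
```

Ordering is preserved within a session only, which matches the statement that no ordering is implied across sessions.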
Samuel Just [Tue, 5 Aug 2014 20:00:01 +0000 (13:00 -0700)]
OSD: fix wake_pg_waiters revert error in _open_lock_pg
231fe1b685bfbd3db9c81709ca39a29d696b13ad erroneously reintroduced
this call to wake_pg_waiters. All _create_lock_pg callers already
call wake_pg_waiters after the pg lock has been dropped.
Fixes: #8691
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Fri, 1 Aug 2014 21:04:35 +0000 (14:04 -0700)]
osd_types: s/stashed/rollback_info_completed and set on create
Originally, this flag indicated that the object had already been stashed and
that therefore recording subsequent changes is unnecessary. We want to set it
on create() as well since operations like [create, writefull] should not need
to stash the object.
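A rough Python model of the flag's effect. The `ModDesc` class and its methods are hypothetical, and the idea that a create needs no further rollback information is an assumption drawn from the description above:

```python
class ModDesc:
    """Hypothetical stand-in for the per-object rollback description."""

    def __init__(self):
        self.rollback_info_completed = False
        self.ops = []

    def create(self):
        if self.rollback_info_completed:
            return
        self.ops.append("create")
        # Assumption: once a create is recorded, later changes in the same
        # transaction need no extra rollback information.
        self.rollback_info_completed = True

    def stash(self):
        if self.rollback_info_completed:
            return  # rollback info is already complete, skip the stash
        self.ops.append("stash")
        self.rollback_info_completed = True
```

With this model, a [create, writefull] sequence records only the create, while a writefull on an existing object still records a stash.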
Fixes: #8625
Signed-off-by: Samuel Just <sam.just@inktank.com>
In lookup_pool and pool_delete, a lock is taken
before invoking wait_for_osdmap, but it is not
released when that call fails. Release it on the failure path.
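The shape of the fix can be sketched in Python. The `Client` class, the error value, and the placeholder pool id are invented for illustration; only the names lookup_pool and wait_for_osdmap come from the message above:

```python
import threading


class Client:
    def __init__(self, osdmap_ok):
        self.lock = threading.Lock()
        self._osdmap_ok = osdmap_ok

    def wait_for_osdmap(self):
        # Hypothetical convention: 0 on success, negative errno on failure.
        return 0 if self._osdmap_ok else -110

    def lookup_pool(self, name):
        self.lock.acquire()
        try:
            r = self.wait_for_osdmap()
            if r < 0:
                return r  # early return: the finally clause still unlocks
            return 1  # placeholder pool id
        finally:
            self.lock.release()
```

Releasing in a `finally` clause covers every exit path, including the early return that previously leaked the lock.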
Sage Weil [Thu, 7 Aug 2014 00:28:45 +0000 (17:28 -0700)]
os/FileStore: force any new xattr into omap on E2BIG
If we have a huge xattr (or many little ones), the _fgetattrs() for the
inline_set will fail with E2BIG. The conditions later where we decide
whether to clean up the old xattr will then also fail. We *will* put
the xattr in omap, but the non-omap version isn't cleaned up.
Fix this by setting a flag if we get E2BIG that the inline_set is known
to be incomplete. In that case, take the conservative step of assuming
the xattr might be present and chain_fremovexattr(). Ignore any error
because it might not be there.
This is clearly harmless in the general case because it won't be there.
If it is, we will hopefully remove enough xattrs that the E2BIG
condition will go away (usually by removing some really big chained
xattr).
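A hedged Python sketch of the flag logic. The function name and the two callables are hypothetical stand-ins for the FileStore internals; only the E2BIG handling mirrors the description:

```python
import errno


def move_xattr_to_omap(name, list_inline, remove_inline):
    """Clean up the inline copy of xattr `name` once it lives in omap.

    list_inline() returns the set of inline xattr names or raises
    OSError(E2BIG); remove_inline(name) raises OSError if the xattr is
    absent. Both callables are assumptions made for this example.
    """
    incomplete = False
    try:
        inline_set = list_inline()
    except OSError as e:
        if e.errno != errno.E2BIG:
            raise
        # The listing failed, so the inline set is known to be incomplete.
        inline_set, incomplete = set(), True

    if name in inline_set or incomplete:
        try:
            remove_inline(name)
        except OSError:
            if not incomplete:
                raise  # it was listed, so removal should have worked
            # Conservative removal: the xattr may simply not have been there.
```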
See original bug #7779. With this in place, we can repair objects in
the broken state if we know the rados attr(s) that are responsible.
Usually that is user.rgw.manifest, and a rados get + set of the attr
will repair things.
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Owen Synge [Tue, 5 Aug 2014 15:28:16 +0000 (17:28 +0200)]
Do not make directories by mistake.
Rationale: I found I had created a series of OSD directories under "/dev/" when disks I thought existed did not exist.
Warning: This change will be noticed by end users and may affect deployment infrastructures.
Sage Weil [Mon, 4 Aug 2014 21:57:28 +0000 (14:57 -0700)]
osd: reorder OSDService methods under proper dout_prefix macro
The dout_prefix for OSDService uses get_osdmap() to grab a shared_ptr for
the epoch printout. The OSD one does not, and is not safe to run in all
thread contexts.
In particular, update_osd_stat() is run by the heartbeat thread and can
race with the shared_ptr itself being updated with a new map.
Ironically, if this were simply an OSDMap*, there would be no race since
the pointer is a single word and updates atomically.
Fix this, and any similar issues, by moving the OSDService methods up in
OSD.cc so that they use the safe dout macro.
Fixes: #8998
Backport: firefly (in a minimal form, I think!)
Signed-off-by: Sage Weil <sage@redhat.com>
rgw: call processor->handle_data() again if needed
Fixes: #8937
Following the fix to #8928 we end up accumulating pending data that
needs to be written. Beforehand it was working fine because we were
feeding it with the exact amount of bytes we were writing.
Loic Dachary [Fri, 30 May 2014 13:24:25 +0000 (15:24 +0200)]
erasure-code: HTML display of benchmark results
The ceph_erasure_code_benchmark output is converted into a JSON series
suitable to display in HTML with the http://www.flotcharts.org/
library. A self contained copy of the HTML,JS,CSS files is included for
durability and can be used from the source tree with:
CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark \
PLUGIN_DIRECTORY=src/.libs \
qa/workunits/erasure-code/bench.sh fplot jerasure |
tee qa/workunits/erasure-code/bench.js
Loic Dachary [Tue, 27 May 2014 19:45:19 +0000 (21:45 +0200)]
COPYING: Cloudwatt copyright is inline
Remove partial list of contributions since Cloudwatt copyright has been
placed in the copyright notices of the files where works covered by
copyright have been included.
Loic Dachary [Tue, 27 May 2014 17:25:22 +0000 (19:25 +0200)]
erasure-code: rework benchmark suite
Expand the default suite to enumerate all cases that are relevant to the
current code base so that it is easier to consume. Namely it means
* iterating over object sizes of 4KB (what is used by default) and
1MB (what was previously benchmarked)
* grouping results in series that would make sense to plot to get the
behavior of a given technique for a series of K/M values and all
possible erasures.
Instead of specifying the iterations to run, set the size of the total
data set to be exercised and compute the iterations by dividing it by
the object size. Since the object size varies, it is impractical to
preset the number of iterations and get meaningful results.
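The arithmetic is a plain division. In this Python sketch the function name and the 4 MiB total are assumptions chosen for illustration, not values taken from the benchmark script:

```python
def iterations_for(total_size, object_size):
    """Number of benchmark iterations needed to cover total_size bytes."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    return total_size // object_size


TOTAL_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB data set

for object_size in (4 * 1024, 1024 * 1024):
    print(object_size, iterations_for(TOTAL_SIZE, object_size))
```

With a fixed data set, small objects get many iterations and large objects get few, so each run exercises the same amount of data.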
The PARAMETERS environment variable is added to enable the caller to
inject --parameter jerasure-variant=generic, for instance.
The packet size is calculated based on the other parameters. The
options are limited when packets are small (4KB) and it would not make a
real difference to give control over it. The packet size is capped to
a maximum of 3100 bytes which is roughly what has been found to be an
optimal value for large packets (1MB).
Loic Dachary [Fri, 30 May 2014 12:33:15 +0000 (14:33 +0200)]
erasure-code: control jerasure plugin variant selection
The jerasure-variant parameter is interpreted as the name of the plugin
variant to be loaded regardless of the available CPU features. The
values can be sse3, sse4, generic. It is undocumented and meant for
benchmarking purposes, primarily to force the generic plugin to be
loaded when the sse4 would be chosen.
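A sketch of the selection rule in Python. The function, the feature set, and the auto-detection order (sse4, then sse3, then generic) are assumptions; only the parameter name and its three values come from the message above:

```python
VARIANTS = ("sse3", "sse4", "generic")


def select_variant(cpu_features, parameters):
    """Pick the plugin variant, letting jerasure-variant override detection."""
    forced = parameters.get("jerasure-variant")
    if forced is not None:
        if forced not in VARIANTS:
            raise ValueError("unknown jerasure-variant: %r" % forced)
        return forced  # honored regardless of the available CPU features
    # Assumed detection order when nothing is forced.
    for variant in ("sse4", "sse3"):
        if variant in cpu_features:
            return variant
    return "generic"
```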
Loic Dachary [Tue, 27 May 2014 16:40:45 +0000 (18:40 +0200)]
erasure-code: implement alignment on chunk sizes
jerasure expects chunk sizes that are aligned on the largest possible
vector size that could be used by SSE instructions, when available
(LARGEST_VECTOR_WORDSIZE == 16 bytes).
For techniques derived from Cauchy, encoding and decoding is done by
subdividing the chunk into packets of packetsize bytes. The operations
are done w * packetsize bytes at a time. It follows that each chunk must
have a size that is a multiple of w * packetsize bytes.
For techniques derived from Vandermonde, it is enough for a chunk to be
a multiple of w * LARGEST_VECTOR_WORDSIZE.
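The two chunk-size rules above can be written down directly. This Python sketch uses hypothetical function names and technique labels; the constant and the two formulas are the ones stated in the text:

```python
LARGEST_VECTOR_WORDSIZE = 16  # bytes, as stated above


def chunk_multiple(technique, w, packetsize=0):
    """Smallest unit a chunk size must be a multiple of."""
    if technique == "cauchy":
        return w * packetsize
    if technique == "vandermonde":
        return w * LARGEST_VECTOR_WORDSIZE
    raise ValueError("unknown technique: %r" % technique)


def is_valid_chunk_size(chunk_size, technique, w, packetsize=0):
    return chunk_size % chunk_multiple(technique, w, packetsize) == 0
```

For w = 8, a Vandermonde chunk must be a multiple of 128 bytes, while a Cauchy chunk with a packetsize of 2048 must be a multiple of 16384 bytes.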
ErasureCodeJerasure::get_alignment returns a size alignment constraint
of which the object size must be a multiple. The resulting
object size then has to satisfy the chunk constraints described above,
although they have no relationship with K. For Cauchy, this leads to
excessive padding, making it impossible to set sensible parameters
when the object size is small.
When the per_chunk_alignement data member is true, the semantics of
ErasureCodeJerasure::get_alignment change: it returns a size alignment
constraint of which the chunk size must be a multiple. The
ErasureCodeJerasure::get_chunk_size method is modified to use the new
semantics when appropriate.
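Under per-chunk alignment, one plausible way to compute the chunk size is to split the object into K parts and round each part up to the alignment. The Python sketch below is an assumption about the arithmetic, not the body of get_chunk_size:

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def per_chunk_size(object_size, k, alignment):
    """Chunk size when the alignment applies to each chunk, not the object."""
    chunk = -(-object_size // k)  # ceiling division: bytes per data chunk
    return round_up(chunk, alignment)
```

For a 4096-byte object with K = 4 and a 128-byte alignment this gives 1024 bytes per chunk with no padding; with K = 3 it gives 1408 bytes, so padding is bounded by one alignment unit per chunk.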
The jerasure-per-chunk-alignement parameter is parsed to set
per_chunk_alignement for the Vandermonde and Cauchy techniques.
The memory address of a chunk is implicitly aligned to a page boundary
because it is allocated with buffer::create_page_aligned.
Loic Dachary [Tue, 27 May 2014 16:36:09 +0000 (18:36 +0200)]
erasure-code: cauchy techniques allow w 8,16,32
Enforce the restriction at initialization time, the same way it is done
for Reed Solomon. Choosing a w value different from 8,16,32 will lead to
memory corruption that cannot easily be traced to the cause.
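The initialization-time check amounts to a membership test. A minimal Python sketch with a hypothetical function name:

```python
ALLOWED_W = (8, 16, 32)


def check_w(w):
    """Reject unsupported w values up front instead of failing later."""
    if w not in ALLOWED_W:
        raise ValueError("w=%d must be one of %s" % (w, ALLOWED_W))
    return w
```

Failing early with a clear error is the point: a bad w is rejected before any encoding buffer is touched.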