Boris Ranto [Fri, 15 Aug 2014 17:34:27 +0000 (19:34 +0200)]
Fix -Wno-format and -Werror=format-security options clash
This causes a build failure in the latest Fedora builds. ceph_test_librbd_fsx adds the -Wno-format cflag, but the default AM_CFLAGS already contain -Werror=format-security. Previous releases tolerated this combination, but the latest Fedora rawhide no longer does. ceph_test_librbd_fsx builds fine without -Wno-format on x86_64, so there is likely no need for the flag anymore.
Signed-off-by: Boris Ranto <branto@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Fri, 15 Aug 2014 15:55:10 +0000 (08:55 -0700)]
osd: only require crush features for rules that are actually used
Often there will be a CRUSH rule present for erasure coding that uses the
new CRUSH steps or indep mode. If these rules are not referenced by any
pool, we do not need clients to support the mapping behavior. This is true
because the encoding has not changed; only the expected CRUSH output.
Fixes: #8963
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
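The idea of requiring features only for referenced rules can be sketched as follows. This is an illustration in Python, not Ceph's C++ code: the feature bit `FEATURE_NEW_CRUSH`, the `uses_new_steps` flag, and the dictionary layout are all hypothetical names chosen for the example.

```python
# Hypothetical feature bit; Ceph's real feature flags are not reproduced here.
FEATURE_NEW_CRUSH = 1 << 0


def required_features(rules, pools):
    """Return the feature bits clients need, looking only at rules some pool uses.

    rules: dict mapping rule id -> {"uses_new_steps": bool}  (assumed layout)
    pools: list of {"crush_rule": rule id}                   (assumed layout)
    """
    used_rule_ids = {pool["crush_rule"] for pool in pools}
    features = 0
    for rule_id, rule in rules.items():
        # A rule that no pool references cannot affect any client mapping.
        if rule_id in used_rule_ids and rule["uses_new_steps"]:
            features |= FEATURE_NEW_CRUSH
    return features
```

With a rule set where only rule 1 uses the new steps, a cluster whose pools all reference rule 0 would require no extra feature under this sketch.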
Sage Weil [Wed, 13 Aug 2014 23:17:02 +0000 (16:17 -0700)]
mon/Paxos: share state and verify contiguity early in collect phase
We verify peons are contiguous and share new paxos states to catch peons
up at the end of the round. Do this each time we (potentially) get new
states via a collect message. This will allow peons to be pulled forward
and remain contiguous when they otherwise would not have been able to.
For example, if
If we got mon.1 first and then mon.2 second, we would store the new txns
and then boot mon.1 out at the end because 15..25 is not contiguous with
28..40. However, with this change, we share 26..30 to mon.1 when we get
the collect, and then 31..40 when we get mon.2's collect, pulling them
both into the final quorum.
It also breaks the 'catch-up' work into smaller pieces, which ought to
smooth out latency a bit.
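A rough sketch of sharing state on every collect reply, written in Python for brevity. The `Node` class, the function name, and the leader's starting range of 20..30 are assumptions made for illustration (the message only implies that the leader holds versions up to 30); trimming of the leader's first version is not modeled.

```python
class Node:
    """A monitor holding an inclusive range of committed versions."""

    def __init__(self, name, first, last):
        self.name, self.first, self.last = name, first, last


def on_collect_reply(leader, peon, seen, shared):
    """Handle one collect reply; record every (peon, from, to) range we share."""
    # Absorb newer states from the peon if its range touches ours.
    if peon.last > leader.last and peon.first <= leader.last + 1:
        leader.last = peon.last
    seen.append(peon)
    # Share right away with every peon heard from so far, not at end of round.
    for p in seen:
        if p.last < leader.last and p.last + 1 >= leader.first:
            shared.append((p.name, p.last + 1, leader.last))
            p.last = leader.last


leader = Node("mon.0", 20, 30)   # assumed range
mon1 = Node("mon.1", 15, 25)
mon2 = Node("mon.2", 28, 40)     # assumed to be mon.2's range
seen, shared = [], []
on_collect_reply(leader, mon1, seen, shared)
on_collect_reply(leader, mon2, seen, shared)
```

Under these assumptions, `shared` ends up recording 26..30 and then 31..40 for mon.1, matching the sequence described above.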
Sage Weil [Thu, 14 Aug 2014 23:55:58 +0000 (16:55 -0700)]
mon/Paxos: verify all new peons are still contiguous at end of round
During the collect phase we verify that each peon has overlapping or
contiguous versions as us (and can therefore be caught up with some
series of transactions). However, we *also* assimilate any new states we
get from those peers, and that may move our own first_committed forward
in time. This means that an early responder might have originally been
contiguous, but a later one moved us forward, and when the round finished
they were not contiguous any more. This leads to a crash on the peon
when they get our first begin message.
For example:
- we have 10..20
- first peon has 5..15
  - ok!
- second peon has 18..30
  - we apply this state
  - we are now 18..30
- we finish the round
  - send commit to first peon (empty.. we aren't contiguous)
  - send no commit to second peon (we match)
- we send a begin for state 31
  - first peon crashes (its lc is still 15)
Prevent this by checking at the end of the round if we are still
contiguous. If not, bootstrap. This is similar to the check we do above,
but reverse to make sure *we* aren't too far ahead of *them*.
Fixes: #9053
Signed-off-by: Sage Weil <sage@redhat.com>
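The end-of-round check can be sketched like this, in Python rather than Ceph's C++. The function names and the exact comparison are assumptions; the sketch treats a peon as contiguous when the leader still holds the version right after the peon's last committed one.

```python
def still_contiguous(our_first_committed, peon_last_committed):
    """True if we still hold the version that follows the peon's last commit."""
    return peon_last_committed + 1 >= our_first_committed


def finish_round(our_first_committed, peon_last_committed):
    """Return "bootstrap" if any peon is too far behind us, else "proceed".

    peon_last_committed: dict mapping peon name -> its last committed version.
    """
    for last_committed in peon_last_committed.values():
        if not still_contiguous(our_first_committed, last_committed):
            return "bootstrap"
    return "proceed"
```

In the example above, the leader ends the round at 18..30 while the first peon is still at 15, so `finish_round(18, {"first": 15, "second": 30})` would choose to bootstrap instead of sending a begin the first peon cannot apply.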
Loic Dachary [Tue, 3 Jun 2014 20:20:29 +0000 (22:20 +0200)]
erasure-code: parse function for the mapping parameter
Each D letter is a data chunk. For instance:
_DDD_DDD
is going to parse into:
[ 1, 2, 3, 5, 6, 7 ]
The 0 and 4 positions are not used by chunks and do not show in the
mapping. Implement ErasureCode::parse to support a reasonable default
for the mapping parameter.
Add support for erasure code plugins that do not sequentially map the
chunks encoded to the corresponding index. This is mostly transparent to
the caller, except when it comes to retrieving the data chunks when
reading. For this purpose there needs to be a remapping function so the
caller has a way to figure out which chunks actually contain the data
and reorder them.
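A small sketch of the parsing described above, in Python for illustration. The function names are hypothetical and the behavior follows only what this message states (positions of the D letters); the real ErasureCode::parse is C++ and may do more.

```python
def parse_mapping(mapping):
    """Return the positions of the 'D' (data chunk) letters in the mapping."""
    return [position for position, letter in enumerate(mapping) if letter == "D"]


def pick_data_chunks(mapping, chunks):
    """Select the data chunks, in order, from chunks indexed by position."""
    return [chunks[position] for position in parse_mapping(mapping)]


print(parse_mapping("_DDD_DDD"))  # [1, 2, 3, 5, 6, 7]
```

A caller reading an object would use the parsed positions to pull the data chunks out of the full chunk list and put them back in order, skipping the positions that hold no data.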
Somnath Roy [Mon, 30 Jun 2014 08:28:07 +0000 (01:28 -0700)]
FileStore: Index caching is introduced for performance improvement
IndexManager now has an Index cache. An Index is created only if it is not
found in the cache. Previously, each op created its own Index object, and
other ops requesting the same index had to wait until the previous op was
done. After the lookup finished, the Index object was destroyed.
The Index cache now persists these Indexes, because creating and destroying
them on every op is a major performance hit. A RWlock is introduced in the
CollectionIndex class and is responsible for synchronizing lookup and create.
Also, since these Index objects are persistent, there is no need to use
smart pointers, so Index is now a wrapper class around CollectionIndex*.
It is now the responsibility of the users of an Index to lock it explicitly
before using it. The Index object is sufficient for locking; there is no need
to hold an IndexPath for locking. The function interfaces of lfn_open and
lfn_find are changed accordingly.
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
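The caching pattern can be sketched as below. This is a Python illustration with hypothetical class names, not the FileStore code; in particular, a plain `threading.Lock` stands in for the read/write lock, since Python's standard library has no RW lock.

```python
import threading


class Index:
    """Stand-in for a collection index that carries its own lock."""

    def __init__(self, collection):
        self.collection = collection
        self.lock = threading.Lock()  # simplification of a read/write lock


class IndexCache:
    """Create each Index once and keep it for later lookups."""

    def __init__(self):
        self._lock = threading.Lock()
        self._indexes = {}
        self.created = 0  # counts how many Index objects were built

    def get_index(self, collection):
        with self._lock:
            index = self._indexes.get(collection)
            if index is None:
                index = Index(collection)
                self._indexes[collection] = index
                self.created += 1
            return index


cache = IndexCache()
index = cache.get_index("collection_a")
with index.lock:  # callers lock explicitly before using the index
    pass
```

Repeated calls to `get_index` for the same collection return the same object, so the per-op create and destroy cost goes away and the lock lives with the cached Index.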
Greg Farnum [Mon, 3 Feb 2014 22:36:02 +0000 (14:36 -0800)]
FDCache: implement a basic sharding of the FDCache
This is just a basic sharding. A more sophisticated implementation would
rely on something other than luck for keeping the distribution equitable.
The minimum FDCache shard size is 1.
Signed-off-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
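A basic sharding scheme of this kind might look like the sketch below. It is written in Python with hypothetical names, uses a CRC of the key to pick a shard (an assumption, the hash used by FDCache is not stated here), and reads "minimum shard size" as a per-shard capacity floor of 1. Eviction is omitted.

```python
import zlib


class ShardedCache:
    """Split one cache into several independent shards chosen by key hash."""

    def __init__(self, total_size, num_shards):
        self.num_shards = max(1, num_shards)
        # Each shard gets an equal slice of the total size, never less than 1.
        self.shard_size = max(total_size // self.num_shards, 1)
        self.shards = [dict() for _ in range(self.num_shards)]

    def _shard_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return self.shards[zlib.crc32(key.encode()) % self.num_shards]

    def add(self, key, fd):
        self._shard_for(key)[key] = fd

    def lookup(self, key):
        return self._shard_for(key).get(key)
```

Nothing here balances the shards: an even spread depends entirely on how the keys happen to hash, which is the "luck" mentioned above.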
Greg Farnum [Thu, 30 Jan 2014 22:21:52 +0000 (14:21 -0800)]
shared_cache: expose prior existence when inserting an element
The LRU now handles you attempting to insert multiple values for the
same key, by telling you that you've done so and returning the
existing value before it manages to muck up existing data.
The param 'existed' is not mandatory, default value is NULL.
Signed-off-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
Yehuda Sadeh [Wed, 13 Aug 2014 01:30:03 +0000 (18:30 -0700)]
rgw_admin: add --min-rewrite-stripe-size for object rewrite
A new param to check whether the object requires restriping, by
checking whether a specific object stripe is bigger than the specified
size. By default it is set to 0, in which case the object is always
restriped. Setting it to 4M + 1 makes sure that only the objects
that weren't striped before (using default settings) are restriped.
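The decision can be sketched as a small predicate. This Python example uses hypothetical names and assumes the comparison is "at least the given size"; that reading is chosen so that 4M + 1 skips objects whose stripes are exactly 4M, and it is not confirmed by the message.

```python
MIB = 1024 * 1024


def needs_restripe(stripe_sizes, min_rewrite_stripe_size=0):
    """Decide whether an object should be rewritten.

    stripe_sizes: sizes in bytes of the object's stripes (assumed input).
    """
    if min_rewrite_stripe_size == 0:
        return True  # default: always restripe
    # Assumed comparison: any stripe at least as large as the threshold.
    return any(size >= min_rewrite_stripe_size for size in stripe_sizes)
```

With a threshold of `4 * MIB + 1`, an object already cut into 4M stripes is left alone, while an object stored as one 10M piece is rewritten.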
Anand Bhat [Thu, 14 Aug 2014 04:22:56 +0000 (09:52 +0530)]
OSDMonitor: Do not allow OSD removal using setmaxosd
Description: Currently the setmaxosd command allows removal of OSDs by
providing a number less than the current max OSD number. This removes OSDs
abruptly, causing data loss as well as a kernel panic when kernel RBDs are
involved. The fix is to refuse the change if any OSD in the range between
the new max OSD number and the current max OSD number is part of the cluster.
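The guard can be sketched as a range check. This is a Python illustration with a hypothetical function name; it assumes OSD ids are integers and that max OSD is an exclusive upper bound on ids.

```python
def check_setmaxosd(new_max, current_max, existing_osds):
    """Return (allowed, blocking_ids) for a requested max OSD change.

    existing_osds: ids of OSDs that are part of the cluster (assumed input).
    """
    # Only ids in [new_max, current_max) would be cut off by the change.
    blocking = sorted(osd for osd in existing_osds if new_max <= osd < current_max)
    return (not blocking, blocking)
```

Lowering the limit from 5 to 3 while osd.4 exists is refused under this sketch, and raising the limit is always allowed because the range is empty.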
Yehuda Sadeh [Tue, 12 Aug 2014 20:36:11 +0000 (13:36 -0700)]
rgw: copy object data if target bucket is in a different pool
Fixes: #9039
Backport: firefly
The new manifest does not provide a way to put the head and the tail in
separate pools. In any case, if an object is copied between buckets in
different pools, we may really just want the object to be copied, rather
than reference counted.
Sage Weil [Thu, 14 Aug 2014 00:52:25 +0000 (17:52 -0700)]
msg/PipeConnection: make methods behave on 'anon' connection
The monitor does a create_anon_connection() to create a pseudo Connection
object for forwarded messages. If we try to call mark_down or similar
on one of these we should silently ignore the operation, not crash.
If we try to send a message, still crash (explicitly assert); the caller
should probably know better.
Fixes: #9062
Signed-off-by: Sage Weil <sage@redhat.com>
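The intended behavior can be sketched as follows. This is a Python illustration with a hypothetical `Connection` class, where a missing `pipe` models an anon connection; it is not the PipeConnection code.

```python
class Connection:
    """Connection that may be 'anon', meaning it has no pipe behind it."""

    def __init__(self, pipe=None):
        self.pipe = pipe  # None models an anon connection
        self.down = False

    def mark_down(self):
        if self.pipe is None:
            return  # nothing to tear down: silently ignore
        self.down = True

    def send_message(self, message):
        if self.pipe is None:
            # Sending on an anon connection is a caller bug, so fail loudly.
            raise AssertionError("cannot send on an anon connection")
        self.pipe.append(message)
```

Teardown-style calls become harmless no-ops on an anon connection, while an attempt to send still fails immediately so the calling bug is visible.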
Sage Weil [Wed, 13 Aug 2014 22:05:05 +0000 (15:05 -0700)]
mds/MDSMap: fix incompat version for encoding
Back in 8f7900a09c8e490c9cd3a6f92ed1f0eb1f47f2a9 we added the new fields
before the 'extended' section, which made the encoding incompatible.
Instead, add them at the end--old clients don't care whether the enabled
flag is set or what the 'fs name' is.
Fixes: #8725
Signed-off-by: Sage Weil <sage@redhat.com>