Somnath Roy [Mon, 30 Jun 2014 08:28:07 +0000 (01:28 -0700)]
FileStore: Index caching is introduced for performance improvement
IndexManager now has an Index cache. An Index is only created if it is not
found in the cache. Previously, each op created an Index object, and other
ops requesting the same index had to wait until the previous op was done;
after the lookup finished, the Index object was destroyed.
Now an Index cache persists these Indexes, since creating and destroying
them per op was a major performance hit. An RWLock has been introduced in
the CollectionIndex class and is responsible for synchronization between
lookup and create.
Also, since these Index objects are persistent, there is no need to use
smart pointers, so Index is now a wrapper class around a CollectionIndex*.
It is now the responsibility of the users of Index to lock explicitly
before using them. The Index object alone is sufficient for locking; there
is no need to hold IndexPath for locking. The interfaces of lfn_open and
lfn_find have changed accordingly.
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
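A rough sketch of the pattern this commit describes (all names hypothetical; the real IndexManager/CollectionIndex interfaces differ): a manager-owned cache hands back a persistent index, and each index carries its own RW lock for callers to take explicitly.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical stand-in for CollectionIndex: each index carries its own
// RW lock so callers synchronize lookups (shared) against creates
// (exclusive) themselves, as described above.
struct CollectionIndex {
  std::string coll;
  std::shared_mutex access_lock;  // users of the index lock this explicitly
  explicit CollectionIndex(std::string c) : coll(std::move(c)) {}
};

// Hypothetical IndexManager: indexes persist in the cache, so repeated
// ops on the same collection reuse one object instead of creating and
// destroying it per op.
class IndexManager {
  std::mutex cache_lock;
  std::map<std::string, std::unique_ptr<CollectionIndex>> cache;
 public:
  // Returns a raw pointer (the commit drops smart pointers in favor of a
  // wrapper around a plain CollectionIndex*); the cache owns the object.
  CollectionIndex* get_index(const std::string& coll) {
    std::lock_guard<std::mutex> l(cache_lock);
    auto it = cache.find(coll);
    if (it == cache.end())
      it = cache.emplace(coll, std::make_unique<CollectionIndex>(coll)).first;
    return it->second.get();
  }
};
```

Two ops asking for the same collection now receive the same index object rather than each building and tearing down their own.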
Greg Farnum [Mon, 3 Feb 2014 22:36:02 +0000 (14:36 -0800)]
FDCache: implement a basic sharding of the FDCache
This is just a basic sharding. A more sophisticated implementation would
rely on something other than luck for keeping the distribution equitable.
The minimum FDCache shard size is 1.
Signed-off-by: Greg Farnum <greg@inktank.com> Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
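A minimal sketch of this kind of basic sharding (hypothetical names; the real FDCache stores file descriptors behind more machinery): a key is assigned to a shard by hash, each shard has its own lock and map, and the shard count is clamped to at least 1.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical sketch: distribution across shards depends entirely on
// the hash ("luck"), matching the caveat in the commit message.
class ShardedFDCache {
  struct Shard {
    std::mutex lock;
    std::map<std::string, int> fds;  // object name -> cached fd
  };
  std::vector<Shard> shards;
 public:
  explicit ShardedFDCache(size_t n)
      : shards(n > 0 ? n : 1) {}  // minimum of 1 shard
  size_t shard_of(const std::string& oid) const {
    return std::hash<std::string>{}(oid) % shards.size();
  }
  void put(const std::string& oid, int fd) {
    Shard& s = shards[shard_of(oid)];
    std::lock_guard<std::mutex> l(s.lock);
    s.fds[oid] = fd;
  }
  bool get(const std::string& oid, int* fd) {
    Shard& s = shards[shard_of(oid)];
    std::lock_guard<std::mutex> l(s.lock);
    auto it = s.fds.find(oid);
    if (it == s.fds.end()) return false;
    *fd = it->second;
    return true;
  }
};
```

The win is lock contention: two lookups for objects that hash to different shards no longer serialize on one global cache lock.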
Greg Farnum [Thu, 30 Jan 2014 22:21:52 +0000 (14:21 -0800)]
shared_cache: expose prior existence when inserting an element
The LRU now handles attempts to insert multiple values for the
same key by telling you that you have done so and returning the
existing value, rather than mucking up the existing data.
The 'existed' out-parameter is optional; it defaults to NULL.
Signed-off-by: Greg Farnum <greg@inktank.com> Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>
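A sketch of the interface change (hypothetical simplification of the real SharedLRU, which also does LRU eviction and weak-pointer tracking): add() reports whether a value already existed and, if so, hands back the existing value instead of clobbering it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch: 'existed' is optional (defaults to nullptr),
// matching the commit's "not mandatory" note. Taking ownership of the
// caller's discarded value is an illustrative design choice here.
template <typename K, typename V>
class SharedLRU {
  std::map<K, std::shared_ptr<V>> contents;
 public:
  std::shared_ptr<V> add(const K& key, V* value, bool* existed = nullptr) {
    auto it = contents.find(key);
    if (it != contents.end()) {
      if (existed) *existed = true;
      delete value;            // the new value is discarded...
      return it->second;       // ...and the prior value returned intact
    }
    if (existed) *existed = false;
    auto p = std::shared_ptr<V>(value);
    contents[key] = p;
    return p;
  }
};
```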
Anand Bhat [Thu, 14 Aug 2014 04:22:56 +0000 (09:52 +0530)]
OSDMonitor: Do not allow OSD removal using setmaxosd
Description: Currently the setmaxosd command allows removal of OSDs by
providing a number less than the current max OSD number. This causes abrupt
removal of OSDs, causing data loss as well as kernel panics when kernel
RBDs are involved.
The fix is to refuse the change if any of the OSDs in the range between the
new max OSD number and the current max OSD number is part of the cluster.
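The guard described above can be sketched as follows (hypothetical function and representation; the real check lives in OSDMonitor against the OSDMap):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: before shrinking max_osd down to new_max, refuse
// if any OSD id in [new_max, cur_max) still exists in the cluster.
bool can_shrink_max_osd(const std::vector<bool>& osd_exists, int new_max) {
  int cur_max = static_cast<int>(osd_exists.size());
  for (int id = new_max; id < cur_max; ++id) {
    if (osd_exists[id])
      return false;  // would abruptly remove a live OSD: reject
  }
  return true;
}
```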
Sage Weil [Thu, 14 Aug 2014 00:52:25 +0000 (17:52 -0700)]
msg/PipeConnection: make methods behave on 'anon' connection
The monitor does a create_anon_connection() to create a pseudo Connection
object for forwarded messages. If we try to call mark_down or similar
on one of these we should silently ignore the operation, not crash.
If we try to send a message, still crash (explicitly assert); the caller
should probably know better.
Fixes: #9062 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Wed, 13 Aug 2014 22:05:05 +0000 (15:05 -0700)]
mds/MDSMap: fix incompat version for encoding
Back in 8f7900a09c8e490c9cd3a6f92ed1f0eb1f47f2a9 we added the new fields
before the 'extended' section, which made the encoding incompatible.
Instead, add them at the end; old clients don't care whether the enabled
flag is set or what the 'fs name' is.
Fixes: #8725 Signed-off-by: Sage Weil <sage@redhat.com>
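Why appending at the end is safe can be sketched with a toy length-prefixed encoding (hypothetical; Ceph's real encoding uses versioned bufferlists): an old decoder reads the fields it knows and ignores trailing bytes, whereas a field inserted mid-struct would shift everything the old decoder expects.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical payload: 'enabled' is the new field, appended last.
struct Payload {
  uint32_t epoch = 0;   // old field
  uint8_t enabled = 0;  // new field, appended at the end
};

std::vector<uint8_t> encode(const Payload& p) {
  std::vector<uint8_t> body;
  body.insert(body.end(), (const uint8_t*)&p.epoch,
              (const uint8_t*)&p.epoch + 4);
  body.push_back(p.enabled);  // appended last, after all old fields
  std::vector<uint8_t> out;
  uint32_t len = body.size();
  out.insert(out.end(), (const uint8_t*)&len, (const uint8_t*)&len + 4);
  out.insert(out.end(), body.begin(), body.end());
  return out;
}

// An "old" decoder that predates 'enabled': it reads epoch and simply
// ignores whatever trailing bytes follow inside the length-prefixed body.
uint32_t old_decode_epoch(const std::vector<uint8_t>& buf) {
  uint32_t epoch;
  std::memcpy(&epoch, buf.data() + 4, 4);
  return epoch;
}
```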
Sage Weil [Wed, 13 Aug 2014 17:34:53 +0000 (10:34 -0700)]
osd/ReplicatedPG: only do agent mode calculations for positive values
After a split we can get negative values here. Only do the arithmetic if
we have a valid (positive) value that won't throw the floating point
unit for a loop.
Fixes: #9082 Tested-by: Karan Singh <karan.singh@csc.fi> Signed-off-by: Sage Weil <sage@redhat.com>
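The guard amounts to something like this (hypothetical names; the real agent-mode math in ReplicatedPG uses several such counts):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: only compute the ratio when both counts are
// sane, since post-split bookkeeping can leave a transiently negative
// count that would produce a nonsense ratio.
double dirty_ratio(int64_t num_dirty, int64_t num_user_objects) {
  if (num_user_objects <= 0 || num_dirty < 0)
    return 0.0;  // skip agent mode calculations until counts settle
  return static_cast<double>(num_dirty) / num_user_objects;
}
```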
Sage Weil [Wed, 13 Aug 2014 15:30:25 +0000 (08:30 -0700)]
osd: fix require_same_peer_instance from fast_dispatch
The mark-down of old peers needs to take the session_dispatch_lock in order
to safely clear the Session ref cycle. However, for fast dispatch callers,
that lock is already held. Pass a flag down from the callers indicating
whether we need to take the additional lock.
Fixes: #9096 Signed-off-by: Sage Weil <sage@redhat.com>
Samuel Just [Tue, 12 Aug 2014 19:20:28 +0000 (12:20 -0700)]
ReplicatedPG: do not pass cop into C_Copyfrom
We do not know when the objecter will finally let go of this Context. Thus, we
cannot know whether it will happen before the flush, at which point the
object_context held by the cop must have been released.
Also, we simply don't need it: process_copy_chunk already works in terms
of the tid!
Fixes: #8894 Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Samuel Just <sam.just@inktank.com>
Loic Dachary [Tue, 12 Aug 2014 16:46:29 +0000 (18:46 +0200)]
erasure-code: isa plugin must link with ErasureCode.cc
Otherwise it will not get the methods it needs. A test is added to check
the plugin loads as expected, from the command line. The test is not run
if the isa plugin is not found, which happens on platforms that are not
supported.
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shutdown, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Sage Weil [Mon, 11 Aug 2014 03:22:23 +0000 (20:22 -0700)]
msg/Pipe: do not wait for self in Pipe::stop_and_wait()
The fast dispatch code necessitated adding a wait for the fast dispatch
to complete when taking over sockets back in commit 2d5d3097c3998add1061ce253104154d72879237. This included mark_down()
(although I am not certain mark_down was required to fix the previous set
of races).
In any case, if the fast dispatch thread itself tries to mark down its
own connection, it will deadlock in this method waiting for itself to
return and clear reader_dispatching. Skip this wait if we are in fact
the reader thread. This avoids the deadlock.
Alternatively, we could change mark_down() to not use stop_and_wait(), but
I am less clear about the potential races there, so I'm opting for the
minimal (though ugly) fix.
Fixes: #9057 Signed-off-by: Sage Weil <sage@redhat.com>
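The essence of the fix is a thread-identity check before waiting (hypothetical names; the real Pipe code waits on a condition variable guarded by pipe_lock):

```cpp
#include <cassert>
#include <thread>

// Hypothetical sketch: stop_and_wait() must not wait for the reader to
// clear reader_dispatching when the calling thread *is* the reader, or
// it waits for itself forever.
struct PipeLike {
  std::thread::id reader_id;       // set when the reader thread starts
  bool reader_dispatching = false;
  bool should_wait_for_reader() const {
    // The reader marking down its own connection skips the wait.
    return reader_dispatching && std::this_thread::get_id() != reader_id;
  }
};
```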
Sage Weil [Sun, 10 Aug 2014 19:15:38 +0000 (12:15 -0700)]
ceph_test_rados_api: fix cleanup of cache pool
We can't simply try to delete everything in there because some items may
be whiteouts. Instead, flush+evict everything, then remove overlay, and
*then* delete what remains.
Fixes: #9055 Signed-off-by: Sage Weil <sage@redhat.com>
When OSDMonitor::crush_ruleset_create_erasure checks the ruleset for
existence, it must convert the ruleid into a ruleset before assigning it
back to the *ruleset parameter.
Ma Jianpeng [Thu, 7 Aug 2014 13:33:18 +0000 (21:33 +0800)]
os/chain_xattr: Remove all old xattr entries when overwriting the xattr.
Ceph uses multiple xattrs to store the value of a single xattr whose size
is larger than CHAIN_XATTR_MAX_BLOCK_LEN.
But when overwriting the content of the xattr in
chain_setxattr/chain_fsetxattr, we don't know the size of the previous
content of the xattr.
So we just keep trying to remove entries until the system returns -ENODATA.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
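The removal loop looks roughly like this, with the filesystem xattr calls simulated by an in-memory map so the sketch is self-contained (the chunk-naming scheme shown is illustrative; the real one lives in os/chain_xattr.cc):

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

using XattrStore = std::map<std::string, std::string>;

// Simulated removexattr(2): returns -ENODATA when the attr is absent.
int sim_removexattr(XattrStore& fs, const std::string& name) {
  auto it = fs.find(name);
  if (it == fs.end()) return -ENODATA;
  fs.erase(it);
  return 0;
}

// We don't know how many chunks the previous value occupied, so keep
// removing numbered chunks until the system reports -ENODATA.
void remove_xattr_chain(XattrStore& fs, const std::string& base) {
  for (int i = 0;; ++i) {
    std::string chunk = base + "@" + std::to_string(i);
    if (sim_removexattr(fs, chunk) == -ENODATA)
      break;
  }
}
```

Without this, overwriting a long value with a shorter one would leave stale tail chunks behind, which a later chained read would wrongly append.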
OSD: introduce require_up_osd_peer() function for gating replica ops
This checks both that a Message originates from an OSD, and that the OSD
is up in the given map epoch.
We use it in handle_replica_op so that we don't inadvertently add
operations from down peers, who might or might not know that they are down.
test_librbd_fsx: also flatten as part of randomize_parent_overlap
With randomize_parent_overlap fsx will randomly truncate base images
after they have been cloned from. This throws flatten into the mix:
base image will be flattened with 2/16 chance (equal to the chance of
leaving the image intact).
Samuel Just [Mon, 4 Aug 2014 22:30:41 +0000 (15:30 -0700)]
OSD: move waiting_for_pg into the session structures
Each message belongs to a session. Further, no ordering is implied
between messages which arrived on different sessions. Breaking the
global waiting_for_pg structure into a per-session structure lets
us avoid the problem of taking a write lock on a global structure
(pg_map_lock) in get_pg_or_queue_for_pg at the cost of some
complexity in updating each session's waiting_for_pg structure when
we receive a new map (due to pg splits) or when we locally create
a pg.
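The restructuring can be sketched like this (hypothetical types; the real Session holds OpRequest refs and interacts with pg_map_lock and map epochs): each session owns its own waiting_for_pg map under its own lock, so queueing an op contends only on that session's lock rather than on a global one.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical per-session structure replacing a single global
// waiting_for_pg map guarded by a global pg_map_lock.
struct Session {
  std::mutex lock;
  std::map<int, std::vector<std::string>> waiting_for_pg;  // pgid -> ops
};

void queue_for_pg(Session& s, int pgid, const std::string& op) {
  std::lock_guard<std::mutex> l(s.lock);  // per-session, not global
  s.waiting_for_pg[pgid].push_back(op);
}

// On a new map (e.g. after a pg split) or local pg creation, each
// session's map is drained independently, preserving per-session order.
std::vector<std::string> take_waiters(Session& s, int pgid) {
  std::lock_guard<std::mutex> l(s.lock);
  auto ops = std::move(s.waiting_for_pg[pgid]);
  s.waiting_for_pg.erase(pgid);
  return ops;
}
```

This is exactly the trade the commit message describes: cheaper queueing on the hot path, paid for by having to visit every session when pgs split or are created.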