Matt Benjamin [Tue, 30 Dec 2014 21:51:00 +0000 (16:51 -0500)]
Dequeue XioMsg on send-fail
If a message send hard-fails, remove it from the send_q. Leaving it
queued triggers an assert when the queue is in safe mode; with safe
mode disabled, send_q would be corrupted.
Don't fall through and erase the iterator twice. Continue the loop,
as in the incoming release case.
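The erase-and-continue pattern this commit describes can be sketched as below. This is only an illustration under stated assumptions: Msg, send_fails, and drain_send_q are hypothetical names, and a std::list stands in for the real send_q, whose type is not shown in this log.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Hypothetical stand-in for a queued message; not the real XioMsg.
struct Msg {
  int id;
  bool send_fails;  // simulates a hard send failure
};

// Walk the queue and remove every entry exactly once. On a hard send
// failure the entry is erased and the loop continues, so control never
// falls through to the normal-path erase below.
std::vector<int> drain_send_q(std::list<Msg>& send_q) {
  std::vector<int> failed;
  for (auto it = send_q.begin(); it != send_q.end();) {
    if (it->send_fails) {
      failed.push_back(it->id);
      it = send_q.erase(it);  // erase() returns the next valid iterator
      continue;               // skip the second erase
    }
    // Normal path: the message was handed to the transport; dequeue it.
    it = send_q.erase(it);
  }
  return failed;
}
```

Reassigning the iterator from erase() keeps it valid, and the continue makes the single-erase guarantee explicit.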
Matt Benjamin [Tue, 16 Dec 2014 16:38:06 +0000 (11:38 -0500)]
Reduce lock spam in XioPortal SubmitQueue.
In SubmitQueue::deq, scan up to nlanes, exit immediately with
found work, recalling last lane.
Yield to Accelio iff a full scan finds no work.
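A minimal sketch of the scan described above, assuming one mutex-protected deque per lane and a single consumer thread. LaneQueue and its members are illustrative names, not the actual SubmitQueue interface.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical lane-based queue; one lock per lane.
class LaneQueue {
  struct Lane {
    std::mutex mtx;
    std::deque<int> items;
  };
  std::vector<Lane> lanes_;
  std::size_t last_lane_ = 0;  // touched only by the single consumer

public:
  explicit LaneQueue(std::size_t nlanes) : lanes_(nlanes) {
    assert(nlanes > 0);
  }

  void enq(std::size_t lane, int item) {
    Lane& l = lanes_[lane % lanes_.size()];
    std::lock_guard<std::mutex> g(l.mtx);
    l.items.push_back(item);
  }

  // Scan at most nlanes lanes, starting at the lane where work was last
  // found, and return as soon as an item is dequeued. A false return
  // means a full scan found no work, so the caller may yield.
  bool deq(int& out) {
    const std::size_t n = lanes_.size();
    for (std::size_t i = 0; i < n; ++i) {
      const std::size_t ix = (last_lane_ + i) % n;
      Lane& l = lanes_[ix];
      std::lock_guard<std::mutex> g(l.mtx);
      if (!l.items.empty()) {
        out = l.items.front();
        l.items.pop_front();
        last_lane_ = ix;  // recall the lane for the next call
        return true;
      }
    }
    return false;
  }
};
```

Each call takes at most one lock per lane and stops at the first hit, which is the source of the reduced lock traffic.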
xio: avoid starving run loop, don't stop loop in shutdown()
Fix 2 issues flagged in review by Alex Rosenbaum.
* In the XioPortal main loop, the recent change to reduce lock contention
also made it easier to starve Accelio under steady send work. Restore
the original behavior.
* Remove the call to xio_context_stop_loop() in XioPortal::shutdown(),
to ensure that Accelio can finish cleaning up.
Matt Benjamin [Mon, 15 Dec 2014 19:53:09 +0000 (14:53 -0500)]
xio: initial mark_* and queueing/flow control
This change implements explicit support for Accelio sender-side
flow control, which requires queuing messages for later delivery
when the connection is ready to send.
This requirement to queue messages for later delivery, and related
connection state logic, is substantially shared with new session
reset behavior, so we've pulled a subset of that logic forward.
Again due to shared implementation logic, this change also adds
implementations of mark_down(), mark_down_all(), mark_disposable(),
and related methods from Messenger, which were required to be
implemented after Hammer.
Add XioSubmit.h.
For now, start at state UP, READY.
When considering if a flow-controlled connection can be unblocked,
consider only the computed queue depth. Re-activate and flush the
connection iff the computed queue depth <= 1/2 of the queue
high-water mark.
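The unblock rule above reduces to a single comparison. The sketch below assumes a per-connection struct with hypothetical field names (FlowState, flow_controlled); only q_high_mark and the queue depth are named in this log.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-connection flow-control state.
struct FlowState {
  uint32_t q_high_mark;   // queue high-water mark
  uint32_t queue_depth;   // computed queue depth
  bool flow_controlled;   // true while the connection is blocked
};

// A blocked connection may be re-activated and flushed only when the
// computed queue depth is at most half of the high-water mark.
bool can_unblock(const FlowState& s) {
  // Widen before doubling so the comparison cannot overflow.
  return s.flow_controlled &&
         static_cast<uint64_t>(s.queue_depth) * 2 <= s.q_high_mark;
}
```

Comparing 2 * depth against the mark avoids rounding questions when the high-water mark is odd.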
Placeholder added for byte-throttled case.
Fix lock flags abuse (found by Casey).
Discard deferred and unsent messages on unplanned disconnect.
The change causes discard_input_queue() to be called in Accelio's
on_disconnect_event() handler, as well as on mark_down().
xio: Change new established connection's state to up and ready
Change the newly established passive connection's state to up and
ready, then flush all pending msgs in input_queue.
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Vu Pham [Thu, 11 Dec 2014 14:28:26 +0000 (06:28 -0800)]
xio: Enable Accelio flow control with msgs and bytes throttlers
* Enable Accelio flow control in general
* Read out policy for messages and bytes throttlers from connection's peer_type
* Set Accelio connection flow control with policy throttlers or default values
* Set q_high_mark for xio_connection (80% of queue_depth)
xio: Correct q_high_mark setting
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Matt Benjamin [Tue, 23 Dec 2014 21:46:47 +0000 (16:46 -0500)]
Ceph Accelio/RDMA Transport (XioMessenger).
XioMessenger implements a Ceph Messenger provider for Accelio,
a high-performance messaging transport by Mellanox. Current
Accelio is layered on ibverbs, and supports InfiniBand, RoCE,
and other RDMA transports. Future Accelio versions will support
alternative transports (including TCP), and flexible transport
selection.
config: cluster_rdma drives messenger creation
ceph_mds ceph_mon and ceph_osd use XioMessengers for cluster
communication when cluster_rdma is set
Move XioMessenger to msg/xio.
This matches the other new Messenger locations.
test: tests for tcp and xio messengers
(Not tests only.)
buffer: add subclass for xio buffers
xio: convert to Connection::send_message interface
config: -x, --xio as aliases for client_rdma
ceph-fuse: create xio messenger if client_rdma
Find XioMessenger.h and QueueStrategy.h in msg/xio.
ceph-syn: create xio messenger if client_rdma
librados: create xio messenger if client_rdma
Find XioMessenger.h and QueueStrategy.h in msg/xio.
Restore non-abort from Xio Mon integration.
Fix xio_client send count, again.
xio: must signal cond under mutex lock
xio: dispatch strategies support ms_fast_dispatch
xio: config variable xio_port_shift
remove set_port_shift() from XioMessenger, and just use the value
from the configuration
xio: don't depend on g_ceph_context for dout
XioMessenger now uses its own cct for all logging operations
the accelio log function, however, still depends on a global
CephContext. so we maintain an extra one, separate from g_ceph_context,
in XioMessenger.cc that is initialized on first construction and a
reference is held indefinitely
script: cephfsnew to automate pool and fs creation
Use new on_ow_msg_send_complete hook.
Replaces on_msg_delivered for one-way message style.
Prototype new xio_discon behavior.
On shutdown, XioPortal threads should not exit before Accelio
finalizes all sessions.
Inline join_sessions, it needs sh_mtx held across wait loop.
Fix assert on Cond::Signal. Adds Cond2.
Avoid deadlock, xio_disconnect can deliver a session teardown event.
Also Mutex2.
(Note, Mutex2 and Cond2 are replaced by standard C++ downstream.)
Restore SimpleDispatcher Timings.
The simple_client/simple_server timings are based on a ping/pong
of messages between the client and server, unlike those of the
xio_client/server programs, which are one-way (so their corresponding
1-way bandwidth is appx. 2x what the test reports). We assert
that the results are in general comparable, because in both setups,
a fixed number of messages (def. 50) is maintained in flight.
Wrap Accelio mempool in XioPool, add stats.
To enable stat prints, set xio_trace_mempool. Currently, stats print
to stdout every 64K messages sent or received.
Restore _send_message(..).
Fix merge errors in simple_client, simple_dispatcher.
xio: fix for size in pool stats
Add in/outbound msg counters to XioPoolStats.
Pool stats are easier to read.
Pool stats are easier to read, and if enabled, print on session
teardown. This is a convenient time to view stats, and with a small
Make pool stats counters atomic.
Track requests using hook ctor/dtor.
Lockless, portal thread provides atomicity.
Adapt to recent changes on Accelio for_next
* Accelio options now of opaque type
* on_msg_err with extra direction param
* RDMA behavior now governed by 2 new options
XIO_OPTNAME_MAX_INLINE_DATA
XIO_OPTNAME_MAX_INLINE_HEADER
* Separated send and recv queue depth
xio_messenger: Change xio optname queue depth msgs
* Set 16k threshold to rdma buffers instead of send
* Change xio optname for queue depth msgs
XIO_OPTNAME_SND/RCV_QUEUE_DEPTH_MSGS
xio_messenger: Protect Accelio queue depth.
(Minimal send flow control.)
The guard is per xio_connection, and considering batches.
Increment happens only if xio_send_msg succeeded, decrement in
on_ow_msg_send_complete and on_msg_error. Note that we don't need
atomics because counters are touched only in the correct portal
thread.
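The per-connection guard described above might look roughly like this. ConnGuard and its method names are hypothetical; the sketch assumes, as the commit states, that only the owning portal thread touches the counter, so it is a plain integer and not an atomic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-connection send guard, used from one thread only.
class ConnGuard {
  uint32_t max_depth_;
  uint32_t in_flight_ = 0;

public:
  explicit ConnGuard(uint32_t max_depth) : max_depth_(max_depth) {}

  // Check a whole batch against the remaining queue depth.
  bool can_send(uint32_t batch) const {
    return static_cast<uint64_t>(in_flight_) + batch <= max_depth_;
  }

  // Call only after the transport accepted the batch.
  void on_send_ok(uint32_t batch) { in_flight_ += batch; }

  // Call from the send-complete and message-error paths, once per message.
  void on_complete_or_error() {
    assert(in_flight_ > 0);
    --in_flight_;
  }

  uint32_t in_flight() const { return in_flight_; }
};
```

Incrementing only after a successful send keeps the counter from drifting when a send is rejected.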
Find XioMsg.h in msg/xio
Find XioMessenger.h and QueueStrategy.h in msg/xio (tests).
Adapt to 2 Accelio API changes.
1. xio_context_stop_loop takes only 1 argument
2. xio_connect() now takes a structure argument, by reference
Set CMP0046 iff CMake version >= 3
Move XioMessenger to msg/xio
xio: fix for segfault on xio_connect()
No more Mutex2, Cond2.
xio: number of portal threads is configurable
xio: only create additional portals on bind()
xio: use QueueStrategy(1) as default
xio: Messenger factory accepts ms_type "xio"
xio: use ms_type instead of client,cluster_rdma
removing the ability to configure the client and cluster networks
separately in favor of a single global messenger type
--xio is now a command-line alias for --ms_type xio
all daemons now use the Messenger::create() factory function instead of
conditionally creating XioMessengers
the OSD and Monitor classes no longer need separate messengers to
deal with both tcp/rdma clients
xio: portal binding honors ms_bind_port_min,max
xio: remove xio_port_shift
port shifting is no longer necessary, because we won't create both tcp
and xio messengers for the same service
Use Accelio sglist helper macros.
xio: make xio buffer unshareable
xio: Nuke special_handling.
Replace GENERIC with MON (requested by Sage).
Signed-off-by: Casey Bodley <casey@cohortfs.com>
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Add new header_crc and data_crc configuration booleans, and use
them consistently to govern whether CRC is performed in the
Message encode, decode, and transit paths.
Remove ms_nocrc, changes per Sage.
Minimally adapt AsyncMessenger for crcflags.
Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Ali Maredia [Mon, 23 Jun 2014 22:32:32 +0000 (18:32 -0400)]
Combined CMake Build for Hammer
CMake Ceph Build System (Firefly)
CMake. Add tests.
Respace src/CMakeLists.txt.
CMake. Spacing cleanups.
CMake for Firefly is Triumphant
CMake for Giant
Adapt to Giant.
Fix installation for scripts and man pages
Fix CEPH_LIBDIR and CEPH_PKGLIBDIR defines
Add erasure-code libraries
uses try_compile() to detect support for -msse flags
Fix rados object classes
Propagate Casey's cls library change to src/test.
Fix CMake build for Hammer.
Try-add rados and common to librbd link.
Fix name and linkage of libec_lrc.
Rename arch/neon.c arm.c
Fix libcommon.a dependencies (some unit tests).
Authors:
Ali Maredia <ali@cohortfs.com>
Casey Bodley <casey@cohortfs.com>
Adam Emerson <aemerson@cohortfs.com>
Marcus Watts <mdw@cohortfs.com>
Matt Benjamin <matt@cohortfs.com>
Jason Dillaman [Tue, 13 Jan 2015 04:17:50 +0000 (23:17 -0500)]
librbd: flush pending AIO requests under all existing flush scenarios
AIO requests that are waiting on the image lock should be flushed
during all existing RBD flush scenarios. A few flush cases were
missed in the original implementation.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 3 Nov 2014 21:51:06 +0000 (16:51 -0500)]
librbd: differentiate between R/O vs R/W RBD features
The new RBD exclusive lock feature should be treated as a
feature that is only applied when the image is opened in
R/W mode.
Older clients will need to handle the updated
cls_rbd::get_features method in order to properly determine
the incompatible features for an image depending on the
current mode.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Sun, 16 Nov 2014 19:20:42 +0000 (14:20 -0500)]
librbd: Add convenience library to support unit tests
Unit tests need access to the private symbols of librbd no
longer exported from librbd.so. A new librbd_internal
convenience library was created to allow access.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Wed, 8 Oct 2014 12:41:53 +0000 (08:41 -0400)]
librbd: Integrate librbd with new exclusive lock feature
Operations that update the image now require the exclusive lock
if the feature is enabled. AIO write and discard operations will
automatically request the exclusive lock from the current leader
to support live-migration.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Matt Richards [Tue, 13 Jan 2015 00:59:42 +0000 (16:59 -0800)]
librados: bump rados version number
As a follow-on to 49d114f1fff90e5c0f206725a5eb82c0ba329376,
increment the "extra" version field so clients can easily
determine if they have a version of librados that properly
translates C API operation flags.
Signed-off-by: Matthew Richards <mattjrichards@gmail.com>
Sage Weil [Mon, 12 Jan 2015 22:00:21 +0000 (14:00 -0800)]
osd: enable filestore_extsize by default
Note that this will only get used if the kernel is new enough; if it is
older than 3.5 the option will get disabled and extsize will not be used
even if the option is set to true.
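One way to express the version gate described above is to parse the release string reported by uname(2). This is a sketch only: kernel_at_least and running_kernel_supports_extsize are hypothetical helpers, and the log does not show how FileStore actually performs the check.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/utsname.h>

// Returns true if a release string such as "3.2.0-23-generic" is at
// least major.minor. Unparseable strings are treated as too old.
bool kernel_at_least(const char* release, int major, int minor) {
  int maj = 0, min = 0;
  if (std::sscanf(release, "%d.%d", &maj, &min) != 2)
    return false;
  return maj > major || (maj == major && min >= minor);
}

// Hypothetical wrapper: query the running kernel and apply the 3.5 cutoff.
bool running_kernel_supports_extsize() {
  struct utsname u;
  if (uname(&u) != 0)
    return false;  // be conservative if the query fails
  return kernel_at_least(u.release, 3, 5);
}
```

Failing closed on a parse or uname() error keeps the option disabled whenever the kernel version cannot be confirmed.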
Sage Weil [Mon, 12 Jan 2015 21:59:39 +0000 (13:59 -0800)]
os/FileStore: verify kernel is new enough before using extsize ioctl
Old kernels have an XFS bug that exposes uninitialized data when the
extsize hint is set and only partially written. This is fixed by Linux
commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d, documented in XFS bug
http://oss.sgi.com/bugzilla/show_bug.cgi?id=874, and tested by XFS
test xfs/229 to prevent regressions.
Notably, the original bug affects kernel 3.2, which is widely deployed with
Ubuntu Precise 12.04.
Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
John Spray [Mon, 5 Jan 2015 19:34:57 +0000 (19:34 +0000)]
mon: implement `fs reset`
This is for use in CephFS disaster recovery. When
the metadata pool has been forcibly reset to a single-MDS
metadata tree, we would like to reset the MDSMap to match.
Sage Weil [Mon, 17 Nov 2014 20:46:51 +0000 (12:46 -0800)]
osd/ReplicatedPG: drop unnecessary cache_mode checks
This currently enumerates all cache modes except none, and we don't
arrive in this function when caching is disabled. And creating a whiteout
is not cache_mode dependent. Simplify!