git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
10 years agoDequeue XioMsg on send-fail
Matt Benjamin [Tue, 30 Dec 2014 21:51:00 +0000 (16:51 -0500)]
Dequeue XioMsg on send-fail

If a message send hard-fails, remove it from the send_q.  Leaving it
queued results in an assert when the queue is in safe mode, and with
safe mode disabled, send_q would be corrupted.

Don't fall through and erase the iterator twice.  Continue the loop,
as in the incoming release case.
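
The loop shape described above can be sketched roughly as follows. This is a minimal illustration, not the XioPortal code: `Msg`, `FlushResult`, `flush_queue` and the `try_send` callback are hypothetical names, and a `std::list` stands in for send_q.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>

// Hypothetical stand-in for a queued message; not the real XioMsg.
struct Msg {
  int id;
};

// Counters returned by the sketch so the outcome can be inspected.
struct FlushResult {
  std::size_t sent = 0;
  std::size_t failed = 0;
};

// Drain the queue. Every message leaves the queue exactly once: a message
// whose send hard-fails is erased and the loop continues, so control never
// falls through to the erase on the success path.
inline FlushResult flush_queue(
    std::list<Msg>& send_q,
    const std::function<bool(const Msg&)>& try_send) {
  FlushResult r;
  const std::size_t initial = send_q.size();
  auto it = send_q.begin();
  while (it != send_q.end()) {
    if (!try_send(*it)) {
      it = send_q.erase(it);  // erase() returns the next valid iterator
      ++r.failed;
      continue;               // skip the success-path erase below
    }
    ++r.sent;
    it = send_q.erase(it);    // success path: dequeue after sending
  }
  assert(send_q.empty() && r.sent + r.failed == initial);
  return r;
}
```

Assigning the result of `erase()` back to the iterator is what keeps the loop valid on both paths; the `continue` is what prevents the double erase.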

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoReduce lock spam in XioPortal SubmitQueue.
Matt Benjamin [Tue, 16 Dec 2014 16:38:06 +0000 (11:38 -0500)]
Reduce lock spam in XioPortal SubmitQueue.

In SubmitQueue::deq, scan up to nlanes, exit immediately with
found work, recalling last lane.

Yield to Accelio iff a full scan finds no work.
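
A rough sketch of the scan described above, under stated assumptions: `LaneQueue` is a hypothetical class, the per-lane locking of the real SubmitQueue is omitted, and starting the scan just after the last productive lane is one possible reading of "recalling last lane".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical multi-lane queue; not the real XioPortal::SubmitQueue.
class LaneQueue {
 public:
  explicit LaneQueue(std::size_t nlanes)
      : lanes_(nlanes), last_lane_(nlanes - 1) {
    assert(nlanes > 0);
  }

  void enq(std::size_t lane, int item) {
    lanes_[lane % lanes_.size()].push_back(item);
  }

  // Scan at most nlanes lanes, starting just after the lane that last
  // produced work. Returns true and sets `out` as soon as one lane has work.
  // Returns false only after a full scan finds nothing, which is the
  // caller's cue to yield to the event loop.
  bool deq(int& out) {
    const std::size_t n = lanes_.size();
    for (std::size_t i = 0; i < n; ++i) {
      const std::size_t ix = (last_lane_ + 1 + i) % n;
      std::deque<int>& lane = lanes_[ix];
      if (!lane.empty()) {
        out = lane.front();
        lane.pop_front();
        last_lane_ = ix;  // remember where work was found
        return true;
      }
    }
    return false;
  }

 private:
  std::vector<std::deque<int>> lanes_;
  std::size_t last_lane_;
};
```

A caller would loop on `deq()` and hand control back to Accelio only when it returns false.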

xio: avoid starving run loop, don't stop loop in shutdown()
     Fix 2 issues flagged in review by Alex Rosenbaum.
     * In the XioPortal main loop, the recent reduce lock contention change
       also made it easier to starve Accelio under steady send work.  Restore
       the original behavior.
     * Remove the call to xio_context_stop_loop() in XioPortal::shutdown(),
       to ensure that Accelio can finish cleaning up.

Move queue guard check.
Release der spinlock.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoxio: initial mark_* and queueing/flow control
Matt Benjamin [Mon, 15 Dec 2014 19:53:09 +0000 (14:53 -0500)]
xio:  initial mark_* and queueing/flow control

This change implements explicit support for Accelio sender-side
flow control, which requires queuing messages for later delivery
when the connection is ready to send.

This requirement to queue messages for later delivery, and related
connection state logic, is substantially shared with new session
reset behavior, so we've pulled a subset of that logic forward.

Again due to shared implementation logic, this change also adds
implementations of mark_down(), mark_down_all(), mark_disposable(),
and related methods from Messenger, which were required to be
implemented after Hammer.

Add XioSubmit.h.

For now, start at state UP, READY.

When considering if a flow-controlled connection can be unblocked,
consider only the computed queue depth.  Re-activate and flush the
connection iff the computed queue depth <= 1/2 of the queue
high-water mark.
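
The unblock rule above amounts to a simple hysteresis, sketched below. `FlowControl`, `on_send` and `on_complete` are hypothetical names chosen for illustration, and blocking at the high-water mark itself is an assumption; only the "depth <= 1/2 of the high-water mark" reactivation test comes from the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-connection flow-control state; not the XioConnection code.
struct FlowControl {
  std::uint32_t q_high_mark;      // depth at which sends start being deferred
  std::uint32_t queue_depth = 0;  // messages currently in flight
  bool blocked = false;

  explicit FlowControl(std::uint32_t high) : q_high_mark(high) {}

  // Returns true if the message may be sent now, false if it must be queued
  // for later delivery.
  bool on_send() {
    if (blocked || queue_depth >= q_high_mark) {
      blocked = true;
      return false;
    }
    ++queue_depth;
    return true;
  }

  // Called when one in-flight message completes. Returns true when the
  // connection has just been re-activated and deferred messages should be
  // flushed.
  bool on_complete() {
    assert(queue_depth > 0);
    --queue_depth;
    if (blocked && queue_depth <= q_high_mark / 2) {
      blocked = false;
      return true;
    }
    return false;
  }
};
```

Unblocking at half the mark rather than just below it avoids flapping between blocked and ready on every completion.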

Placeholder added for byte-throttled case.
Fix lock flags abuse (found by Casey).

Discard deferred and unsent messages on unplanned disconnect.
The change causes discard_input_queue() to be called in Accelio's
on_disconnect_event() handler, as well as on mark_down().

xio: Change new established connection's state to up and ready
     Change the new established passive connection's state to up and
     ready then flush all pending msgs in input_queue

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoxio: Enable Accelio flow control with msgs and bytes throttlers
Vu Pham [Thu, 11 Dec 2014 14:28:26 +0000 (06:28 -0800)]
xio: Enable Accelio flow control with msgs and bytes throttlers

* Enable Accelio flow control in general
* Read out policy for messages and bytes throttlers from connection's peer_type
* Set Accelio connection flow control with policy throttlers or default values
* Set q_high_mark for xio_connection (80% of queue_depth)
  xio: Correct q_high_mark setting

Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoxio: Configure Accelio internal pool
Vu Pham [Thu, 11 Dec 2014 13:35:56 +0000 (05:35 -0800)]
xio: Configure Accelio internal pool

Temporarily hardcoded all 6 allocators + growing quantum and max size
1k allocator - quantum 4k - max 256k
4k allocator - quantum 4k - max 256k
16k allocator - quantum 4k - max 256k
64k allocator - quantum 1k - max 64k
256k allocator - quantum 512 - max 16k
1m allocator - quantum 128 - max 8k

Later we need to calculate the sustainable workload and dynamically
configure Accelio's internal pool accordingly

Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoAccelio Autotools glue.
Matt Benjamin [Mon, 21 Apr 2014 18:47:09 +0000 (14:47 -0400)]
Accelio Autotools glue.

Add Accelio to the build process when --enable-xio is provided.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agocmake: add xio
Casey Bodley [Thu, 4 Sep 2014 15:18:55 +0000 (11:18 -0400)]
cmake: add xio

Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoCeph Accelio/RDMA Transport (XioMessenger).
Matt Benjamin [Tue, 23 Dec 2014 21:46:47 +0000 (16:46 -0500)]
Ceph Accelio/RDMA Transport (XioMessenger).

XioMessenger implements a Ceph Messenger provider for Accelio,
a high-performance messaging transport by Mellanox.  Current
Accelio is layered on ibverbs, and supports Infiniband, ROCE,
and other RDMA transports.  Future Accelio versions will support
alternative transports (including TCP), and flexible transport
selection.

config: cluster_rdma drives messenger creation
ceph_mds ceph_mon and ceph_osd use XioMessengers for cluster
communication when cluster_rdma is set
Move XioMessenger to msg/xio.
This matches the other new Messenger locations.
test: tests for tcp and xio messengers
(Not tests only.)
buffer: add subclass for xio buffers
xio: convert to Connection::send_message interface
config: -x, --xio as aliases for client_rdma
ceph-fuse: create xio messenger if client_rdma
Find XioMessenger.h and QueueStrategy.h in msg/xio.
ceph-syn: create xio messenger if client_rdma
librados: create xio messenger if client_rdma
Find XioMessenger.h and QueueStrategy.h in msg/xio.
Restore non-abort from Xio Mon integration.
Fix xio_client send count, again.
xio: must signal cond under mutex lock
xio: dispatch strategies support ms_fast_dispatch
xio: config variable xio_port_shift
remove set_port_shift() from XioMessenger, and just use the value
from the configuration
xio: don't depend on g_ceph_context for dout
XioMessenger now uses its own cct for all logging operations
the accelio log function, however, still depends on a global
CephContext. so we maintain an extra one, separate from g_ceph_context,
in XioMessenger.cc that is initialized on first construction and a
reference is held indefinitely
script: cephfsnew to automate pool and fs creation
Use new on_ow_msg_send_complete hook.
Replaces on_msg_delivered for one-way message style.
Prototype new xio_discon behavior.
On shutdown, XioPortal threads should not exit before Accelio
finalizes all sessions.
Inline join_sessions, it needs sh_mtx held across wait loop.
Fix assert on Cond::Signal.  Adds Cond2.
Avoid deadlock, xio_disconnect can deliver a session teardown event.
Also Mutex2.
(Note, Mutex2 and Cond2 are replaced by standard C++ downstream.)
Restore SimpleDispatcher Timings.
The simple_client/simple_server timings are based on a ping/pong
of messages between the client and server, unlike those of the
xio_client/server programs, which are one-way (so their corresponding
1-way bandwidth is appx. 2x what the test reports).  We assert
that the results are in general comparable, because in both setups,
a fixed number of messages (def. 50) is maintained in flight.
Wrap Accelio mempool in XioPool, add stats.
To enable stat prints, set xio_trace_mempool.  Currently, prints
to stdout at each 64K messages sent or received.
Restore _send_message(..).
Fix merge errors in simple_client, simple_dispatcher.
xio: fix for size in pool stats
Add in/outbound msg counters to XioPoolStats.
Pool stats are easier to read.
Pool stats are easier to read, and if enabled, print on session
teardown. This is a convenient time to view stats, and with a small
Make pool stats counters atomic.
Track requests using hook ctor/dtor.
Lockless, portal thread provides atomicity.
Adapt to recent changes on Accelio for_next
* Accelio options now of opaque type
* on_msg_err with extra direction param
* RDMA behavior now governed by 2 new options
     XIO_OPTNAME_MAX_INLINE_DATA
     XIO_OPTNAME_MAX_INLINE_HEADER
* Separated send and recv queue depth
xio_messenger: Change xio optname queue depth msgs
* Set 16k threshold to rdma buffers instead of send
* Change xio optname for queue depth msgs
   XIO_OPTNAME_SND/RCV_QUEUE_DEPTH_MSGS
xio_messenger: Protect Accelio queue depth.
(Minimal send flow control.)
The guard is per xio_connection, and considering batches.
Increment happens only if xio_send_msg succeeded, decrement in
on_ow_msg_send_complete and on_msg_error.  Note that we don't need
atomics because counters are touched only in the correct portal
thread.
Find XioMsg.h in msg/xio
Find XioMessenger.h and QueueStrategy.h in msg/xio (tests).
Adapt to 2 Accelio API changes.
1. xio_context_stop_loop() takes only 1 argument
2. xio_connect() now takes a structure argument, by reference
Set CMP0046 iff CMake version >= 3
Move XioMessenger to msg/xio
xio: fix for segfault on xio_connect()
No more Mutex2, Cond2.
xio: number of portal threads is configurable
xio: only create additional portals on bind()
xio: use QueueStrategy(1) as default
xio: Messenger factory accepts ms_type "xio"
xio: use ms_type instead of client,cluster_rdma
     removing the ability to configure the client and cluster networks
     separately in favor of a single global messenger type
     --xio is now a command-line alias for --ms_type xio
     all daemons now use the Messenger::create() factory function instead of
     conditionally creating XioMessengers
     the OSD and Monitor classes no longer need separate messengers to
     deal with both tcp/rdma clients
xio: portal binding honors ms_bind_port_min,max
xio: remove xio_port_shift
     port shifting is no longer necessary, because we won't create both tcp
     and xio messengers for the same service
     Use Accelio sglist helper macros.
     xio: make xio buffer unshareable
xio: Nuke special_handling.
Replace GENERIC with MON (requested by Sage).

Signed-off-by: Casey Bodley <casey@cohortfs.com>
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoCosmetic ceph_mon.cc.
Matt Benjamin [Tue, 23 Dec 2014 21:41:03 +0000 (16:41 -0500)]
Cosmetic ceph_mon.cc.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoCosmetic ceph_osd.cc.
Matt Benjamin [Tue, 23 Dec 2014 21:37:38 +0000 (16:37 -0500)]
Cosmetic ceph_osd.cc.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoCosmetic ceph_mds.cc.
Matt Benjamin [Tue, 23 Dec 2014 21:29:52 +0000 (16:29 -0500)]
Cosmetic ceph_mds.cc.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoIntroduce Message flag values used by XioMessenger.
Matt Benjamin [Tue, 23 Dec 2014 21:13:22 +0000 (16:13 -0500)]
Introduce Message flag values used by XioMessenger.

These correspond to bits in Message::magic and the erstwhile
"special_handling" member.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoAdd Message::set_src(const entity_name_t& src)
Matt Benjamin [Tue, 23 Dec 2014 22:10:07 +0000 (17:10 -0500)]
Add Message::set_src(const entity_name_t& src)

Permit setting the source endpoint of a Message.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoRemove pure virtuals from Message::CompletionHook.
Matt Benjamin [Tue, 23 Dec 2014 20:59:06 +0000 (15:59 -0500)]
Remove pure virtuals from Message::CompletionHook.

This was introduced for XioMessenger, but is no longer used.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoAdd intrusive list anchor for Message dispatch to Message.
Matt Benjamin [Tue, 23 Dec 2014 20:51:44 +0000 (15:51 -0500)]
Add intrusive list anchor for Message dispatch to Message.

This is currently used by XioMessenger dispatch strategies, but
could be extended to other Messenger types.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoAdd MDataPing.
Matt Benjamin [Tue, 23 Dec 2014 20:45:16 +0000 (15:45 -0500)]
Add MDataPing.

This message type is used for Messenger testing.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoAccelio ceph::buffer Extensions
Matt Benjamin [Tue, 23 Dec 2014 20:29:28 +0000 (15:29 -0500)]
Accelio ceph::buffer Extensions

Adds custom buffer::raw type xio_mempool, with hooks into Accelio
memory lifecycle.

The xio_mempool type is non-sharable by default.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoCosmetic: Normalize an entity_name_t initialization in ceph-syn.
Matt Benjamin [Tue, 23 Dec 2014 19:41:52 +0000 (14:41 -0500)]
Cosmetic:  Normalize an entity_name_t initialization in ceph-syn.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agomsg: crc configuration in messenger
Casey Bodley [Wed, 3 Sep 2014 16:29:17 +0000 (12:29 -0400)]
msg: crc configuration in messenger

Add new header_crc and data_crc configuration booleans, and use
them consistently to govern whether CRC is performed in the
Message encode, decode, and transit paths.

Remove ms_nocrc, changes per Sage.
Minimally adapt AsyncMessenger for crcflags.

Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoBuild rbd-fuse as a C++ unit (matching its existing linkage).
Matt Benjamin [Thu, 1 Jan 2015 18:47:20 +0000 (13:47 -0500)]
Build rbd-fuse as a C++ unit (matching its existing linkage).

Since rbd-fuse is linking C++ libraries, link it with the C++
runtime as we already do for ceph-fuse.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agomon: OSDMonitor sends maps over connection
Casey Bodley [Tue, 9 Sep 2014 17:36:37 +0000 (13:36 -0400)]
mon: OSDMonitor sends maps over connection

Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agomsg: remove create_anon_connection from Messenger
Casey Bodley [Wed, 3 Sep 2014 15:53:49 +0000 (11:53 -0400)]
msg: remove create_anon_connection from Messenger

the monitor now defines its own subclass of Connection to use for
Monitor::handle_forward(), rather than tying it to the Messenger
interface

Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agodout: dlog_p macro for should_gather
Matt Benjamin [Thu, 4 Sep 2014 15:15:38 +0000 (11:15 -0400)]
dout: dlog_p macro for should_gather

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoatomic: add and sub return their result
Casey Bodley [Thu, 4 Sep 2014 15:10:46 +0000 (11:10 -0400)]
atomic: add and sub return their result

Signed-off-by: Casey Bodley <casey@linuxbox.com>
Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoCombined CMake Build for Hammer
Ali Maredia [Mon, 23 Jun 2014 22:32:32 +0000 (18:32 -0400)]
Combined CMake Build for Hammer

CMake Ceph Build System (Firefly)
CMake.  Add tests.
Respace src/CMakeLists.txt.
CMake.  Spacing cleanups.
CMake for Firefly is Triumphant
CMake for Giant
Adapt to Giant.
Fix installation for scripts and man pages
Fix CEPH_LIBDIR and CEPH_PKGLIBDIR defines
Add erasure-code libraries
uses try_compile() to detect support for -msse flags
Fix rados object classes
Propagate Casey's cls library change to src/test.
Fix CMake build for Hammer.
Try-add rados and common to librbd link.
Fix name and linkage of libec_lrc.
Rename arch/neon.c arm.c
Fix libcommon.a dependencies (some unit tests).

Authors:
Ali Maredia <ali@cohortfs.com>
Casey Bodley <casey@cohortfs.com>
Adam Emerson <aemerson@cohortfs.com>
Marcus Watts <mdw@cohortfs.com>
Matt Benjamin <matt@cohortfs.com>

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoNull tracepoint macro when !WITH_LTTNG.
Matt Benjamin [Sat, 30 Aug 2014 19:28:52 +0000 (15:28 -0400)]
Null tracepoint macro when !WITH_LTTNG.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoDon't use __cplusplus to mean !__KERNEL__
Matt Benjamin [Mon, 23 Jun 2014 23:35:57 +0000 (19:35 -0400)]
Don't use __cplusplus to mean !__KERNEL__

Of course, this is Linux-centric.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoAdd missing Messenger::create ms_type in test_msgr.
Matt Benjamin [Wed, 31 Dec 2014 16:34:46 +0000 (11:34 -0500)]
Add missing Messenger::create ms_type in test_msgr.

Fixes trivial build breakage.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoFixup int_types.h.
Matt Benjamin [Mon, 23 Jun 2014 23:16:26 +0000 (19:16 -0400)]
Fixup int_types.h.

Signed-off-by: Matt Benjamin <matt@cohortfs.com>
10 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Wed, 14 Jan 2015 16:57:33 +0000 (08:57 -0800)]
Merge remote-tracking branch 'gh/next'

10 years agoPendingReleaseNotes: make a note about librados flag changes
Sage Weil [Tue, 13 Jan 2015 20:23:37 +0000 (12:23 -0800)]
PendingReleaseNotes: make a note about librados flag changes

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoMerge pull request #3360 from mattrichards/bump_rados_version
Sage Weil [Tue, 13 Jan 2015 20:18:04 +0000 (12:18 -0800)]
Merge pull request #3360 from mattrichards/bump_rados_version

librados: bump rados version number

Reviewed-by: Sage Weil <sage@redhat.com>
10 years ago0.91 v0.91
Jenkins [Tue, 13 Jan 2015 20:10:22 +0000 (12:10 -0800)]
0.91

10 years agoMerge pull request #2697 from ceph/wip-8900
Josh Durgin [Tue, 13 Jan 2015 19:17:29 +0000 (11:17 -0800)]
Merge pull request #2697 from ceph/wip-8900

RBD image watcher and new exclusive lock handling

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agoMerge pull request #3254 from trociny/feature-10036
Samuel Just [Tue, 13 Jan 2015 18:56:29 +0000 (10:56 -0800)]
Merge pull request #3254 from trociny/feature-10036

osd: osd tree to show primary-affinity value

Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agoMerge pull request #3281 from ceph/wip-10441-b
Samuel Just [Tue, 13 Jan 2015 18:55:29 +0000 (10:55 -0800)]
Merge pull request #3281 from ceph/wip-10441-b

osd: fix watch ordering bug 10441 option b

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agoMerge pull request #3290 from ceph/wip-da-SCA-20150102
Samuel Just [Tue, 13 Jan 2015 18:54:45 +0000 (10:54 -0800)]
Merge pull request #3290 from ceph/wip-da-SCA-20150102

Coverity and SCA fixes

Reviewed-by: Sage Weil <sage@redhat.com>
10 years agoMerge pull request #3302 from ceph/wip-9956
Samuel Just [Tue, 13 Jan 2015 18:54:21 +0000 (10:54 -0800)]
Merge pull request #3302 from ceph/wip-9956

os/FileStore: verify kernel is new enough before using extsize ioctl

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #3305 from majianpeng/fix5
Samuel Just [Tue, 13 Jan 2015 18:53:34 +0000 (10:53 -0800)]
Merge pull request #3305 from majianpeng/fix5

fix bugs about sync_filesystem

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
10 years agoMerge pull request #3364 from ceph/wip-quota-test
Gregory Farnum [Tue, 13 Jan 2015 15:08:30 +0000 (07:08 -0800)]
Merge pull request #3364 from ceph/wip-quota-test

qa: set -e explicitly in quota test

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
10 years agoqa: set -e explicitly in quota test 3364/head
John Spray [Tue, 13 Jan 2015 14:58:57 +0000 (14:58 +0000)]
qa: set -e explicitly in quota test

Previously it was set in the hashbang, which meant
that "./quota.sh" was OK, but "sh ./quota.sh" would
just run through, ignoring errors.

Signed-off-by: John Spray <john.spray@redhat.com>
10 years agoMerge pull request #3336 from ceph/wip-fs-reset
Gregory Farnum [Tue, 13 Jan 2015 14:47:04 +0000 (06:47 -0800)]
Merge pull request #3336 from ceph/wip-fs-reset

mon: implement `fs reset`

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
10 years agoMerge pull request #3343 from dachary/wip-10505-centos-parted
Loic Dachary [Tue, 13 Jan 2015 10:07:55 +0000 (11:07 +0100)]
Merge pull request #3343 from dachary/wip-10505-centos-parted

tests: install parted in centos Dockerfile

Reviewed-by: Joao Eduardo Luis <joao@redhat.com>
10 years agolibrbd: flush pending AIO requests under all existing flush scenarios 2697/head
Jason Dillaman [Tue, 13 Jan 2015 04:17:50 +0000 (23:17 -0500)]
librbd: flush pending AIO requests under all existing flush scenarios

AIO requests that are waiting on the image lock should be flushed
during all existing RBD flush scenarios.  A few flush cases were
missed in the original implementation.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrbd: AIO requests should retry lock requests
Jason Dillaman [Tue, 13 Jan 2015 04:14:11 +0000 (23:14 -0500)]
librbd: AIO requests should retry lock requests

Added a timer to support retrying AIO lock requests until
they are successful.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrbd: differentiate between R/O vs R/W RBD features
Jason Dillaman [Mon, 3 Nov 2014 21:51:06 +0000 (16:51 -0500)]
librbd: differentiate between R/O vs R/W RBD features

The new RBD exclusive lock feature should be treated as a
feature that is only applied when the image is opened in
R/W mode.

Older clients will need to handle the updated
cls_rbd::get_features method in order to properly determine
the incompatible features for an image depending on the
current mode.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrbd: Add internal unit test cases
Jason Dillaman [Tue, 21 Oct 2014 02:09:29 +0000 (22:09 -0400)]
librbd: Add internal unit test cases

The new unit tests cover the modifications made to integrate
the internal librbd functionality with the new ImageWatcher.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrbd: Add ImageWatcher unit test cases
Jason Dillaman [Fri, 17 Oct 2014 13:05:22 +0000 (09:05 -0400)]
librbd: Add ImageWatcher unit test cases

Directly unit test the new ImageWatcher class to complement
the existing librbd integration tests of exclusive lock
handling.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrbd: Add convenience library to support unit tests
Jason Dillaman [Sun, 16 Nov 2014 19:20:42 +0000 (14:20 -0500)]
librbd: Add convenience library to support unit tests

Unit tests need access to the private symbols of librbd no
longer exported from librbd.so.  A new librbd_internal
convenience library was created to allow access.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agorbd: Allow CLI to optionally create shared images
Jason Dillaman [Wed, 1 Oct 2014 20:12:21 +0000 (16:12 -0400)]
rbd: Allow CLI to optionally create shared images

Images that are flagged as shared cannot use the RBD
object map nor RBD mirroring features.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrbd: Integrate librbd with new exclusive lock feature
Jason Dillaman [Wed, 8 Oct 2014 12:41:53 +0000 (08:41 -0400)]
librbd: Integrate librbd with new exclusive lock feature

Operations that update the image now require the exclusive lock
if the feature is enabled.  AIO write and discard operations will
automatically request the exclusive lock from the current leader
to support live-migration.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agolibrados: bump rados version number 3360/head
Matt Richards [Tue, 13 Jan 2015 00:59:42 +0000 (16:59 -0800)]
librados: bump rados version number

As a follow-on to 49d114f1fff90e5c0f206725a5eb82c0ba329376,
increment the "extra" version field so clients can easily
determine if they have a version of librados that properly
translates C API operation flags.
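
A client-side check along these lines could look roughly like the sketch below. librados reports its version through `rados_version(int *major, int *minor, int *extra)`; here the three fields are modelled by a hypothetical `RadosVersion` struct so the comparison stands alone, and the threshold values in the usage note are placeholders, not the numbers from this commit.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical holder for the three fields librados reports.
struct RadosVersion {
  int ver_major;
  int ver_minor;
  int ver_extra;
};

// True if `have` is the same as or newer than `need`, comparing
// major, then minor, then extra.
inline bool at_least(const RadosVersion& have, const RadosVersion& need) {
  return std::tie(have.ver_major, have.ver_minor, have.ver_extra) >=
         std::tie(need.ver_major, need.ver_minor, need.ver_extra);
}
```

With a placeholder requirement of `{0, 69, 1}`, a library reporting `{0, 69, 0}` would fail the check and one reporting `{0, 69, 1}` or `{0, 70, 0}` would pass.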

Signed-off-by: Matthew Richards <mattjrichards@gmail.com>
10 years agoMerge pull request #3316 from ceph/wip-10471
Josh Durgin [Tue, 13 Jan 2015 00:20:28 +0000 (16:20 -0800)]
Merge pull request #3316 from ceph/wip-10471

rgw: index swift keys appropriately

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agolibrbd: Create image exclusive lock watch/notify handler
Jason Dillaman [Wed, 8 Oct 2014 12:20:47 +0000 (08:20 -0400)]
librbd: Create image exclusive lock watch/notify handler

The new watch/notify handler replaces the existing header
update watch/notify handler and adds support for managing
image exclusive lock leadership.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
10 years agoosd: enable filestore_extsize by default 3302/head
Sage Weil [Mon, 12 Jan 2015 22:00:21 +0000 (14:00 -0800)]
osd: enable filestore_extsize by default

Note that this will only get used if the kernel is new enough; if it is
older than 3.5 the option will get disabled and extsize will not be used
even if the option is set to true.
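
The gating described above can be sketched as a version comparison on the kernel release string. This is an illustration only: `parse_kernel_release` and `extsize_allowed` are hypothetical helpers, and in practice the string would come from something like the `release` field filled in by `uname(2)`. The 3.5 threshold is the one stated in the message.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Parse the leading "major.minor" of a release string such as
// "3.2.0-23-generic". Returns false if the string does not start that way.
inline bool parse_kernel_release(const std::string& release,
                                 int& kmajor, int& kminor) {
  return std::sscanf(release.c_str(), "%d.%d", &kmajor, &kminor) == 2;
}

// Decide whether the extsize hint should be used. The option only takes
// effect on kernel 3.5 or newer; an unparseable release string is treated
// conservatively as "too old".
inline bool extsize_allowed(const std::string& release, bool option_enabled) {
  int kmajor = 0;
  int kminor = 0;
  if (!option_enabled || !parse_kernel_release(release, kmajor, kminor)) {
    return false;
  }
  return kmajor > 3 || (kmajor == 3 && kminor >= 5);
}
```

Comparing the parsed integers rather than the raw strings matters here: as text, "3.13" would sort before "3.5".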

This partially reverts 01cd3cdc726a3e838bce05b355a021778b4e5db1.

Fixes: #9956
Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoos/FileStore: verify kernel is new enough before using extsize ioctl
Sage Weil [Mon, 12 Jan 2015 21:59:39 +0000 (13:59 -0800)]
os/FileStore: verify kernel is new enough before using extsize ioctl

Old kernels have an XFS bug that exposes uninitialized data when the
extsize hint is set and only partially written.  This is fixed by Linux
commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d, documented in XFS bug
http://oss.sgi.com/bugzilla/show_bug.cgi?id=874, and tested by XFS
test xfs/229 to prevent regressions.

Notably the original bug affects kernel 3.2, which is widely deployed with
ubuntu precise 12.04.

Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoMerge pull request #3352 from kylinstorage/fix-10503
Gregory Farnum [Mon, 12 Jan 2015 19:33:02 +0000 (11:33 -0800)]
Merge pull request #3352 from kylinstorage/fix-10503

Fix bug 10503: http://tracker.ceph.com/issues/10503

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
10 years agoMerge pull request #3203 from majianpeng/fix1
Samuel Just [Mon, 12 Jan 2015 16:39:48 +0000 (08:39 -0800)]
Merge pull request #3203 from majianpeng/fix1

avoid memcopy from librados to caller buffer

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #3034 from dachary/wip-10017-erasure-code-repair
Samuel Just [Mon, 12 Jan 2015 16:26:08 +0000 (08:26 -0800)]
Merge pull request #3034 from dachary/wip-10017-erasure-code-repair

erasure code repair when there are two failures

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #3148 from mslovy/optimazation_wbthrottle
Samuel Just [Mon, 12 Jan 2015 16:23:26 +0000 (08:23 -0800)]
Merge pull request #3148 from mslovy/optimazation_wbthrottle

os: WBThrottle: optimize the map to unordered_map

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agomon/MDSMonitor: add confirm flag to fs reset 3336/head
John Spray [Mon, 12 Jan 2015 14:52:43 +0000 (14:52 +0000)]
mon/MDSMonitor: add confirm flag to fs reset

This was already in the command map but was not
being checked.

Signed-off-by: John Spray <john.spray@redhat.com>
10 years agoqa: add `fs reset` to cephtool tests
John Spray [Mon, 12 Jan 2015 13:54:52 +0000 (13:54 +0000)]
qa: add `fs reset` to cephtool tests

This is just a superficial "I can call it" test;
its actual behaviour is checked elsewhere.

Signed-off-by: John Spray <john.spray@redhat.com>
10 years agomon: implement `fs reset`
John Spray [Mon, 5 Jan 2015 19:34:57 +0000 (19:34 +0000)]
mon: implement `fs reset`

This is for use in CephFS disaster recovery.  When
the metadata pool has been forcibly reset to a single-MDS
metadata tree, we would like to reset the MDSMap to match.

Signed-off-by: John Spray <john.spray@redhat.com>
10 years agoFix bug 10503: http://tracker.ceph.com/issues/10503 3352/head
Yunchuan Wen [Mon, 12 Jan 2015 05:49:32 +0000 (05:49 +0000)]
Fix bug 10503: http://tracker.ceph.com/issues/10503
ceph-fuse: quota code is not 32-bit safe for vxattr output
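
The message does not say what the fix was. As a general illustration of the kind of problem named here, a 64-bit value formatted with a format specifier sized for `long` is truncated or misread on 32-bit builds; a width-exact specifier avoids that. `format_quota` below is a hypothetical helper, not the ceph-fuse code.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <string>

// Format a 64-bit quota value as decimal text. PRIu64 expands to the
// correct printf length modifier on both 32-bit and 64-bit platforms.
inline std::string format_quota(std::uint64_t max_bytes) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%" PRIu64, max_bytes);
  return std::string(buf);
}
```

A value above 4 GiB, such as 5000000000, is the case that would expose a 32-bit formatting bug.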

Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
10 years agoMerge pull request #2948 from ceph/wip-promote
Sage Weil [Sun, 11 Jan 2015 15:55:08 +0000 (07:55 -0800)]
Merge pull request #2948 from ceph/wip-promote

osd: promote_object separation; proxy read

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
10 years agoceph_test_rados: add some debug output 2948/head
Sage Weil [Tue, 6 Jan 2015 21:01:45 +0000 (13:01 -0800)]
ceph_test_rados: add some debug output

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: improve proxy read cancelation
Sage Weil [Sun, 7 Dec 2014 01:45:28 +0000 (17:45 -0800)]
osd/ReplicatedPG: improve proxy read cancelation

Avoid taking the PG lock for a canceled read op (if we are lucky).  Recheck
after the lock is taken for good measure.
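
The check-then-recheck pattern described above can be sketched as follows. `ProxyReadOp` and `finish_proxy_read` are hypothetical names, a `std::mutex` stands in for the PG lock, and the `locks_taken` counter exists only to make the effect observable.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical proxy-read op; not the ReplicatedPG type.
struct ProxyReadOp {
  std::atomic<bool> canceled{false};
};

// Returns true if the completion work ran. A canceled op is rejected before
// the lock is taken when possible, and checked again under the lock in case
// the cancel raced with the first check.
inline bool finish_proxy_read(ProxyReadOp& op, std::mutex& pg_lock,
                              int& locks_taken) {
  if (op.canceled.load()) {
    return false;  // cheap early exit, no lock needed
  }
  std::lock_guard<std::mutex> guard(pg_lock);
  ++locks_taken;
  if (op.canceled.load()) {
    return false;  // canceled between the first check and the lock
  }
  // ... completion work would run here, under the lock ...
  return true;
}
```

The first check is only an optimization; correctness rests on the second check made while the lock is held.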

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: put proxy read completion on finisher
Sage Weil [Sun, 7 Dec 2014 01:42:51 +0000 (17:42 -0800)]
osd/ReplicatedPG: put proxy read completion on finisher

We can't use the synchronous completion callbacks (in fast dispatch
context) to do the proxy read completion work.

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd: tiering: avoid duplicate promotion on proxy read
Zhiqiang Wang [Fri, 28 Nov 2014 08:30:20 +0000 (16:30 +0800)]
osd: tiering: avoid duplicate promotion on proxy read

Do not promote in maybe_handle_cache if a promotion is already in progress.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
10 years agoosd: tiering: proxy instead of redirect read in writeback mode when the
Zhiqiang Wang [Wed, 26 Nov 2014 01:57:03 +0000 (09:57 +0800)]
osd: tiering: proxy instead of redirect read in writeback mode when the
cache pool is full

To preserve read op order

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
10 years agoosd: tiering: cancel and requeue proxy read when needed
Zhiqiang Wang [Fri, 21 Nov 2014 06:01:24 +0000 (14:01 +0800)]
osd: tiering: cancel and requeue proxy read when needed

Cancel and requeue proxy read in the following cases:
1) on_shutdown
2) on_change
3) background promotion is done

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Conflicts:
src/osd/ReplicatedPG.cc
src/osd/ReplicatedPG.h

10 years agoosd/ReplicatedPG: allow reads to proxy etc even if blocked
Sage Weil [Tue, 9 Dec 2014 01:57:13 +0000 (17:57 -0800)]
osd/ReplicatedPG: allow reads to proxy etc even if blocked

If we are not write ordered, continue with cache checks so that we can
(among other things) proxy reads while promoting.

Note that this may reorder reads for clients, but we've decided that's okay.

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agotest: add proxy read test
Zhiqiang Wang [Wed, 19 Nov 2014 03:14:46 +0000 (11:14 +0800)]
test: add proxy read test

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
10 years agoosd: tiering: proxy reads during promote
Zhiqiang Wang [Tue, 18 Nov 2014 23:47:32 +0000 (15:47 -0800)]
osd: tiering: proxy reads during promote

wip 9980. Do proxy read and async promotion for writeback.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
10 years agoosd: tiering: add cache mode READPROXY
Zhiqiang Wang [Tue, 18 Nov 2014 08:10:00 +0000 (16:10 +0800)]
osd: tiering: add cache mode READPROXY

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
10 years agoosd: tiering: add proxy read support
Zhiqiang Wang [Tue, 18 Nov 2014 07:54:47 +0000 (15:54 +0800)]
osd: tiering: add proxy read support

wip 9979

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
10 years agoosd/ReplicatedPG: separate promotion from the triggering op
Sage Weil [Mon, 17 Nov 2014 22:02:39 +0000 (14:02 -0800)]
osd/ReplicatedPG: separate promotion from the triggering op

Remove the triggering op from the internal promote machinery.

We keep the optional op arg to promote_object() only because we may
block on an object other than the original obc.

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: pass promote error to all blocked operations
Sage Weil [Mon, 17 Nov 2014 21:06:29 +0000 (13:06 -0800)]
osd/ReplicatedPG: pass promote error to all blocked operations

This isn't the most elegant strategy, but it is the best we can do
right now.

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: drop unnecessary cache_mode checks
Sage Weil [Mon, 17 Nov 2014 20:46:51 +0000 (12:46 -0800)]
osd/ReplicatedPG: drop unnecessary cache_mode checks

The check currently enumerates all cache modes except none, and we don't
arrive in this function when caching is disabled.  Creating a whiteout
is also not cache_mode dependent.  Simplify!

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: adjust braces (no semantic change)
Sage Weil [Thu, 23 Oct 2014 23:53:14 +0000 (16:53 -0700)]
osd/ReplicatedPG: adjust braces (no semantic change)

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: factor out must_promote case from all cache modes
Sage Weil [Thu, 23 Oct 2014 23:52:22 +0000 (16:52 -0700)]
osd/ReplicatedPG: factor out must_promote case from all cache modes

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: factor out common exists case from all cache modes
Sage Weil [Thu, 23 Oct 2014 23:51:03 +0000 (16:51 -0700)]
osd/ReplicatedPG: factor out common exists case from all cache modes

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/ReplicatedPG: make op argument to promote_object optional
Sage Weil [Thu, 23 Oct 2014 21:34:36 +0000 (14:34 -0700)]
osd/ReplicatedPG: make op argument to promote_object optional

For now, we still always pass it.  In preparation, however, we modify
promote_object() so that it will work when op is null.

Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoMerge pull request #3309 from trociny/wip-9483 3484/head
Josh Durgin [Sat, 10 Jan 2015 23:25:10 +0000 (15:25 -0800)]
Merge pull request #3309 from trociny/wip-9483

OSD: add a get_latest_osdmap command to the admin socket

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agoOSD: add a get_latest_osdmap command to the admin socket 3309/head
Mykola Golub [Wed, 7 Jan 2015 11:39:33 +0000 (13:39 +0200)]
OSD: add a get_latest_osdmap command to the admin socket

The command blocks and ensures we have the latest map from the
mon. This is useful in testing and to "unstick" clusters in some
odd situations.

Fixes: #9483, #9484 (maybe)
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
10 years agodoc: Fix PHP librados documentation
Wido den Hollander [Sat, 10 Jan 2015 13:21:27 +0000 (14:21 +0100)]
doc: Fix PHP librados documentation

10 years agoMerge pull request #3348 from ceph/wip-mon-wishlist
Loic Dachary [Sat, 10 Jan 2015 12:54:56 +0000 (13:54 +0100)]
Merge pull request #3348 from ceph/wip-mon-wishlist

doc: mon janitorial list is now a wishlist

Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agodoc: mon janitorial list is now a wishlist 3348/head
Joao Eduardo Luis [Sat, 10 Jan 2015 12:08:22 +0000 (12:08 +0000)]
doc: mon janitorial list is now a wishlist

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
10 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sat, 10 Jan 2015 05:43:49 +0000 (21:43 -0800)]
Merge remote-tracking branch 'gh/next'

10 years agoMerge pull request #3327 from ceph/wip-peeringqueue
Sage Weil [Sat, 10 Jan 2015 05:43:04 +0000 (21:43 -0800)]
Merge pull request #3327 from ceph/wip-peeringqueue

osd: fix peering queue bug

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #3344 from ceph/wip-librbd-snap-unprotect
Josh Durgin [Sat, 10 Jan 2015 00:56:39 +0000 (16:56 -0800)]
Merge pull request #3344 from ceph/wip-librbd-snap-unprotect

librbd: shadow variable in snap_unprotect

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agorgw: return InvalidAccessKeyId instead of AccessDenied
Yehuda Sadeh [Tue, 16 Dec 2014 20:27:54 +0000 (12:27 -0800)]
rgw: return InvalidAccessKeyId instead of AccessDenied

Fixes: #10334
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 56af795b1046a4c1bfba59e1fefde272bb0e5c1e)

10 years agorgw: return SignatureDoesNotMatch instead of AccessDenied
Yehuda Sadeh [Tue, 16 Dec 2014 17:11:20 +0000 (09:11 -0800)]
rgw: return SignatureDoesNotMatch instead of AccessDenied

Fixes: #10329
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit ef75d720f289ce2e18c0047380a16b7688864560)

10 years agotests: install parted in centos Dockerfile 3343/head
Loic Dachary [Fri, 9 Jan 2015 23:05:53 +0000 (00:05 +0100)]
tests: install parted in centos Dockerfile

It is needed to run root ceph-disk tests.

http://tracker.ceph.com/issues/10505 Fixes: #10505

Signed-off-by: Loic Dachary <ldachary@redhat.com>
10 years agodoc: Clean up pool usage.
John Wilkins [Fri, 9 Jan 2015 22:54:30 +0000 (14:54 -0800)]
doc: Clean up pool usage.

Signed-off-by: John Wilkins <jowilkin@redhat.com>
10 years agodoc: Cleanup RGW pool usage.
John Wilkins [Fri, 9 Jan 2015 22:54:06 +0000 (14:54 -0800)]
doc: Cleanup RGW pool usage.

Signed-off-by: John Wilkins <jowilkin@redhat.com>
10 years agoMerge pull request #3341 from liewegas/wip-10504
Gregory Farnum [Fri, 9 Jan 2015 22:52:19 +0000 (14:52 -0800)]
Merge pull request #3341 from liewegas/wip-10504

client: add ceph version to metadata

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
10 years agoclient: include ceph and git version in client metadata 3341/head
Sage Weil [Fri, 9 Jan 2015 22:41:34 +0000 (14:41 -0800)]
client: include ceph and git version in client metadata

Fixes: #10504
Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoMerge pull request #3325 from ceph/wip-nits
Josh Durgin [Fri, 9 Jan 2015 22:30:44 +0000 (14:30 -0800)]
Merge pull request #3325 from ceph/wip-nits

allow 'ops' instead of 'dump_ops_in_flight'

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agoMerge pull request #3250 from ceph/wip-10372
Josh Durgin [Fri, 9 Jan 2015 22:12:22 +0000 (14:12 -0800)]
Merge pull request #3250 from ceph/wip-10372

osdc/Objecter: improve pool deletion detection

Reviewed-by: Josh Durgin <jdurgin@redhat.com>