common, osd, tools: Add histograms to performance counters
This change adds a new performance counter type: histogram.
Currently it measures the latency and request size (in bytes) of
an op similarly to l_osd_op_(r|w|rw)_lat and
l_osd_op_(r|w|rw)_(inb|outb). Histograms are 2-dimensional and
gather probes for both values at the same time. This allows
a more detailed analysis than what could be done with two
separate 1-D histograms. Still the memory footprint is negligible.
Since new data could break existing tools and the amount of data
dumped is rather large, two new admin socket commands are
introduced: perf histogram schema and perf histogram dump.
Invocation of the original command should remain compatible.
There's also a new simple tool in src/tools/histogram_dump.py
which does a live dump of one histogram in a local daemon.
Current configuration of histograms is hard-coded and should
cover all common cases. If it turns out to be useful, in the
future it could be loaded from the configuration.
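To illustrate why one 2-D histogram carries more information than two 1-D ones, here is a minimal sketch. It is not the Ceph implementation: the class name, the bucket bounds, and the units are hypothetical and chosen only for the example.

```python
from bisect import bisect_right


class Histogram2D:
    """Counts (latency, size) samples in a fixed 2-D grid of buckets."""

    def __init__(self, lat_bounds, size_bounds):
        # Bucket bounds per axis, sorted ascending (hypothetical values).
        self.lat_bounds = list(lat_bounds)
        self.size_bounds = list(size_bounds)
        # One extra bucket per axis catches values at or above the last bound.
        self.counts = [[0] * (len(self.size_bounds) + 1)
                       for _ in range(len(self.lat_bounds) + 1)]

    def inc(self, latency_us, size_bytes):
        # Both axes are resolved for the same sample, so the correlation
        # between latency and request size is preserved.
        row = bisect_right(self.lat_bounds, latency_us)
        col = bisect_right(self.size_bounds, size_bytes)
        self.counts[row][col] += 1


h = Histogram2D(lat_bounds=[100, 1000, 10000], size_bounds=[4096, 65536])
h.inc(50, 512)        # fast, small request
h.inc(5000, 1 << 20)  # slow, large request
h.inc(5000, 512)      # slow, small request
```

Two separate 1-D histograms would show two slow ops and two small ops, but could not tell that one op was both slow and small.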
Marcus Watts [Wed, 11 Jan 2017 05:06:15 +0000 (00:06 -0500)]
radosgw/swift: clean up flush / newline behavior.
The current code emits a newline after swift errors, but fails
to account for it when it calculates 'content-length'. This results in
some clients (go github.com/ncw/swift) producing complaints about the
unsolicited newline such as this,
Unsolicited response received on idle HTTP channel starting with "\n"; err=<nil>
This logic eliminates the newline on flush. This makes the content length
calculation correct and eliminates the stray newline.
There was already existing separator logic in the rgw plain formatter
that can emit a newline at the correct point. It had been checking
"len" to decide if previous data had been emitted, but that's reset to 0
by flush(). So, this logic adds a new per-instance variable to separately
track state that it emitted a previous item (and should emit a newline).
Fixes: http://tracker.ceph.com/issues/18473
Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
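The separator logic described above can be sketched roughly as follows. This is an illustration only, not the rgw plain formatter: the class and attribute names are hypothetical.

```python
class PlainFormatterSketch:
    """Emits a newline between items, never after the last one."""

    def __init__(self):
        self.buf = ""
        # Hypothetical per-instance flag. Unlike len(buf), it is not
        # reset by flush(), so it still records that an item was emitted.
        self.wrote_something = False

    def dump(self, item):
        if self.wrote_something:
            self.buf += "\n"  # separator goes before the next item
        self.buf += item
        self.wrote_something = True

    def flush(self):
        # flush() empties the buffer and adds no trailing newline.
        out, self.buf = self.buf, ""
        return out


f = PlainFormatterSketch()
f.dump("error one")
first = f.flush()
f.dump("error two")
second = f.flush()
```

With a check on the buffer length alone, the second item would be emitted without a separator, because the buffer is empty again after the first flush.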
Sage Weil [Thu, 26 Jan 2017 19:22:53 +0000 (14:22 -0500)]
os/bluestore: fix statfs to not include DB partition in free space
If we report the DB space as available, Ceph thinks the OSD can store more
data and will not mark the cluster as full as easily. And in reality, we
can't actually store data in this space--only metadata. Avoid the problem
by not reporting it as available.
Fixes: http://tracker.ceph.com/issues/18599
Signed-off-by: Sage Weil <sage@redhat.com>
Yan, Zheng [Thu, 26 Jan 2017 08:58:41 +0000 (16:58 +0800)]
client: remove request from session->requests when handling forward
Client::handle_client_request_forward() resets request->mds to -1;
it should also remove the request from session->requests. Otherwise
Client::kick_requests_closed() gets confused.
Amir Vadai [Wed, 25 Jan 2017 08:36:00 +0000 (10:36 +0200)]
msg/RDMA: Fix broken compilation due to new argument in net.connect()
Fixes: 6e4ed291afc3 ("msg: add ms_bind_before_connect to bind before connect")
Change-Id: Ia45f215b5d59dfc8545017518e5162404059829e
Signed-off-by: Amir Vadai <amir@vadai.me>
Yan, Zheng [Wed, 25 Jan 2017 07:28:23 +0000 (15:28 +0800)]
mds: don't purge strays when mds is in clientreplay state
The MDS does not trim its log while it is in the clientreplay state. If the
MDS hangs in the clientreplay state (due to a bug), purging strays can submit
lots of log events and create a very large MDS log.
Yan, Zheng [Wed, 25 Jan 2017 03:03:45 +0000 (11:03 +0800)]
mds: skip fragment space check for replayed request
When handling a replayed request, the stray directory can be different
from the stray directory used by the original request, so the fragment
space check for the stray directory can fail.
Matt Benjamin [Sat, 31 Dec 2016 04:30:16 +0000 (23:30 -0500)]
rgw_file: interned RGWFileHandle objects need parent refs
RGW NFS fhcache/RGWFileHandle operators assume existence of the
full chain of parents from any object to its fs_root--this is
a consequence of the weakly-connected namespace design goal, and
not a defect.
This change ensures the invariant by taking a parent ref when
objects are interned (when a parent ref is guaranteed). Parent
refs are returned when objects are destroyed--essentially by the
invariant, such a ref must exist.
The extra ref is omitted when parent->is_root(), as that node is
not in the LRU cache.
Fixes: http://tracker.ceph.com/issues/18650
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Thu, 19 Jan 2017 23:14:30 +0000 (18:14 -0500)]
rgw_file: add timed namespace invalidation
With this change, librgw/rgw_file consumers can provide an invalidation
callback, which is used by the library to invalidate directories
whose contents should be forgotten.
The existing RGWLib GC mechanism is being used to drive this. New
configuration params have been added. The main configurable is
rgw_nfs_namespace_expire_secs, the expire timeout.
Updated post Yehuda review.
Fixes: http://tracker.ceph.com/issues/18651
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
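As a rough sketch of the idea, a periodic GC pass can call a consumer-supplied callback for every directory older than the expire timeout. Everything here is hypothetical (class, method names, paths, the 300-second value); it does not reflect the librgw API, and the clock is passed in explicitly to keep the example deterministic.

```python
class NamespaceExpirySketch:
    """Calls invalidate_cb for directories not refreshed within expire_secs."""

    def __init__(self, expire_secs, invalidate_cb):
        self.expire_secs = expire_secs      # plays the role of the expire timeout
        self.invalidate_cb = invalidate_cb  # supplied by the library consumer
        self.last_refreshed = {}            # directory -> timestamp in seconds

    def touch(self, directory, now):
        self.last_refreshed[directory] = now

    def gc_pass(self, now):
        # Collect first, then mutate, so the dict is not changed mid-iteration.
        expired = [d for d, t in self.last_refreshed.items()
                   if now - t >= self.expire_secs]
        for d in expired:
            del self.last_refreshed[d]
            self.invalidate_cb(d)


invalidated = []
ns = NamespaceExpirySketch(expire_secs=300, invalidate_cb=invalidated.append)
ns.touch("/bucket/a", now=0)
ns.touch("/bucket/b", now=200)
ns.gc_pass(now=300)  # only /bucket/a has reached the timeout
```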
Sage Weil [Fri, 20 Jan 2017 18:59:56 +0000 (13:59 -0500)]
os/bluestore/BlueFS: increase size threshold before we flush (and generate io)
Having this too high means you might be more bursty. In practice,
though, the commit path is doing explicit syncs on small chunks
anyway. And compaction work should probably stay reasonably chunky.
Hongtong Liu [Sun, 22 Jan 2017 09:25:04 +0000 (17:25 +0800)]
os/bluestore: fix NVMEDevice::open failure if serial number ends with a number
buf is in effect the serial number from ceph.conf, and
the serial number consists of 16 hexadecimal characters.
1. In order to avoid ignoring the numbers, scan buf
with isxdigit.
2. In order to ignore all the potential garbage,
scan buf from the beginning.
Signed-off-by: Hongtong Liu <hongtong.liu@istuary.com>
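The two points above can be sketched as a forward scan that keeps leading hexadecimal characters and stops at the first other character. This is an illustration, not the NVMEDevice code: the function name and the sample serial are made up.

```python
import string


def extract_serial(buf):
    """Return the leading run of hexadecimal characters in buf."""
    end = 0
    # Scan from the beginning; digits count as valid serial characters,
    # so a serial that ends with a number is kept whole.
    while end < len(buf) and buf[end] in string.hexdigits:
        end += 1
    # Everything from the first non-hex character on is treated as garbage.
    return buf[:end]


serial = extract_serial("55cd2e404bd73932\n\x00junk")
```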
Kefu Chai [Thu, 19 Jan 2017 04:36:06 +0000 (12:36 +0800)]
common/BackTrace: demangle on FreeBSD also
The output on FreeBSD/clang looks like:
1: 0x44bfb3 <_Z3foov+0x413> at /usr/srcs/Ceph/work/ceph/build/bin/unittest_back_trace
2: 0x44c23e <_ZN20BackTrace_Basic_Test8TestBodyEv+0x1e> at /usr/srcs/Ceph/work/ceph/build/bin/unittest_back_trace
3: 0x4d068a <_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x7a> at /usr/srcs/Ceph/work/ceph/build/bin/unittest_back_trace
4: 0x4b5977 <_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x77> at /usr/srcs/Ceph/work/ceph/build/bin/unittest_back_trace
...
Update the test accordingly, as FreeBSD/clang uses '<>' to enclose
the mangled function and offset.
In addition, only demangle the C++ mangled names. Those names always start
with "_Z". On FreeBSD, after demangling, "main" is turned into "unsigned
long", which does not make sense.
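A minimal sketch of the parsing rule, for illustration only: pull the symbol and offset out of the '<>' pair and flag the symbol for demangling only when it starts with "_Z". The regular expression and function name are hypothetical, and no actual demangling is performed here.

```python
import re

# Matches "<symbol+0xOFFSET>" as it appears in the FreeBSD/clang output.
FRAME_RE = re.compile(r"<(?P<symbol>[^+>]+)\+(?P<offset>0x[0-9a-fA-F]+)>")


def parse_frame(line):
    """Return (symbol, offset, should_demangle), or None if no match."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    symbol = m.group("symbol")
    # Only C++ mangled names start with "_Z"; plain C names such as
    # "main" are left as they are.
    return symbol, m.group("offset"), symbol.startswith("_Z")


cxx = parse_frame("1: 0x44bfb3 <_Z3foov+0x413> at "
                  "/usr/srcs/Ceph/work/ceph/build/bin/unittest_back_trace")
plain = parse_frame("9: 0x44b000 <main+0x10> at ./a.out")
```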
Yan, Zheng [Sun, 22 Jan 2017 02:24:28 +0000 (10:24 +0800)]
mds: don't modify inode that is not projected
In the slave rename prep case (rename inode to different auth mds), the
rename inode is not projected. CDir::check_rstat() gets confused if
MDCache::_project_rstat_inode_to_frag() updates inode's accounted rstat
in that case.