Samuel Just [Mon, 3 Mar 2014 01:31:38 +0000 (17:31 -0800)]
TestPGLog: tests for proc_replica_log/merge_log equivalence
We need the merge_log and proc_replica_log paths to result in the
same missing set. This patch adds some machinery for specifying
a log merge scenario and comparing both paths to the same correct
result. This machinery also makes it a bit easier to read and add
new tests.
Samuel Just [Sun, 2 Mar 2014 21:42:16 +0000 (13:42 -0800)]
PGLog::proc_replica_log: _merge_divergent_entries based on truncated olog
We can't merge using the primary's log since we haven't decided whether
to send them a complete log yet. Thus, merge based on the truncated olog
rather than the primary's log. This is a consequence of the division
between trimming divergent entries in peering/unfound search and sending
a complete log to actual members of the actingbackfill set in activate().
_merge_divergent_entries on the truncated log and add_next_event() on the
newer entries result in the same missing/log regardless of the order in
which they are performed.
In the first case, we should end up with foo removed from missing
at the end. In the second, we need foo added to missing at 1,1.
It's far simpler to present all of the divergent entries for a single
object at once.
Ilya Dryomov [Fri, 21 Feb 2014 14:34:14 +0000 (16:34 +0200)]
librbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
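As a rough illustration of the expected_write_size value described above,
the sketch below computes the object size from an image order. The helper
name object_size_from_order is hypothetical and is not part of librbd; only
the (1 << order) relationship comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a librbd function): the object size that the
// commit message uses as the expected_write_size hint, i.e. 1 << order.
uint64_t object_size_from_order(uint8_t order) {
  assert(order < 64);  // shifting a 64-bit value by 64 or more is undefined
  return uint64_t(1) << order;
}
```

With an order of 22 this yields 4194304 bytes (4 MiB).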
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
FileStore: introduce XfsFileStoreBackend class
Introduce XfsFileStoreBackend class, currently the only filestore
backend implementing SETALLOCHINT op. This commit adds a build-time
dependency on libxfs as xfs-specific ioctl (XFS_IOC_FSSETXATTR /
XFS_XFLAG_EXTSIZE) is used to implement the new set_alloc_hint()
method.
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
FileStore: refactor FS detection checks a bit
Refactor FS detection checks in FileStore::_detect_fs() so that they
look the same as the ones in FileStore::mkfs(). This is in preparation
for adding XfsFileStoreBackend class.
Ilya Dryomov [Fri, 21 Feb 2014 14:34:13 +0000 (16:34 +0200)]
osd: add SETALLOCHINT operation
This is primarily for librbd/krbd's benefit and is supposed to combat
fragmentation:
"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks. We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."
SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
Loic Dachary [Fri, 28 Feb 2014 12:57:20 +0000 (13:57 +0100)]
osd: do not attempt to read past the object size
When reading from a replicated pool, trying to read more than the object
size results in a short read that does not go beyond the object size. In
erasure coded pools, objects are padded and the read will return more
bytes than the object actually contains.
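One way to picture the fix is to clamp the requested length before issuing
the read. This is only a sketch under that assumption, not the actual OSD
change; clamp_read_length and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: limit a requested read so that it never extends
// past the logical end of the object. Returns the number of bytes to read.
uint64_t clamp_read_length(uint64_t offset, uint64_t length,
                           uint64_t object_size) {
  if (offset >= object_size)
    return 0;  // nothing to read at or beyond the end of the object
  return std::min(length, object_size - offset);
}
```

Clamping up front means the padding that an erasure coded pool adds to an
object is never handed back to the caller.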
Samuel Just [Sat, 1 Mar 2014 22:33:11 +0000 (14:33 -0800)]
osd_types,PG: trim mod_desc for log entries to min size
In the event that mod_desc.bl contains pointers into a large
message buffer, we'd otherwise end up keeping around the entire
MOSDECSubOpWrite which created each log entry.
Fixes: #7539
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Sat, 1 Mar 2014 21:39:57 +0000 (13:39 -0800)]
MOSDOp: drop ops vector in clear_data()
Otherwise, clear_data on MOSDOp will leave essentially
all of the buffers intact. This is a problem since the
OpTracker mechanism relies on being able to keep the message
around without keeping around the data.
Danny Al-Gaaf [Sat, 1 Mar 2014 13:26:18 +0000 (14:26 +0100)]
req_state: fix uninitialized bool var
CID 717359 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
uninit_member: Non-static class member "bucket_exists" is not
initialized in this constructor nor in any functions that it calls.
Set bucket_exists to false in req_state::req_state().
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
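A minimal sketch of this kind of fix, using a trimmed-down stand-in rather
than the real req_state (the struct name req_state_sketch is hypothetical):

```cpp
#include <cassert>

// Hypothetical, trimmed-down stand-in for req_state. Only the member
// named in the Coverity report is modelled here.
struct req_state_sketch {
  bool bucket_exists;

  // Initialize the flag in the constructor so that it never holds an
  // indeterminate value.
  req_state_sketch() : bucket_exists(false) {}
};
```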
Danny Al-Gaaf [Sat, 1 Mar 2014 12:33:18 +0000 (13:33 +0100)]
PGMonitor: fix uninitialized scalar variable
Fix type handling in dump_stuck_pg_stats. If type doesn't match a
known PGMap::STUCK_* type, print a message and return directly from
the function.
CID 1030132 (#2 of 2): Uninitialized scalar variable (UNINIT)
uninit_use_in_call: Using uninitialized value "stuck_type" when calling
"PGMap::dump_stuck(ceph::Formatter *, PGMap::StuckPG, utime_t) const"
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
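The early-return pattern can be sketched as follows. The enum, the type
strings and the function name are assumptions made for illustration; they
are not copied from PGMonitor.

```cpp
#include <cassert>
#include <string>

// Assumed stand-in for the PGMap::STUCK_* values.
enum StuckKind { STUCK_INACTIVE, STUCK_UNCLEAN, STUCK_STALE };

// Returns true and sets *out only for a known type string. For an unknown
// string, *out is left untouched and false is returned, so the caller can
// report the problem and return instead of using an uninitialized value.
bool parse_stuck_kind(const std::string &type, StuckKind *out) {
  if (type == "inactive")
    *out = STUCK_INACTIVE;
  else if (type == "unclean")
    *out = STUCK_UNCLEAN;
  else if (type == "stale")
    *out = STUCK_STALE;
  else
    return false;
  return true;
}
```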
Danny Al-Gaaf [Sat, 1 Mar 2014 12:11:48 +0000 (13:11 +0100)]
MDCache: fix potential null pointer deref
CID 716921 (#1 of 1): Dereference after null check (FORWARD_NULL)
var_deref_model: Passing null pointer "dir" to function
"operator <<(std::ostream &, CDir &)", which dereferences it.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Sat, 1 Mar 2014 11:10:56 +0000 (12:10 +0100)]
MDCache::handle_discover: fix null pointer deref
CID 716990 (#1 of 1): Dereference null return value (NULL_RETURNS)
dereference: Dereferencing a pointer that might be null "cur" when calling
"MDCache::replicate_inode(CInode *, int, ceph::bufferlist &)"
Add assert to check for return value from get_inode() as done in other places.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Sat, 1 Mar 2014 10:44:39 +0000 (11:44 +0100)]
c_read_operations.cc: fix resource leak
CID 1188154 (#2 of 2): Resource leak (RESOURCE_LEAK)
overwrite_var: Overwriting "op" in "op = rados_create_read_op()" leaks
the storage that "op" points to.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Sat, 1 Mar 2014 10:16:27 +0000 (11:16 +0100)]
ReplicatedBackend: check result of dynamic_cast to fix null pointer deref
CID 1188135 (#1 of 1): Unchecked dynamic_cast (FORWARD_NULL)
var_deref_model: Passing null pointer "t" to function
"RPGTransaction::get_transaction()", which dereferences it
Danny Al-Gaaf [Sat, 1 Mar 2014 00:24:37 +0000 (01:24 +0100)]
store_test.cc: fix unchecked return value
CID 1188126 (#1 of 1): Unchecked return value (CHECKED_RETURN)
2. check_return: Calling function "ObjectStore::stat(coll_t,
ghobject_t const &, stat *, bool)" without checking return value
(as is done elsewhere 8 out of 9 times).
3. unchecked_value: No check of the return value of "this->store->stat(
coll_t(this->cid), hoid, &buf, false)".
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Fri, 28 Feb 2014 23:19:58 +0000 (00:19 +0100)]
histogram.h: fix potential div by zero
CID 1188131 (#1 of 1): Division or modulo by zero (DIVIDE_BY_ZERO)
divide_by_zero: In expression "lower_sum * 1000000UL / total", division
by expression "total" which may be zero has undefined behavior
Added check for non zero total.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
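A sketch of such a guard around the quoted expression. The function name
and the choice to return 0 for an empty total are assumptions; the actual
patch may handle the zero case differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper wrapping the expression quoted by Coverity.
// Returns 0 when total is zero (an assumed fallback) instead of dividing.
uint64_t lower_fraction_ppm(uint64_t lower_sum, uint64_t total) {
  if (total == 0)
    return 0;
  return lower_sum * 1000000UL / total;
}
```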
Samuel Just [Wed, 26 Feb 2014 22:47:39 +0000 (14:47 -0800)]
OSD::handle_misdirected_op: handle ops to the wrong shard
OSD recomputes op target based on current OSDMap. With an EC pg, we can get
this result:
1) client at map 512 sends an op to osd 3, pg_t 3.9 based on mapping
[CRUSH_ITEM_NONE, 2, 3]/3
2) OSD 3 at map 513 remaps op to osd 3, spg_t 3.9s0 based on mapping [3, 2, 3]/3
3) PG 3.9s0 dequeues the op at epoch 512 and notices that it isn't
primary -- misdirected op
4) client resends and this time PG 3.9s0 having caught up to 513 gets it and
fulfils it
We can't compute the op target based on the sending map epoch due to
splitting. The simplest thing is to detect such cases in
OSD::handle_misdirected_op and drop them without an error (the client
will resend anyway).
David Zafman [Thu, 27 Feb 2014 23:27:56 +0000 (15:27 -0800)]
osd: stray pg ref on shutdown
Move agent_clear() so that it is no longer done only when becoming a replica.
Do it in clear_primary_state() whenever we stop being primary.
clear_primary_state() is now passed whether we are staying primary.
Add asserts in agent_stop(); agent_queue no longer needs to be cleared.
Fixes: #7458
Signed-off-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Wed, 26 Feb 2014 06:48:18 +0000 (22:48 -0800)]
osd/OSDMap: respect temp primary without temp acting
be2748c6d540891f2e1a62e7034cb44f7d04bf18 ensured that
if the temp acting mapping contains only CRUSH_ITEM_NONE,
the acting_primary is left at -1. However, even if
acting.empty(), we need to respect a temp_primary mapping.
Thus, use _acting_primary unless acting.empty() &&
acting_primary == -1.
Samuel Just [Tue, 25 Feb 2014 20:34:57 +0000 (12:34 -0800)]
OSDMonitor: when thrashing, only generate valid temp pg mappings
Since backfill peers are no longer placed into the acting set,
temp mappings will never exceed the pool size. Also, for ec
pools, temp mappings will never be less than the pool size.
Sage Weil [Mon, 24 Feb 2014 03:54:14 +0000 (19:54 -0800)]
ceph_test_objectstore: fix i386 build (again)
test/objectstore/store_test.cc: In member function ‘void SyntheticWorkloadState::read()’:
error: test/objectstore/store_test.cc:462:23: no matching function for call to ‘swap(uint64_t&, size_t&)’
Sage Weil [Mon, 24 Feb 2014 02:23:55 +0000 (18:23 -0800)]
mon/OSDMonitor: fix osdmap encode feature logic
If we are encoding a full map based on an old Incremental that does not
encode the features, fall back to the quorum features or (barring that)
all features. Do *not* fall back to no features, or else we will end up with
encode_client_old, which does not even include the extended info and will
cause the mon to crash when decoding.
This was observed when upgrading a 0.76 cluster to 0.77 (all mons stopped,
upgraded, and then started).
Reported-by: Aaron Ten Clay <aarontc@aarontc.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 22 Feb 2014 17:29:15 +0000 (09:29 -0800)]
mon/PGMap: fix osd_epochs update
The insert() call here does not overwrite a previous entry, which means
that the osd_epochs map is never moving forward in time. This seems to
have been broken since it was introduced in 091809b814.
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
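The std::map behaviour behind this bug can be shown in isolation. The map
type and helper names below are assumptions for illustration and are not
the PGMap code; the actual fix may be written differently.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Assumed shape of the map: osd id -> last reported epoch.
typedef std::map<int32_t, uint32_t> epoch_map;

// insert() is a no-op when the key already exists, so a stale epoch
// recorded earlier is never replaced.
void record_with_insert(epoch_map &m, int32_t osd, uint32_t epoch) {
  m.insert(std::make_pair(osd, epoch));
}

// operator[] followed by assignment overwrites any previous entry.
void record_with_assign(epoch_map &m, int32_t osd, uint32_t epoch) {
  m[osd] = epoch;
}
```

Calling record_with_insert twice for the same osd keeps the first epoch,
while record_with_assign moves the entry forward.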
Sage Weil [Sun, 23 Feb 2014 18:18:02 +0000 (10:18 -0800)]
client: fix possible null dereference in create
There are two paths that jump to the out label for which 'in' can be
NULL and outp can be non-NULL. For those cases we want to fill in the
caller's pointer value (they asked for it) but we clearly cannot take
a reference.
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
The FileStore's leveldb currently uses libleveldb's defaults for cache and
write buffer size, which are both 4 MB. Increase the cache size to 128MB and
the write buffer to 8MB.
Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org>
Signed-off-by: Sage Weil <sage@inktank.com>