Sage Weil [Mon, 12 Jan 2015 22:00:21 +0000 (14:00 -0800)]
osd: enable filestore_extsize by default
Note that this will only get used if the kernel is new enough; if it is
older than 3.5 the option will get disabled and extsize will not be used
even if the option is set to true.
Sage Weil [Mon, 12 Jan 2015 21:59:39 +0000 (13:59 -0800)]
os/FileStore: verify kernel is new enough before using extsize ioctl
Old kernels have an XFS bug that exposes uninitialized data when the
extsize hint is set and only partially written. This is fixed by Linux
commit aff3a9edb7080f69f07fe76a8bd089b3dfa4cb5d, documented in XFS bug
http://oss.sgi.com/bugzilla/show_bug.cgi?id=874, and tested by XFS
test xfs/229 to prevent regressions.
Notably, the original bug affects kernel 3.2, which is widely deployed with
Ubuntu 12.04 (precise).
Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
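As an illustration of the kind of check described above, here is a minimal sketch that reads the running kernel release with uname(2) and only allows the extsize hint on 3.5 or newer. The helper names are hypothetical and this is not the FileStore code itself; an unparsable release string is treated as "too old" as a conservative assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/utsname.h>

// Hypothetical helper: extract "major.minor" from a release string such as
// "3.2.0-23-generic". Returns false if the string does not start with two
// dot-separated integers.
bool parse_kernel_release(const char *release, int *major, int *minor) {
  return std::sscanf(release, "%d.%d", major, minor) == 2;
}

// Hypothetical helper: true if the given release is 3.5 or newer.
bool release_allows_extsize(const char *release) {
  int major = 0, minor = 0;
  if (!parse_kernel_release(release, &major, &minor))
    return false;  // unknown version: assume the kernel is too old
  return major > 3 || (major == 3 && minor >= 5);
}

// Hypothetical helper: apply the check to the running kernel.
bool running_kernel_allows_extsize() {
  struct utsname u;
  if (uname(&u) != 0)
    return false;  // cannot tell: keep the option disabled
  return release_allows_extsize(u.release);
}
```

With this sketch, a configured "true" value would still be overridden whenever running_kernel_allows_extsize() returns false, which matches the behaviour the commit message describes.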
John Spray [Mon, 5 Jan 2015 19:34:57 +0000 (19:34 +0000)]
mon: implement `fs reset`
This is for use in CephFS disaster recovery. When
the metadata pool has been forcibly reset to a single-MDS
metadata tree, we would like to reset the MDSMap to match.
Sage Weil [Mon, 17 Nov 2014 20:46:51 +0000 (12:46 -0800)]
osd/ReplicatedPG: drop unnecessary cache_mode checks
This currently enumerates all cache modes except none, and we don't
arrive in this function when caching is disabled. And creating a whiteout
is not cache_mode dependent. Simplify!
Loic Dachary [Thu, 4 Dec 2014 11:15:30 +0000 (12:15 +0100)]
erasure-code: test repair when file is removed
Add tests for when files disappear from the file system:
* file is missing from the primary OSD
* file is missing from an OSD that is not the primary
* files are missing from two OSDs that are not the primary
* files are missing from two OSDs, one of which is the primary
Loic Dachary [Thu, 4 Dec 2014 10:44:34 +0000 (11:44 +0100)]
osd: accumulate authoritative peers during recovery
When PGBackend::be_compare_scrubmaps finds multiple good peers, it only
keeps the last one. This is fine for replication but erasure coding
needs to know all good peers for recovery.
PGBackend::be_compare_scrubmaps is modified to accumulate all good peers
and return them to PGBackend::be_compare_scrubmaps and indirectly to
PG::scrub_compare_maps.
PG::scrub_compare_maps will dispatch the good peers to authmap and
good_peers. In the case of authmap, the data structure is not modified
and only the last good peer is set. The ReplicatedPG::_scrub uses
authmap in a nontrivial way and it should probably be modified to use
information from multiple good peers instead of just the last one. This
could be the focus of another change.
The scrubber.authoritative data structure is changed to include a list
of pair<ScrubMap::object, pg_shard_t> instead of a single
pair<ScrubMap::object, pg_shard_t> to pass to PG::repair_object and
allow it to add all good peers to the missing_loc locations if the
primary has a missing object. It could be just a list of pg_shard_t
instead because the ScrubMap::object is not used, but it makes more sense
to keep both, and it will presumably be useful when / if the logic changes.
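The change in data structure can be sketched as follows. This is only an illustration with stand-in types: ObjectInfo and Shard are hypothetical substitutes for ScrubMap::object and pg_shard_t, and PeerCopy and collect_good_peers are invented for the example. The point is that every good peer is appended to a per-object list instead of overwriting a single entry.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct ObjectInfo { unsigned long size; };  // stand-in for ScrubMap::object
using Shard = int;                          // stand-in for pg_shard_t

// Hypothetical record: one peer's copy of one object, and whether it is good.
struct PeerCopy {
  Shard shard;
  std::string oid;
  ObjectInfo info;
  bool good;
};

// Per-object list of (object, shard) pairs, one entry per good peer.
using Authoritative =
    std::map<std::string, std::list<std::pair<ObjectInfo, Shard>>>;

// Accumulate all good peers rather than keeping only the last one seen.
Authoritative collect_good_peers(const std::vector<PeerCopy> &copies) {
  Authoritative auth;
  for (const PeerCopy &c : copies) {
    if (c.good)
      auth[c.oid].push_back(std::make_pair(c.info, c.shard));
  }
  return auth;
}
```

A repair step could then walk the whole list for an object and register each shard as a possible source, which is what erasure-coded recovery needs.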
Sage Weil [Thu, 8 Jan 2015 19:10:45 +0000 (11:10 -0800)]
osd: assert there is a peering event
This became conditional way back in 12e22b3d44eba51a70d8babebc2684f0c46575a7
for unclear reasons. It probably predates the in_use checks. In any case,
at this point, we should only arrive here if the PG was queued, implying
that there will always be an event to process.
Sage Weil [Thu, 8 Jan 2015 21:34:52 +0000 (13:34 -0800)]
osd: requeue PG when we skip handling a peering event
If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.
The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.
Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and has bitten
lots of people on dumpling.
Backport: giant, firefly
Fixes: #10431
Signed-off-by: Sage Weil <sage@redhat.com>
Matt Richards [Thu, 8 Jan 2015 21:16:17 +0000 (13:16 -0800)]
librados: Translate operation flags from C APIs
The operation flags in the public C API are a distinct enum
and need to be translated to Ceph OSD flags, as already happens in
the C++ API. It seems like the C enum and the C++ enum consciously
use the same values, so I reused the C++ translation function.
Signed-off-by: Matthew Richards <mattjrichards@gmail.com>
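A translation of this kind might look like the sketch below. The enum names and numeric values are hypothetical and chosen only for illustration; they are not the librados or OSD definitions. The idea is that public flags are mapped bit by bit to internal flags rather than passed through unchanged.

```cpp
#include <cassert>

// Hypothetical public API flags (illustrative values only).
enum ApiOpFlag {
  API_OP_BALANCE_READS = 1,
  API_OP_LOCALIZE_READS = 2,
  API_OP_ORDER_READS_WRITES = 4,
};

// Hypothetical internal OSD flags (illustrative values only).
enum OsdFlag {
  OSD_FLAG_BALANCE_READS = 0x0100,
  OSD_FLAG_LOCALIZE_READS = 0x0200,
  OSD_FLAG_RWORDERED = 0x4000,
};

// Map each known public flag to its internal counterpart.
// Bits that are not recognised are dropped.
int translate_flags(int api_flags) {
  int osd_flags = 0;
  if (api_flags & API_OP_BALANCE_READS)
    osd_flags |= OSD_FLAG_BALANCE_READS;
  if (api_flags & API_OP_LOCALIZE_READS)
    osd_flags |= OSD_FLAG_LOCALIZE_READS;
  if (api_flags & API_OP_ORDER_READS_WRITES)
    osd_flags |= OSD_FLAG_RWORDERED;
  return osd_flags;
}
```

Because the C and C++ enums are said to share values, a single function of this shape can serve both entry points.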
John Spray [Wed, 7 Jan 2015 12:37:40 +0000 (12:37 +0000)]
mon/MDSMonitor: fix `mds fail` for standby MDSs
This command takes a gid, rank or name, but
in the name case it would previously only work if
the named daemon had a rank assigned (mds_info->rank >= 0),
otherwise it would fail silently.
John Spray [Wed, 7 Jan 2015 11:47:34 +0000 (11:47 +0000)]
mon/MDSMonitor: respect MDSMAP_DOWN when promoting standbys
Previously, a standby could become active even if 'cluster_down'
had been run. This was awkward, because it would get you a
"laggy or crashed" mds for the standby that was actually
up and running, just being ignored because of cluster_down.
Loic Dachary [Fri, 19 Dec 2014 14:54:33 +0000 (15:54 +0100)]
init-ceph: stop returns before daemons are dead
The existence of the pidfile must be checked outside of the loop to send
a signal to the daemon. Otherwise the daemon will remove the pidfile and
stop can return before the process is dead because it only checks
/proc/$pid if the pidfile exists.
Danny Al-Gaaf [Wed, 7 Jan 2015 08:34:07 +0000 (09:34 +0100)]
osd/ClassHandler.cc: move stat into error handling
Checking that the class file exists before opening it has no security
benefit, and the file could be removed or replaced between the stat and
the open. Instead, open it directly and handle the failure. Only
afterwards check whether the file was missing, to choose the debug
message and error code.
Make sure cls->status is set if the class open call fails.
To solve Coverity issue:
CID 743419 (#1 of 1): Time of check time of use (TOCTOU)
fs_check_call: Calling function stat to perform check on fname.
743419 Time of check time of use
An attacker could change the filename's file association or
other attributes between the check and use.
In ClassHandler::_load_class(ClassHandler::ClassData *): A check
occurs on a file's attributes before the file is used in a
privileged operation, but things may have changed (CWE-367)
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
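The open-first pattern can be sketched with plain open(2), as below. This is an assumption-laden illustration and not the ClassHandler code: the helper name is hypothetical and the real loader may use a different call to load the class. The failure reason is taken from errno after the open attempt, so there is no window between a check and the use.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: try to open the file directly, with no prior stat().
// Returns a file descriptor on success or a negative errno value on failure.
int open_class_file(const char *fname) {
  int fd = ::open(fname, O_RDONLY);
  if (fd < 0) {
    // Inspect the reason only after the open has failed. A caller can
    // compare the result with -ENOENT to report a missing file
    // differently from other errors.
    return -errno;
  }
  return fd;
}
```

A caller would log "not found" for -ENOENT and a generic load error otherwise, and would record the failed status in both cases.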
Danny Al-Gaaf [Fri, 2 Jan 2015 21:30:24 +0000 (22:30 +0100)]
crush/crush.c: prevent DIVIDE_BY_ZERO
Fix for:
CID 1219471 (#1 of 1): Division or modulo by zero (DIVIDE_BY_ZERO)
divide_by_zero: In function call crush_make_uniform_bucket,
division by expression item_weight which may be zero has undefined
behavior.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
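One way to guard such a division is shown below. This is a sketch under stated assumptions, not the patch itself: the function name is hypothetical, and it assumes the division appears in an overflow check on size * item_weight. A zero item weight is handled before any division takes place.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: compute size * item_weight for a uniform bucket.
// Returns false if the product would overflow 32 bits; never divides by zero.
bool uniform_bucket_weight(uint32_t size, uint32_t item_weight,
                           uint32_t *total) {
  if (item_weight == 0) {
    *total = 0;  // zero-weight items: the product is trivially zero
    return true;
  }
  // Safe: item_weight is known to be non-zero here.
  if (size > UINT32_MAX / item_weight)
    return false;  // multiplication would overflow
  *total = size * item_weight;
  return true;
}
```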