Sage Weil [Tue, 22 Jul 2014 20:16:11 +0000 (13:16 -0700)]
osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e, we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation. We don't care about staration here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.
Fixes: #8889 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Tue, 22 Jul 2014 20:11:42 +0000 (13:11 -0700)]
osd: allow greedy get_write() for ObjectContext locks
There are several lockers that need to take a write lock
because there is an operation that is already in progress and
know it is safe to do so. In particular, they need to skip
the starvation checks (op waiters, backfill waiting).
osd: init local_connection for fast_dispatch in _send_boot()
We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.
That led to errors like the following:
When do ec-read, i met a bug which was occured 100%. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)
ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
5: (()+0x8182) [0x7f7665670182]
6: (clone()+0x6d) [0x7f7663a1130d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9061988ec7eaa922e2b303d9eece86e7c8ee0fa1)
Only decode + characters to spaces if we're in a query argument. The +
query argument. The + => ' ' translation is not correct for
file/directory names.
Resolves http://tracker.ceph.com/issues/8702
Reviewed-by: Yehuda Sadeh <yehuda@redhat.com> Signed-off-by: Brian Rak <dn@devicenull.org>
Sage Weil [Thu, 17 Jul 2014 23:40:06 +0000 (16:40 -0700)]
logrotate.conf: fix osd log rotation under upstart
In commit 7411c3c6a42bef5987bdd76b1812b01686303502 we generalized this
enumeration code by copying what was in the upstart scripts. However,
while the mon and mds directories get a 'done' file, the OSDs get a 'ready'
file. Bah! Trigger off of either one.
Backport: firefly Signed-off-by: Sage Weil <sage@redhat.com>
rgw: don't try to wait for pending if list is empty
Fixes: #8846
Backport: firefly, dumpling
This was broken at ea68b9372319fd0bab40856db26528d36359102e. We ended
up calling wait_pending_front() when pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).
Sage Weil [Wed, 16 Jul 2014 01:11:41 +0000 (18:11 -0700)]
init-ceph: wrap daemon startup with systemd-run when running under systemd
We want to make sure the daemon runs in its own systemd environment. Check
for systemd as pid 1 and, when present, use systemd-run -r <cmd> to do
this.
Probably fixes #7627
Signed-off-by: Sage Weil <sage@redhat.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Tested-by: Dan Mick <dan.mick@inktank.com>
Sage Weil [Fri, 11 Jul 2014 18:31:22 +0000 (11:31 -0700)]
osd/osd_types: be pedantic about encoding last_force_op_resend without feature bit
The addition of the value is completely backward compatible, but if the
mon feature bits don't match it can cause monitor scrub noice (due to the
parallel OSDMap encoding). Avoid that by only adding the new field if the
feature (which was added 2 patches after the encoding, see 3152faf79f498a723ae0fe44301ccb21b15a96ab and 45e79a17a932192995f8328ae9f6e8a2a6348d10.
Fixes: #8815
Backport: firefly Signed-off-by: Sage Weil <sage@redhat.com>
Revert "qa: add an fsx run which turns on kernel debugging"
This reverts commit 29c33f0c057acc4e0f4e5022c97553a2dc095b21.
We don't need the debugging any more, and having two separate fsx runners
already caused one update-in-the-wrong-place issue.
Sage Weil [Fri, 9 May 2014 15:41:33 +0000 (08:41 -0700)]
osd: cancel agent_timer events on shutdown
We need to cancel all agent timer events on shutdown. This also needs to
happen early so that any in-progress events will execute before we start
flushing and cleaning up PGs.
Backport: firefly Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 8 Jul 2014 23:11:44 +0000 (16:11 -0700)]
osd: s/applying repop/canceling repop/
The 'applying' language dates back to when we would wait for acks from
replicas before applying writes locally. We don't do any of that any more;
now, this loop just cancels the repops with remove_repop() and some other
cleanup.
Sage Weil [Tue, 8 Jul 2014 23:10:58 +0000 (16:10 -0700)]
osd: separate cleanup from PGBackend::on_change()
The generic portion of on_change() cleaned up temporary on-disk objects
and requires a Transaction. The rest is clearing out in-memory state and
does not. Separate the two.
Sage Weil [Tue, 1 Jul 2014 21:31:11 +0000 (14:31 -0700)]
osd: clear Sessions for loopback Connections on shutdown
Starting with the fast dispatch patches, we are calling the handle_connect
on loopback. Make sure we zap them on shutdown to break the Session <->
Connection ref cycle.
mon: check changes to the whole CRUSH map and to tunables against cluster features
When we change the tunables, or set a new CRUSH map, we need to make sure it's
supported by all the monitors and OSDs currently participating in the cluster.
If the test is run against a cluster started with vstart.sh (which is
the case for make check), the --asok-does-not-need-root disables the use
of sudo and allows the test to run without requiring privileged user
permissions.
Sage Weil [Wed, 2 Jul 2014 17:38:43 +0000 (10:38 -0700)]
qa/workunits/rest/test.py: make osd create test idempotent
Avoid possibility that we create multiple OSDs do to retries by passing in
the optional uuid arg. (A stray osd id will make the osd tell tests a
few lines down fail.)
Fixes: #8728 Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Fri, 27 Jun 2014 21:59:23 +0000 (14:59 -0700)]
OSD: await_reserved_maps() prior to calling mark_down
send_message_osd_cluster() et al are *trying* to protect their Connection
lookups (and not re-open zapped Connections) via map reservations, but
that only works if we know that we haven't already called mark_down() on
the entities they might be looking up. So we need to await_reserved_maps
before we do any mark_down calls.
Since the waiting might take some time (fast dispatch in progress), only do
so if we are actually going to mark somebody down.
Fixes: #8512 Signed-off-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Mon, 16 Jun 2014 18:48:24 +0000 (11:48 -0700)]
rgw: allocate enough space for bucket instance id
Fixes: #8608
Backport: dumpling, firefly
Bucket instance id is a concatenation of zone name, rados instance id,
and a running counter. We need to allocate enough space to account zone
name length.
Danny Al-Gaaf [Thu, 26 Jun 2014 03:22:02 +0000 (05:22 +0200)]
common/fd.cc: fix possible out-of-bounds write
Read max 'sizeof(target) - 1' to not write out of bound
later on the 'target[r] = 0;' call in case we read the
full PATH_MAX.
CID 1128416 (#1 of 1): Out-of-bounds write (OVERRUN)
overrun-local: Overrunning array target of 4096 bytes
at byte offset 4096 using index r (which evaluates to 4096).
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Commit 7dc93a9651f602d9c46311524fc6b54c2f1ac595 fixed an incorrect
behavior with the OSD's 'osd bench' value hard-caps. The test wasn't
appropriately modified unfortunately.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
osd: OSD: better explanation on 'max_count' calculation for 'osd bench'
'max_count' is the maximum number of bytes that we are to allow for an
'osd bench' command. This value is a hard-cap that takes into account
a predefined throughput, the 'osd bench' duration and, for a rather large
block size, can be taken as the amount of bytes that we would ever be
able to write in that period of time.
The explanation wasn't appropriate enough, hence this patch, which
hopefully will avoid confusion in the future.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
qa/workunits: cephtool: split into properly indented functions
The test was a big sequence of commands being run and it has been growing
organically for a while, even though it has maintained a sense of
locality with regard to the portions being tested.
This patch intends to split the commands into functions, allowing for a
better semantic context and easier expansion. On the other hand, this
will also allow us to implement mechanisms to run specific portions of
the test instead of always having to run the whole thing just to test a
couple of lines down at the bottom (or have to creatively edit the test).
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Danny Al-Gaaf [Tue, 24 Jun 2014 17:54:17 +0000 (19:54 +0200)]
test/ceph-disk.sh: fix for SUSE
On SUSE 'which' returns always the full path of (shell) commands and
not e.g. './ceph-conf' as on Debian. Add check also for full
path returned by which.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Danny Al-Gaaf [Wed, 25 Jun 2014 07:56:52 +0000 (09:56 +0200)]
osdmaptool/test-map-pgs.t: fix escaping to fix run
Run failed always running into the '|| cat $OUT' case due
to bad escaping of '\t'. This is caused by different shells
on different distros (e.g. bash on SUSE vs dash on Ubuntu).
Use 'grep -P ' and fix the regex to make it shell independet.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Accela Zhao [Wed, 18 Jun 2014 09:17:03 +0000 (17:17 +0800)]
Make <poolname> in "ceph osd tier --help" clearer.
The ceph osd tier --help info on the left always says <poolname>.
It is unclear which one to put <tierpool> on the right.
$ceph osd tier --help
osd tier add <poolname> <poolname> {-- add the tier <tierpool> to base pool
force-nonempty} <pool>
osd tier add-cache <poolname> add a cache <tierpool> of size <size>
<poolname> <int[0-]> to existing pool <pool>
...
This patch modifies description on the right to tell which <poolname>:
osd tier add <poolname> <poolname> {-- add the tier <tierpool> (the second
force-nonempty} one) to base pool <pool> (the first
one)
...
Loic Dachary [Wed, 25 Jun 2014 16:07:47 +0000 (18:07 +0200)]
osd: workaround race condition in tests
Trying to "ceph tell" a newly created OSD sometime triggers an
ENXIO. The OSD creation function used for test scripts waits for the OSD
to report that it is up (according to ceph osd tree) before
returning. That reduces (maybe close ?) the condition that triggers the
error so that the tests do not randomly fail.