Sage Weil [Sat, 22 Dec 2012 00:47:50 +0000 (16:47 -0800)]
osd: fix pg stat msgs vs timeout
We can get a pattern like so:
- new mon session
- after say 120 seconds, we decide to send a stats msg
- outstanding_pg_stats is finally true, we immediately time out (30 second
grace), and reconnect to a new mon
-> repeat
The problem is that we don't reset the last_sent timestamp when we send.
Or that we do this check after sending instead of before. Fix both.
This should resolve the issue #3661 where osds that don't have pgs
updating are not stats messags to the mon to check in, and are eventually
getting marked down as a result.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 21 Dec 2012 21:44:19 +0000 (13:44 -0800)]
monc: only warn about missing keyring if we fail to authenticate
This avoids the situation where a librados or other user with the default
of 'cephx,none' and no keyring is authenticating against a cluster with
required of 'none' and an annoying warning is generated every time. Now
we only print a helpful message if we actually failed.
Sage Weil [Fri, 21 Dec 2012 06:01:34 +0000 (22:01 -0800)]
osd: clear scrub state if queued scrub doesn't start
We set SCRUBBING when we queue a pg for scrub. If we dequeue and
call scrub() but abort for some reason (!active, degraded, etc.), clear
that state bit.
Bug is easily reproduced with 'ceph osd scrub N' during cluster startup
when PGs are peering; some PGs can get left in the scrubbing state.
Add ceph osd ls to help; make help for ceph osd tell N bench look
more like injectargs, which says <osd-id or *> to make it clear you
can benchmark all osds simultaneously
Sage Weil [Thu, 20 Dec 2012 21:48:06 +0000 (13:48 -0800)]
log: fix flush/signal race
We need to signal the cond in the same interval where we hold the lock
*and* modify the queue. Otherwise, we can have a race like:
queue has 1 item, max is 1.
A: enter submit_entry, signal cond, wait on condition
B: enter submit_entry, signal cond, wait on condition
C: flush wakes up, flushes 1 previous item
A: retakes lock, enqueues something, exits
B: retakes lock, condition fails, waits
-> C is never woken up as there are 2 items waiting
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Samuel Just [Thu, 20 Dec 2012 21:23:27 +0000 (13:23 -0800)]
OSD,ReplicatedPG: do not track notifies on the session
handle_notify_timeout and remove_notify currently do not clean up this
state leaving dangling Notification*. Further, we only use this mapping
in unwatch in order to determine which notifies to update. We can
accomplish the same thing by iterating through the obc->notifs mapping
since all notifications relevant for a given watch would have been for
the same obc as the watch.
Yehuda Sadeh [Wed, 19 Dec 2012 18:21:57 +0000 (10:21 -0800)]
rgw: don't try to assign content type if not found
Fixes: #3648
Cannot assign a NULL pointer into stl string. This is only
relevant to swift, when uploading an object without specifying
content type, and when the suffix cannot be determined.
Yehuda Sadeh [Thu, 20 Dec 2012 00:59:43 +0000 (16:59 -0800)]
rgw: don't initialize keystone if not set up
Fixes: #3653
No need to initialize keystone, including the keystone
revocation thread which was verbose if key stone was
not set up. This removes some unuseful errors from the
log.
Yehuda Sadeh [Wed, 19 Dec 2012 22:34:53 +0000 (14:34 -0800)]
rgw: remove useless configurable, fix swift auth error handling
Fixes: #3649
No need to have an extra configurable to use keystone. Use keystone
whenever keystone url has been specified. Also, fix a bad error
handling that turned a failure to authenticate into successfully
authenticating a bad user.
Sage Weil [Wed, 19 Dec 2012 03:21:24 +0000 (19:21 -0800)]
ceph: report error string to stderr, not stdout
If we return an error, send the message to stderr. This makes things
more easily scriptable because error messages won't take the place of
expected output.
mon: OSDMonitor: add option 'mon_max_pool_pg_num' and limit 'pg_num' accordingly
Instead of having a hardcoded default, use a configurable one. It is
limited to 65536 until future testing guarantees there is no side-effects
of increasing it past this value, but by being adjustable the user still
has the freedom to specify whatever maximum value he wants.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Gary Lowell [Sun, 16 Dec 2012 06:38:58 +0000 (22:38 -0800)]
Makefiles: Two new packages needed in the debian build depdencies.
The ceph test programs that are now being built by default require the junit
and libboost-program-options packages. These have been added to the build
dependecies in the debian control file.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
James Page [Wed, 12 Dec 2012 22:06:16 +0000 (22:06 +0000)]
Refactor rule file to separate arch/indep builds.
Prior to the ceph fs java bindings, all packages where
architecture depdendent so the packaging rules file
worked OK; this fixes up the binary-indep/arch targets
to split the builds of architecture dependent and
independent files.
Sage Weil [Sun, 16 Dec 2012 01:45:25 +0000 (17:45 -0800)]
osdc/Objecter: prevent pool dne check from invalidating scan_requests iterator
We iterate over ops and, if the pool dne and other conditions are true,
we will immediately return ENOENT and cancel an op. Increment the
iterator at the top of the loop to avoid invalidating it.
We also need to switch to a map<>, because hash_map<> mutations may
invalidate any/all iterators.
Fixes: #3613 Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Fri, 14 Dec 2012 22:34:35 +0000 (14:34 -0800)]
qa: add a workunit for fsync-tester
It turns out that our suites don't exercise fsync, at least not very much
(I couldn't find it in all the places I looked for it). This tester
was written by Ted T'so and updated by Chris Mason; I just made it
work on a smaller dataset (256MB) because 8GB against a small cluster takes
more time than we want to wait.
Alex Elder [Fri, 14 Dec 2012 21:58:39 +0000 (15:58 -0600)]
map-unmap.sh: use udevadm settle for synchronization
This script was heuristically using short sleep commands in order to
give udev activity time to complete.
There's a command "udevadm settle" which actually looks at the udev
queue and waits until its processing is done. Much, much better.
This rearranges the get_id function a bit too, breaking it into one
function that gets the id and another that loops back and tries
again after a short delay in the event the get_id fails.
Samuel Just [Fri, 14 Dec 2012 20:46:43 +0000 (12:46 -0800)]
ReplicatedPG: use default priority for Backfill messages
Backfill messages modify the stats on the replica and therefore
must be sent with the same priority as sub_op_modify to ensure
ordering. Using recovery_op_priority caused the following
sequence:
1) Primary(1) sends MOSDPGBackfill FINISH with updated stats (v1)
2) Primary(1) sends SubOp modify for new client op with stats (v2)
3) Replica(2) receives SubOp with stats (v2)
4) Replica(2) receives MOSDPGBackfill FINISH with stats (v1)
5) Replica(2) responds and Primary(1) resets pgtemp making
Replica(2) Primary(2)
6) PG stats on Primary(2) several ops old.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Samuel Just [Fri, 14 Dec 2012 20:43:08 +0000 (12:43 -0800)]
ReplicatedPG: do not use priority from client op
There are internal ordering requirements which may be sensitive
to assigned priority. We don't want a mix of priorities from
old clients with priorities from new clients causing trouble.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sam Lang [Fri, 14 Dec 2012 03:23:27 +0000 (17:23 -1000)]
client: Add config option to inject sleep for tick
Testing the tick delay with a fork/suspend is causing
corruption in the lockdep code. This approach uses
a config option to sleep the tick thread for a number
of seconds, avoiding the entire fork/suspend mess.
Josh Durgin [Tue, 11 Dec 2012 06:34:05 +0000 (22:34 -0800)]
rbd.py: check for new librbd methods before use
This way attempting to use format 2 images works when you upgrade the
python bindings before librbd, and attempting to use functions
that librbd does not have results in more understandable errors.
Sage Weil [Fri, 14 Dec 2012 00:26:43 +0000 (16:26 -0800)]
osd: up != acting okay on mkpg
This can happen when:
- mon sends create pg
- it gets created
- osd remaps the pg to a different osd
but osd does not update pg status to the mon
- mkpg resent to the new osd
or something along those lines. It seems unusual, but in the end who
really cares why the mon doesn't know about the pg creation yet.
Note that this check was added in the initial commit where acting/up was
added; there is no specific condition of concern we are protecting against.
Instead, ignore the message. We'll get a query soon anwyay.
This 'fixes' #3614.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
mon: OSDMonitor: don't allow creation of pools with > 65535 pgs
There are some limitations to the number of possible pg's per pool, and
by allowing the 'osd pool create' command to succeed, we were making room
to some anomalous behavior.
Dan Mick [Thu, 13 Dec 2012 22:06:17 +0000 (14:06 -0800)]
rbd: handle images disappearing while in ls -l
rbd.list() returns a list of names, but nothing stops them from
going away before rbd.open(); check for ENOENT and ignore if that
happens; warn on other errors
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Samuel Just [Thu, 13 Dec 2012 18:52:28 +0000 (10:52 -0800)]
OSD: put connection in disconnect_session_watches as well as the session
obc->watchers now has a ref to the connection as well. This piece of
disconnect_session_watchers essentially parallels remove_watcher and
should generally do the same thing.
Sam Lang [Thu, 13 Dec 2012 00:28:12 +0000 (14:28 -1000)]
filestore: Don't keep checking for syncfs if found
Valgrind outputs a warning for unrecognized system calls,
and does so for the syscall(__SYS_syncfs,...) and
syscall(__NR_syncfs, ...) calls. This patch avoids making
those calls (and the warning, when run in valgrind) if the
syncfs libc call is available.
INFO:teuthology.task.ceph.osd.1.err:--10568-- WARNING: unhandled syscall: 306
INFO:teuthology.task.ceph.osd.1.err:--10568-- You may be able to write your own handler.
INFO:teuthology.task.ceph.osd.1.err:--10568-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
INFO:teuthology.task.ceph.osd.1.err:--10568-- Nevertheless we consider this a bug. Please report
INFO:teuthology.task.ceph.osd.1.err:--10568-- it at http://valgrind.org/support/bug_reports.html.
Samuel Just [Wed, 12 Dec 2012 06:22:31 +0000 (22:22 -0800)]
PG,ReplicatedPG: handle_watch_timeout must not write during scrub/degraded
Currently, handle_watch_timeout will gladly write to an object while
that object is degraded or is being scrubbed. Now, we queue a
callback to be called on scrub completion or finish_degraded_object
to recall handle_watch_timeout. The callback mechanism assumes that
the registered callbacks assume they will be called with the pg
lock -- and no other locks -- already held.
The callback will release the obc and pg refs unconditionally. Thus,
we need to replace the unconnected_watchers pointer with NULL to
ensure that unregister_unconnected_watcher fails to cancel the
event and does not release the resources a second time.
Yehuda Sadeh [Fri, 30 Nov 2012 00:48:46 +0000 (16:48 -0800)]
rgw: option to provide alternative s3 put obj success code
Fixes: #3529
Added a new option: rgw_s3_success_create_obj_status.
Expected values are 0, 200, 201, 204. A value of 0
will skip the special handling altogether. Any value
other than the specified will default to 200.
Sage Weil [Wed, 12 Dec 2012 01:15:56 +0000 (17:15 -0800)]
os/JournalingObjectStore: un-break op quiescing during journal replay
Commit d9dce4e9273adb4279519d65a0d8bfdfecb5c516 broke journal replay
because the commit thread may try to do a commit, and the ops are not
being applied via the normal work queue. Add back in a simpler form of the
old op quiescing (simpler because there is a single thread doing the
replay).
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Josh Durgin [Tue, 11 Dec 2012 20:26:21 +0000 (12:26 -0800)]
st_rados_watch: tolerate extra notifies
With retries, it's possible for notifies to be received more than once
when they are resent to different OSDs, since the OSDs only track them
in memory.