Add ceph osd ls to help; make help for ceph osd tell N bench look
more like injectargs, which says <osd-id or *> to make it clear you
can benchmark all osds simultaneously
Sage Weil [Thu, 20 Dec 2012 21:48:06 +0000 (13:48 -0800)]
log: fix flush/signal race
We need to signal the cond in the same interval where we hold the lock
*and* modify the queue. Otherwise, we can have a race like:
queue has 1 item, max is 1.
A: enter submit_entry, signal cond, wait on condition
B: enter submit_entry, signal cond, wait on condition
C: flush wakes up, flushes 1 previous item
A: retakes lock, enqueues something, exits
B: retakes lock, condition fails, waits
-> C is never woken up as there are 2 items waiting
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
Yehuda Sadeh [Wed, 19 Dec 2012 18:21:57 +0000 (10:21 -0800)]
rgw: don't try to assign content type if not found
Fixes: #3648
Cannot assign a NULL pointer into stl string. This is only
relevant to swift, when uploading an object without specifying
content type, and when the suffix cannot be determined.
Yehuda Sadeh [Thu, 20 Dec 2012 00:59:43 +0000 (16:59 -0800)]
rgw: don't initialize keystone if not set up
Fixes: #3653
No need to initialize keystone, including the keystone
revocation thread which was verbose if key stone was
not set up. This removes some unuseful errors from the
log.
Yehuda Sadeh [Wed, 19 Dec 2012 22:34:53 +0000 (14:34 -0800)]
rgw: remove useless configurable, fix swift auth error handling
Fixes: #3649
No need to have an extra configurable to use keystone. Use keystone
whenever keystone url has been specified. Also, fix a bad error
handling that turned a failure to authenticate into successfully
authenticating a bad user.
Sage Weil [Wed, 19 Dec 2012 03:21:24 +0000 (19:21 -0800)]
ceph: report error string to stderr, not stdout
If we return an error, send the message to stderr. This makes things
more easily scriptable because error messages won't take the place of
expected output.
mon: OSDMonitor: add option 'mon_max_pool_pg_num' and limit 'pg_num' accordingly
Instead of having a hardcoded default, use a configurable one. It is
limited to 65536 until future testing guarantees there is no side-effects
of increasing it past this value, but by being adjustable the user still
has the freedom to specify whatever maximum value he wants.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Gary Lowell [Sun, 16 Dec 2012 06:38:58 +0000 (22:38 -0800)]
Makefiles: Two new packages needed in the debian build depdencies.
The ceph test programs that are now being built by default require the junit
and libboost-program-options packages. These have been added to the build
dependecies in the debian control file.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
James Page [Wed, 12 Dec 2012 22:06:16 +0000 (22:06 +0000)]
Refactor rule file to separate arch/indep builds.
Prior to the ceph fs java bindings, all packages where
architecture depdendent so the packaging rules file
worked OK; this fixes up the binary-indep/arch targets
to split the builds of architecture dependent and
independent files.
Sage Weil [Sun, 16 Dec 2012 01:45:25 +0000 (17:45 -0800)]
osdc/Objecter: prevent pool dne check from invalidating scan_requests iterator
We iterate over ops and, if the pool dne and other conditions are true,
we will immediately return ENOENT and cancel an op. Increment the
iterator at the top of the loop to avoid invalidating it.
We also need to switch to a map<>, because hash_map<> mutations may
invalidate any/all iterators.
Fixes: #3613 Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Fri, 14 Dec 2012 22:34:35 +0000 (14:34 -0800)]
qa: add a workunit for fsync-tester
It turns out that our suites don't exercise fsync, at least not very much
(I couldn't find it in all the places I looked for it). This tester
was written by Ted T'so and updated by Chris Mason; I just made it
work on a smaller dataset (256MB) because 8GB against a small cluster takes
more time than we want to wait.
Alex Elder [Fri, 14 Dec 2012 21:58:39 +0000 (15:58 -0600)]
map-unmap.sh: use udevadm settle for synchronization
This script was heuristically using short sleep commands in order to
give udev activity time to complete.
There's a command "udevadm settle" which actually looks at the udev
queue and waits until its processing is done. Much, much better.
This rearranges the get_id function a bit too, breaking it into one
function that gets the id and another that loops back and tries
again after a short delay in the event the get_id fails.
Samuel Just [Fri, 14 Dec 2012 20:46:43 +0000 (12:46 -0800)]
ReplicatedPG: use default priority for Backfill messages
Backfill messages modify the stats on the replica and therefore
must be sent with the same priority as sub_op_modify to ensure
ordering. Using recovery_op_priority caused the following
sequence:
1) Primary(1) sends MOSDPGBackfill FINISH with updated stats (v1)
2) Primary(1) sends SubOp modify for new client op with stats (v2)
3) Replica(2) receives SubOp with stats (v2)
4) Replica(2) receives MOSDPGBackfill FINISH with stats (v1)
5) Replica(2) responds and Primary(1) resets pgtemp making
Replica(2) Primary(2)
6) PG stats on Primary(2) several ops old.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Samuel Just [Fri, 14 Dec 2012 20:43:08 +0000 (12:43 -0800)]
ReplicatedPG: do not use priority from client op
There are internal ordering requirements which may be sensitive
to assigned priority. We don't want a mix of priorities from
old clients with priorities from new clients causing trouble.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sam Lang [Fri, 14 Dec 2012 03:23:27 +0000 (17:23 -1000)]
client: Add config option to inject sleep for tick
Testing the tick delay with a fork/suspend is causing
corruption in the lockdep code. This approach uses
a config option to sleep the tick thread for a number
of seconds, avoiding the entire fork/suspend mess.
Josh Durgin [Tue, 11 Dec 2012 06:34:05 +0000 (22:34 -0800)]
rbd.py: check for new librbd methods before use
This way attempting to use format 2 images works when you upgrade the
python bindings before librbd, and attempting to use functions
that librbd does not have results in more understandable errors.
Sage Weil [Fri, 14 Dec 2012 00:26:43 +0000 (16:26 -0800)]
osd: up != acting okay on mkpg
This can happen when:
- mon sends create pg
- it gets created
- osd remaps the pg to a different osd
but osd does not update pg status to the mon
- mkpg resent to the new osd
or something along those lines. It seems unusual, but in the end who
really cares why the mon doesn't know about the pg creation yet.
Note that this check was added in the initial commit where acting/up was
added; there is no specific condition of concern we are protecting against.
Instead, ignore the message. We'll get a query soon anwyay.
This 'fixes' #3614.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
mon: OSDMonitor: don't allow creation of pools with > 65535 pgs
There are some limitations to the number of possible pg's per pool, and
by allowing the 'osd pool create' command to succeed, we were making room
to some anomalous behavior.
Dan Mick [Thu, 13 Dec 2012 22:06:17 +0000 (14:06 -0800)]
rbd: handle images disappearing while in ls -l
rbd.list() returns a list of names, but nothing stops them from
going away before rbd.open(); check for ENOENT and ignore if that
happens; warn on other errors
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Samuel Just [Thu, 13 Dec 2012 18:52:28 +0000 (10:52 -0800)]
OSD: put connection in disconnect_session_watches as well as the session
obc->watchers now has a ref to the connection as well. This piece of
disconnect_session_watchers essentially parallels remove_watcher and
should generally do the same thing.
Sam Lang [Thu, 13 Dec 2012 00:28:12 +0000 (14:28 -1000)]
filestore: Don't keep checking for syncfs if found
Valgrind outputs a warning for unrecognized system calls,
and does so for the syscall(__SYS_syncfs,...) and
syscall(__NR_syncfs, ...) calls. This patch avoids making
those calls (and the warning, when run in valgrind) if the
syncfs libc call is available.
INFO:teuthology.task.ceph.osd.1.err:--10568-- WARNING: unhandled syscall: 306
INFO:teuthology.task.ceph.osd.1.err:--10568-- You may be able to write your own handler.
INFO:teuthology.task.ceph.osd.1.err:--10568-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
INFO:teuthology.task.ceph.osd.1.err:--10568-- Nevertheless we consider this a bug. Please report
INFO:teuthology.task.ceph.osd.1.err:--10568-- it at http://valgrind.org/support/bug_reports.html.
Samuel Just [Wed, 12 Dec 2012 06:22:31 +0000 (22:22 -0800)]
PG,ReplicatedPG: handle_watch_timeout must not write during scrub/degraded
Currently, handle_watch_timeout will gladly write to an object while
that object is degraded or is being scrubbed. Now, we queue a
callback to be called on scrub completion or finish_degraded_object
to recall handle_watch_timeout. The callback mechanism assumes that
the registered callbacks assume they will be called with the pg
lock -- and no other locks -- already held.
The callback will release the obc and pg refs unconditionally. Thus,
we need to replace the unconnected_watchers pointer with NULL to
ensure that unregister_unconnected_watcher fails to cancel the
event and does not release the resources a second time.
Yehuda Sadeh [Fri, 30 Nov 2012 00:48:46 +0000 (16:48 -0800)]
rgw: option to provide alternative s3 put obj success code
Fixes: #3529
Added a new option: rgw_s3_success_create_obj_status.
Expected values are 0, 200, 201, 204. A value of 0
will skip the special handling altogether. Any value
other than the specified will default to 200.
Sage Weil [Wed, 12 Dec 2012 01:15:56 +0000 (17:15 -0800)]
os/JournalingObjectStore: un-break op quiescing during journal replay
Commit d9dce4e9273adb4279519d65a0d8bfdfecb5c516 broke journal replay
because the commit thread may try to do a commit, and the ops are not
being applied via the normal work queue. Add back in a simpler form of the
old op quiescing (simpler because there is a single thread doing the
replay).
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Josh Durgin [Tue, 11 Dec 2012 20:26:21 +0000 (12:26 -0800)]
st_rados_watch: tolerate extra notifies
With retries, it's possible for notifies to be received more than once
when they are resent to different OSDs, since the OSDs only track them
in memory.
Yehuda Sadeh [Tue, 11 Dec 2012 21:41:50 +0000 (13:41 -0800)]
mds: shutdown cleanly if can't authenticate
Fixes: #3590
This was triggered when tried to run mds with cephx enabled
against a mon without cephx support. We didn't handle the
returned error at all, so this one fixes it. It also makes
sure that we don't continue initialization until rotating
keys are in place (as the osd does).
Josh Durgin [Tue, 11 Dec 2012 17:54:44 +0000 (09:54 -0800)]
objecter: don't use new tid when retrying notifies
Watches update the on-disk state in the OSD, and aren't idempotent,
so refreshing them must be treated as a separate transaction by the OSD.
Notifies are just in-memory state, and resending them will result in
acceptable behavior:
- if it's the same osd, the resent op will be recognized as a duplicate
- if it's a different osd, a new notify will be triggered since the new osd
can't tell whether the original notify was received by any watchers
Using a new tid for each resend can cause some unecessary extra work,
as the first case turns into the second.
Rename operation can call predirty_journal_parents() several times.
So a directory fragment's rstat can also be modified several times.
But only the first modification is journaled because EMetaBlob::add_dir()
does not update existing dirlump.
For example: when hanlding 'mv a/b/c a/c', Server::_rename_prepare may
first decrease directory a and b's nested files count by one, then
increases directory a's nested files count by one.
Sage Weil [Tue, 11 Dec 2012 00:41:19 +0000 (16:41 -0800)]
config: do not always print config file missing errors
Do not generate errors each time we fail to open a config file; only
generate one at the end if a search path was specified and none were
usable, right before we (already) exit. This avoids spamming stderr
about each path we tried in the search list before we found a good one.