mon: OSDMonitor: add option 'mon_max_pool_pg_num' and limit 'pg_num' accordingly
Instead of having a hardcoded default, use a configurable one. It is
limited to 65536 until future testing guarantees there are no side effects
of increasing it past this value, but by being adjustable the user still
has the freedom to specify whatever maximum value they want.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
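A minimal sketch of the kind of check described above, not the monitor's actual code. The function name, the error codes chosen, and the way the option is passed in are assumptions for illustration; only the option name and the 65536 default come from the commit message.

```python
import errno

# Default taken from the commit message; the constant name is hypothetical.
DEFAULT_MON_MAX_POOL_PG_NUM = 65536

def check_pg_num(pg_num, mon_max_pool_pg_num=DEFAULT_MON_MAX_POOL_PG_NUM):
    """Return 0 if pg_num is acceptable, else a negative errno value."""
    if pg_num <= 0:
        return -errno.EINVAL
    if pg_num > mon_max_pool_pg_num:
        # Reject instead of silently clamping, so the user sees the limit.
        return -errno.ERANGE
    return 0
```

Raising the configured maximum changes the outcome for the same request, which is the point of making the limit adjustable.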
Sage Weil [Sun, 16 Dec 2012 01:45:25 +0000 (17:45 -0800)]
osdc/Objecter: prevent pool dne check from invalidating scan_requests iterator
We iterate over ops and, if the pool does not exist (dne) and other conditions are true,
we will immediately return ENOENT and cancel an op. Increment the
iterator at the top of the loop to avoid invalidating it.
We also need to switch to a map<>, because hash_map<> mutations may
invalidate any/all iterators.
Fixes: #3613 Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Fri, 14 Dec 2012 22:34:35 +0000 (14:34 -0800)]
qa: add a workunit for fsync-tester
It turns out that our suites don't exercise fsync, at least not very much
(I couldn't find it in all the places I looked for it). This tester
was written by Ted Ts'o and updated by Chris Mason; I just made it
work on a smaller dataset (256MB) because 8GB against a small cluster takes
more time than we want to wait.
Alex Elder [Fri, 14 Dec 2012 21:58:39 +0000 (15:58 -0600)]
map-unmap.sh: use udevadm settle for synchronization
This script was heuristically using short sleep commands in order to
give udev activity time to complete.
There's a command "udevadm settle" which actually looks at the udev
queue and waits until its processing is done. Much, much better.
This rearranges the get_id function a bit too, breaking it into one
function that gets the id and another that loops back and tries
again after a short delay in the event the get_id fails.
Samuel Just [Fri, 14 Dec 2012 20:46:43 +0000 (12:46 -0800)]
ReplicatedPG: use default priority for Backfill messages
Backfill messages modify the stats on the replica and therefore
must be sent with the same priority as sub_op_modify to ensure
ordering. Using recovery_op_priority caused the following
sequence:
1) Primary(1) sends MOSDPGBackfill FINISH with updated stats (v1)
2) Primary(1) sends SubOp modify for new client op with stats (v2)
3) Replica(2) receives SubOp with stats (v2)
4) Replica(2) receives MOSDPGBackfill FINISH with stats (v1)
5) Replica(2) responds and Primary(1) resets pgtemp making
Replica(2) Primary(2)
6) PG stats on Primary(2) are several ops old.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
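The reordering in the sequence above can be reproduced with a toy priority queue. This is an illustrative model only: the class, the message strings, and the numeric priorities are assumptions, not the messenger's real implementation or Ceph's real default values.

```python
import heapq
import itertools

class PriorityChannel:
    """Toy model: higher priority is delivered first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def send(self, priority, msg):
        # Negate the priority so the largest value pops first; the sequence
        # number keeps messages of equal priority in send order.
        heapq.heappush(self._heap, (-priority, next(self._seq), msg))

    def drain(self):
        out = []
        while self._heap:
            out.append(heapq.heappop(self._heap)[2])
        return out

def deliver(backfill_priority, subop_priority):
    ch = PriorityChannel()
    ch.send(backfill_priority, "backfill FINISH stats v1")
    ch.send(subop_priority, "sub_op_modify stats v2")
    return ch.drain()

# Hypothetical priority values, chosen only to show the effect.
reordered = deliver(backfill_priority=10, subop_priority=63)
in_order = deliver(backfill_priority=63, subop_priority=63)
```

With a lower priority on the backfill message, the replica sees the v2 stats before the v1 stats; with equal priorities the send order is preserved.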
Samuel Just [Fri, 14 Dec 2012 20:43:08 +0000 (12:43 -0800)]
ReplicatedPG: do not use priority from client op
There are internal ordering requirements which may be sensitive
to assigned priority. We don't want a mix of priorities from
old clients with priorities from new clients causing trouble.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sam Lang [Fri, 14 Dec 2012 03:23:27 +0000 (17:23 -1000)]
client: Add config option to inject sleep for tick
Testing the tick delay with a fork/suspend is causing
corruption in the lockdep code. This approach uses
a config option to sleep the tick thread for a number
of seconds, avoiding the entire fork/suspend mess.
Josh Durgin [Tue, 11 Dec 2012 06:34:05 +0000 (22:34 -0800)]
rbd.py: check for new librbd methods before use
This way attempting to use format 2 images works when you upgrade the
python bindings before librbd, and attempting to use functions
that librbd does not have results in more understandable errors.
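A rough sketch of the idea, using a stand-in object instead of the real shared library. The class names, the exception name, and the simplified argument lists are assumptions for illustration; the real bindings load librbd through ctypes and pass more arguments.

```python
class FakeOldLibrbd:
    """Stand-in for an older librbd that lacks newer entry points."""

    def rbd_create(self, name, size):
        return 0

class FunctionNotSupported(Exception):
    """Hypothetical error raised when librbd lacks a needed function."""

def create_image(lib, name, size, old_format=True):
    if old_format:
        return lib.rbd_create(name, size)
    # Probe for the newer symbol before calling it, so the caller gets a
    # readable error instead of an AttributeError from deep in the bindings.
    if not hasattr(lib, "rbd_create2"):
        raise FunctionNotSupported(
            "installed librbd does not support format 2 images")
    return lib.rbd_create2(name, size)
```

Format 1 requests keep working against the old library, while format 2 requests fail with a message that names the actual problem.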
Sage Weil [Fri, 14 Dec 2012 00:26:43 +0000 (16:26 -0800)]
osd: up != acting okay on mkpg
This can happen when:
- mon sends create pg
- it gets created
- osd remaps the pg to a different osd
  but osd does not update pg status to the mon
- mkpg resent to the new osd
or something along those lines. It seems unusual, but in the end who
really cares why the mon doesn't know about the pg creation yet.
Note that this check was added in the initial commit where acting/up was
added; there is no specific condition of concern we are protecting against.
Instead, ignore the message. We'll get a query soon anyway.
This 'fixes' #3614.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
mon: OSDMonitor: don't allow creation of pools with > 65535 pgs
There are some limitations to the number of possible pgs per pool, and
by allowing the 'osd pool create' command to succeed, we were making room
for some anomalous behavior.
Dan Mick [Thu, 13 Dec 2012 22:06:17 +0000 (14:06 -0800)]
rbd: handle images disappearing while in ls -l
rbd.list() returns a list of names, but nothing stops them from
going away before rbd.open(); check for ENOENT and ignore if that
happens; warn on other errors
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
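A minimal sketch of the pattern, assuming a hypothetical open_image callable that raises OSError with an errno. The real bindings raise their own exception types, so this only illustrates the control flow.

```python
import errno
import sys

def _warn(msg):
    print(msg, file=sys.stderr)

def long_listing(names, open_image, warn=_warn):
    """Collect (name, info) rows, tolerating images that vanish mid-listing."""
    rows = []
    for name in names:
        try:
            info = open_image(name)
        except OSError as e:
            # The image was deleted between list() and open(): skip quietly.
            # Anything else is unexpected, so warn but keep listing.
            if e.errno != errno.ENOENT:
                warn("rbd: error opening %s: %s" % (name, e))
            continue
        rows.append((name, info))
    return rows
```

The listing never aborts on a single bad image; only non-ENOENT failures produce output.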
Samuel Just [Thu, 13 Dec 2012 18:52:28 +0000 (10:52 -0800)]
OSD: put connection in disconnect_session_watches as well as the session
obc->watchers now has a ref to the connection as well. This piece of
disconnect_session_watchers essentially parallels remove_watcher and
should generally do the same thing.
Sam Lang [Thu, 13 Dec 2012 00:28:12 +0000 (14:28 -1000)]
filestore: Don't keep checking for syncfs if found
Valgrind outputs a warning for unrecognized system calls,
and does so for the syscall(__SYS_syncfs,...) and
syscall(__NR_syncfs, ...) calls. This patch avoids making
those calls (and the warning, when run in valgrind) if the
syncfs libc call is available.
INFO:teuthology.task.ceph.osd.1.err:--10568-- WARNING: unhandled syscall: 306
INFO:teuthology.task.ceph.osd.1.err:--10568-- You may be able to write your own handler.
INFO:teuthology.task.ceph.osd.1.err:--10568-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
INFO:teuthology.task.ceph.osd.1.err:--10568-- Nevertheless we consider this a bug. Please report
INFO:teuthology.task.ceph.osd.1.err:--10568-- it at http://valgrind.org/support/bug_reports.html.
Samuel Just [Wed, 12 Dec 2012 06:22:31 +0000 (22:22 -0800)]
PG,ReplicatedPG: handle_watch_timeout must not write during scrub/degraded
Currently, handle_watch_timeout will gladly write to an object while
that object is degraded or is being scrubbed. Now, we queue a
callback to be called on scrub completion or finish_degraded_object
to re-invoke handle_watch_timeout. The callback mechanism requires that
registered callbacks expect to be called with the pg lock -- and no
other locks -- already held.
The callback will release the obc and pg refs unconditionally. Thus,
we need to replace the unconnected_watchers pointer with NULL to
ensure that unregister_unconnected_watcher fails to cancel the
event and does not release the resources a second time.
Yehuda Sadeh [Fri, 30 Nov 2012 00:48:46 +0000 (16:48 -0800)]
rgw: option to provide alternative s3 put obj success code
Fixes: #3529
Added a new option: rgw_s3_success_create_obj_status.
Expected values are 0, 200, 201, 204. A value of 0
will skip the special handling altogether. Any other
value will default to 200.
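The rules above can be written as a small lookup. This is a sketch of the described behavior only; the function name and the use of None to mean "skip the special handling" are assumptions.

```python
ALLOWED_STATUSES = (200, 201, 204)

def put_obj_success_status(configured):
    """Map the configured value to an HTTP status, or None to skip handling."""
    if configured == 0:
        return None  # special handling disabled
    if configured in ALLOWED_STATUSES:
        return configured
    return 200  # any unexpected value falls back to 200
```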
Sage Weil [Wed, 12 Dec 2012 01:15:56 +0000 (17:15 -0800)]
os/JournalingObjectStore: un-break op quiescing during journal replay
Commit d9dce4e9273adb4279519d65a0d8bfdfecb5c516 broke journal replay
because the commit thread may try to do a commit, and the ops are not
being applied via the normal work queue. Add back in a simpler form of the
old op quiescing (simpler because there is a single thread doing the
replay).
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Josh Durgin [Tue, 11 Dec 2012 20:26:21 +0000 (12:26 -0800)]
st_rados_watch: tolerate extra notifies
With retries, it's possible for notifies to be received more than once
when they are resent to different OSDs, since the OSDs only track them
in memory.
Yehuda Sadeh [Tue, 11 Dec 2012 21:41:50 +0000 (13:41 -0800)]
mds: shutdown cleanly if can't authenticate
Fixes: #3590
This was triggered when trying to run the mds with cephx enabled
against a mon without cephx support. We didn't handle the
returned error at all; this commit handles it. It also makes
sure that we don't continue initialization until rotating
keys are in place (as the osd does).
Josh Durgin [Tue, 11 Dec 2012 17:54:44 +0000 (09:54 -0800)]
objecter: don't use new tid when retrying notifies
Watches update the on-disk state in the OSD, and aren't idempotent,
so refreshing them must be treated as a separate transaction by the OSD.
Notifies are just in-memory state, and resending them will result in
acceptable behavior:
- if it's the same osd, the resent op will be recognized as a duplicate
- if it's a different osd, a new notify will be triggered since the new osd
can't tell whether the original notify was received by any watchers
Using a new tid for each resend can cause some unnecessary extra work,
as the first case turns into the second.
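The two cases can be illustrated with a toy model of an OSD that remembers notifies only in memory. The class and its fields are hypothetical and far simpler than the real duplicate detection.

```python
class ToyOSD:
    """Tracks seen (client, tid) pairs in memory only."""

    def __init__(self):
        self.seen = set()
        self.notifies_triggered = 0

    def handle_notify(self, client, tid):
        if (client, tid) in self.seen:
            return "dup"  # resend recognized, nothing re-triggered
        self.seen.add((client, tid))
        self.notifies_triggered += 1
        return "new"

osd_a, osd_b = ToyOSD(), ToyOSD()
first = osd_a.handle_notify("client.1", tid=7)
same_osd_same_tid = osd_a.handle_notify("client.1", tid=7)
same_osd_new_tid = osd_a.handle_notify("client.1", tid=8)
other_osd_same_tid = osd_b.handle_notify("client.1", tid=7)
```

Reusing the tid lets the same OSD drop the resend as a duplicate, whereas a fresh tid makes it trigger the notify again even though nothing changed.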
A rename operation can call predirty_journal_parents() several times,
so a directory fragment's rstat can also be modified several times.
But only the first modification is journaled because EMetaBlob::add_dir()
does not update an existing dirlump.
For example: when handling 'mv a/b/c a/c', Server::_rename_prepare may
first decrease directory a and b's nested file counts by one, then
increase directory a's nested file count by one.
Sage Weil [Tue, 11 Dec 2012 00:41:19 +0000 (16:41 -0800)]
config: do not always print config file missing errors
Do not generate errors each time we fail to open a config file; only
generate one at the end if a search path was specified and none were
usable, right before we (already) exit. This avoids spamming stderr
about each path we tried in the search list before we found a good one.
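A minimal sketch of the quiet search, assuming a hypothetical loader function with an injectable opener and logger. The paths in the usage lines are examples, not a statement of the real search order.

```python
import io

def load_first_config(paths, opener=open, log=print):
    """Return (path, contents) for the first readable file, else None."""
    if not paths:
        return None  # no search path specified, so nothing to complain about
    for path in paths:
        try:
            with opener(path) as f:
                return path, f.read()
        except OSError:
            continue  # stay quiet; a later path may still work
    # Report once, only after every candidate has failed.
    log("unable to open any of: " + ", ".join(paths))
    return None

def fake_open(path):
    # Hypothetical opener: only one path exists.
    if path != "/etc/ceph/ceph.conf":
        raise FileNotFoundError(path)
    return io.StringIO("[global]\n")

errors = []
found = load_first_config(["./ceph.conf", "/etc/ceph/ceph.conf"],
                          opener=fake_open, log=errors.append)
errors_after_success = list(errors)
missing = load_first_config(["./a.conf", "./b.conf"],
                            opener=fake_open, log=errors.append)
```

The first call produces no error output even though one path failed; the second produces exactly one message covering the whole search list.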
Samuel Just [Mon, 10 Dec 2012 21:38:24 +0000 (13:38 -0800)]
config_opts.h: adjust recovery defaults
osd max backfills: 5 was too low for a default, 10
seems to work better in testing. The message
priority system should minimize disruption of
push and pull operations anyway.
osd recovery max chunk: 1MB was too small for a
default. 8MB is reasonable for a single push
and will allow us to recover an rbd block in
one push rather than 4, reducing client io
latency during log-based recovery.
osd recovery op priority: 10 rather than 30 will
further reduce the client io latency impact of
push and pull operations.
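The "one push rather than 4" figure follows from simple ceiling division, assuming a 4MB rbd object (an assumption implied by the numbers above, not stated in the commit).

```python
MB = 1 << 20

def pushes_needed(object_size, max_chunk):
    """Number of pushes to recover one object, rounding up."""
    return -(-object_size // max_chunk)  # ceiling division

RBD_OBJECT_SIZE = 4 * MB  # assumed rbd object size

old_pushes = pushes_needed(RBD_OBJECT_SIZE, 1 * MB)
new_pushes = pushes_needed(RBD_OBJECT_SIZE, 8 * MB)
```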
Sage Weil [Sun, 9 Dec 2012 05:44:54 +0000 (21:44 -0800)]
mon: fix leak of pool op reply data
We pass a pointer because it is an optional argument, but we shouldn't
put the bufferlist on the heap or else we have to manage its life
cycle, and that's fragile (and previously broken).
Sage Weil [Fri, 7 Dec 2012 00:18:07 +0000 (16:18 -0800)]
filestore: simplify op quiescing
The delicate balancing with op_apply_start() and the fact that it can
block was making it very hard to determine how long commit_start() should
wait, since requests in the workqueue threads could op_apply_start() in
any order. For example,
threadA: gets osr1 from wq
threadB: gets osr2 from wq
threadA: dequeue seq 11 from osr1, op_apply_start
threadC: commit_start on 11
threadA: op_apply_finish on seq 11
threadC: commit_started, commit_finish
threadB: dequeue seq 10 from osr2
<failed assert, badness>
Instead, rip out all this code, and use the ThreadPool pause() method to
quiesce operations. Keep some of the (now unnecessary) fields around
for sanity checks (blocked, open_ops, max_applying_seq, etc.).
Samuel Just [Tue, 4 Dec 2012 19:36:58 +0000 (11:36 -0800)]
PG: remove last_epoch_started asserts in proc_primary_info
These asserts are valid for a uniform cluster, but they won't hold
for a replica running a version without the info.last_epoch_started
patch.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 0756052cff542ab02d653b40c37a645b395f31b3)