Jim Schutt [Wed, 1 Feb 2012 15:54:25 +0000 (08:54 -0700)]
common/Throttle: throttle in FIFO order
Under heavy write load from many clients, many reader threads will
be waiting in the policy throttler, all on a single condition variable.
When a wakeup is signalled, any of those threads may receive the
signal. This increases the variance in the message processing
latency, and in extreme cases can significantly delay a message.
This patch causes threads to exit a throttler in the same order
they entered.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
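A minimal sketch of the FIFO idea described in this commit, not the actual patch: each waiting thread gets its own condition variable, queued in arrival order, and only the head of the queue is ever signalled. The class name `FifoThrottle` and its methods are hypothetical, and the sketch assumes no caller asks for more than `max_units` at once.

```python
import threading
from collections import deque


class FifoThrottle:
    """Hypothetical throttle whose waiters leave in the order they arrived."""

    def __init__(self, max_units):
        self.max_units = max_units
        self.current = 0
        self.lock = threading.Lock()
        self.waiters = deque()  # one condition variable per blocked thread

    def _should_wait(self, count):
        return self.current + count > self.max_units

    def get(self, count=1):
        with self.lock:
            # Queue up if someone is already waiting, even when there is room,
            # so that a newcomer can never overtake an earlier waiter.
            if self.waiters or self._should_wait(count):
                cond = threading.Condition(self.lock)
                self.waiters.append(cond)
                while self.waiters[0] is not cond or self._should_wait(count):
                    cond.wait()
                self.waiters.popleft()
            self.current += count
            # Let the new head of the queue re-check whether it fits.
            if self.waiters:
                self.waiters[0].notify()

    def put(self, count=1):
        with self.lock:
            self.current -= count
            # Wake only the oldest waiter rather than an arbitrary one.
            if self.waiters:
                self.waiters[0].notify()
```

With a single shared condition variable, `put()` would wake whichever thread the scheduler picks; here the wakeup always goes to the thread that has waited longest.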
Greg Farnum [Wed, 1 Feb 2012 21:25:37 +0000 (13:25 -0800)]
osd: add check_ops_in_flight()
By default it warns on requests that are more than 30 seconds old,
using an exponential backoff of that interval.
Also add state name retrieval to OpRequest.
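A rough sketch of the warning schedule described above, assuming that "exponential backoff" means the warning threshold for a given request doubles each time it fires. The names `TrackedOp`, `warn_multiplier`, and this Python `check_ops_in_flight` are illustrative only and are not taken from the OSD code.

```python
from dataclasses import dataclass


@dataclass
class TrackedOp:
    received_at: float        # time the request arrived, in seconds
    warn_multiplier: int = 1  # assumption: doubles after every warning


def check_ops_in_flight(ops, now, warn_age=30.0):
    """Return one warning string per op that crossed its current threshold."""
    warnings = []
    for op in ops:
        age = now - op.received_at
        if age >= warn_age * op.warn_multiplier:
            warnings.append(f"request is {age:.0f} seconds old")
            op.warn_multiplier *= 2  # next warning at twice the age
    return warnings
```

Under these assumptions a stuck request would be reported at roughly 30, 60, 120, 240 seconds and so on, instead of on every check.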
Greg Farnum [Mon, 30 Jan 2012 22:50:28 +0000 (14:50 -0800)]
osd: "mark" OpRequests as they move through the system.
Right now these are just informational flags which can be read out. Later
they might extend to timing information, separate lists for more precise
control over latency warnings, etc.
Sage Weil [Tue, 31 Jan 2012 21:00:45 +0000 (13:00 -0800)]
qa: test_backfill.sh: take osd.0 down
Mark this down to
1- trigger the WaitActingChange vs osd down race, and
2- help trigger a divergent log when osd.2 is blackholed+restarted during
backfill. e.g.,
Sage Weil [Tue, 31 Jan 2012 17:53:32 +0000 (09:53 -0800)]
osd: restart peering if requesting acting osd goes down
If we request an acting set, we need to restart peering if one of the
requested nodes goes down. This prevents a deadlock where we get stuck
in WaitActingChange because we have [a,b], want [a,b,c], but c is down and
our up and acting don't actually change.
Sage Weil [Tue, 31 Jan 2012 15:25:04 +0000 (07:25 -0800)]
osd: fix divergent backfill targets
During peering, a previous backfill target may have a slightly newer
last_update than the other options, but it will not be chosen because it
is incomplete. That caused a failed assert during activate() (#1983).
To fix, we remove the bad assert, and then fix merge_log() so that the
replica/backfill target will trim its divergent entries when it gets the
activation MLogRec. We also fix the handling of MInfoRec, as that can
trigger an analogous condition.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 31 Jan 2012 01:39:23 +0000 (17:39 -0800)]
filestore: implement filestore_blackhole hook
If true, we'll drop any new transactions on the floor. Useful for
triggering failure conditions (e.g., prior to killing ceph-osd itself, to
ensure some operations don't reach the local disk).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 30 Jan 2012 22:27:24 +0000 (14:27 -0800)]
osd: disable clone overlap for push/pull
There is a bug in the push/pull code. Disable the recovery smarts by
default until we fix #2002.
There is currently a race (in the callers) where:
- an adjacent clone is missing
- we (calculate some clone overlap? and) start pulling
- we get adjacent clone
- we get push, calc a different overlap, and then get confused.
Sage Weil [Mon, 30 Jan 2012 04:54:18 +0000 (20:54 -0800)]
qa: test/rados-api/list fix warning
warning: test/rados-api/list.cc:43:156: converting ‘false’ to pointer type for argument 1 of ‘char testing::internal::IsNullLiteralHelper(testing::internal::Secret*)’ [-Wconversion-null]
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 30 Jan 2012 04:36:46 +0000 (20:36 -0800)]
test_ipaddr: reverse ASSERT_EQ order
Make these warnings go away:
warning: test/test_ipaddr.cc:217:156: converting ‘false’ to pointer type for argument 1 of ‘char testing::internal::IsNullLiteralHelper(testing::internal::Secret*)’ [-Wconversion-null]
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 29 Jan 2012 17:26:28 +0000 (09:26 -0800)]
mon: trim old auth states
These aren't exposed outside the monitor, so we really only keep them
around to assist in mon recovery. Give ourselves a healthy margin over
the max join drift for that.
Fixes: #2000
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 26 Jan 2012 00:37:34 +0000 (16:37 -0800)]
filejournal: assume gibberish flags imply none
Old journals didn't properly initialize the flags (oops). Assume that
any bits besides the first 2 imply no flags.
Make note that this hack needs to be removed after some time has passed,
but well before these new flags are used. Or, such use should be
accompanied by a full header format rev and incompatibility.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
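The rule above can be sketched as a simple mask check. This assumes "the first 2" bits means the two lowest-order bits of the flags word; `KNOWN_FLAGS_MASK` and `effective_flags` are hypothetical names, not identifiers from the journal code.

```python
KNOWN_FLAGS_MASK = 0b11  # assumption: the two lowest-order bits are the defined flags


def effective_flags(raw_flags):
    """Treat a header with any undefined bit set as having no flags at all."""
    if raw_flags & ~KNOWN_FLAGS_MASK:
        return 0  # uninitialized garbage from an old journal
    return raw_flags
```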
Sage Weil [Thu, 26 Jan 2012 00:36:17 +0000 (16:36 -0800)]
filejournal: include crc in entry header/footer
Use the unused flags field for this. Previously it was always 0, so this
lets us skip old entries on old journals and only worry about missing one
out of 2^32 corruptions. New journals get a flag that strictly enforces
the crc check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
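A sketch of the check this commit describes, under stated assumptions: `zlib.crc32` stands in for whatever checksum the journal really uses, and `entry_crc_ok` and its `strict` parameter are hypothetical. A stored value of 0 is taken to mean "old entry, nothing to verify" unless the journal carries the strict flag.

```python
import zlib


def entry_crc_ok(payload: bytes, stored_crc: int, strict: bool) -> bool:
    """Validate one journal entry against the crc kept in its header."""
    if stored_crc == 0 and not strict:
        # Old journals always wrote 0 here, so there is nothing to compare.
        # A new entry whose real crc is 0 also slips through: 1 case in 2^32.
        return True
    return (zlib.crc32(payload) & 0xFFFFFFFF) == stored_crc
```

On a journal with the strict flag set, a zero field is checked like any other value, which closes the 1-in-2^32 gap.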
Sage Weil [Fri, 27 Jan 2012 21:21:39 +0000 (13:21 -0800)]
mon: mark pgs stale in pg_map if primary osd is down
This alerts the administrator when all OSDs for a PG have failed and the
monitor doesn't receive any further updates. Otherwise we may continue
to think a pg is active+clean when it is in fact offline.
Fixes: #1993
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 27 Jan 2012 18:41:50 +0000 (10:41 -0800)]
filestore: dump offending transaction on any error
Clean this code up to explicitly whitelist what is ok so that the flow is
less annoying to follow/maintain, and so that we dump the transaction
contents on whitelisted errors.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Sage Weil [Fri, 27 Jan 2012 18:39:49 +0000 (10:39 -0800)]
objecter: fix bounds checking on op reply demuxing
We can't assume that the size of out_ops (from the reply) matches the
op->out_* vectors from our request state. In particular, the out_ops might
be shorter than what we sent the OSD if the OSD was sloppy. Check them.
We can assume that op->ops and op->out_* all match; assert as much in
op_submit().
Fixes: #1986
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
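The bounds check can be illustrated roughly as follows. This is a sketch only: `demux_reply`, `reply_out_ops`, and `out_bufs` are made-up names standing in for the reply's out_ops and the request's op->out_* vectors, and the real code deals with more than one kind of output slot.

```python
def demux_reply(reply_out_ops, out_bufs):
    """Copy each reply payload into the matching request-side buffer.

    Only indexes present on both sides are touched, so a reply that is
    shorter than the request cannot cause an out-of-range access.
    Returns the number of ops that were matched.
    """
    matched = min(len(reply_out_ops), len(out_bufs))
    for i in range(matched):
        if out_bufs[i] is not None:  # the caller may not want this op's data
            out_bufs[i].extend(reply_out_ops[i])
    return matched
```

For example, a request with three output slots that receives a two-op reply fills at most the first two slots and leaves the third untouched.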
Sage Weil [Wed, 25 Jan 2012 20:38:06 +0000 (12:38 -0800)]
osd: remove num_kb from object_stat_sum_t stats
This is redundant--we can just use num_bytes. If we're worried about the
per-object overhead or rounding, we can factor in some overhead based on
num_objects.
And, the kb accounting has a bug (#1988).
Avoid changing the encoding at all for now. Next time the encoding changes
we'll drop the old field.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 25 Jan 2012 06:03:51 +0000 (22:03 -0800)]
osd: track obc for clone from log replay
We need to keep an in-memory obc to track the state of the in-flight io
to disk. This is analogous to when an object is pushed + written, and we
can share the same completion function.
Alexandre Oliva [Tue, 17 Jan 2012 19:22:17 +0000 (17:22 -0200)]
package *.py* files
Some post-install rpmbuild defaults byte-compile all packaged python
files, so don't bother removing the .pyc files, and package .py* to
get both .pyo and .pyc. It wastes a tiny little bit of space, but it
makes the spec file portable across a wider range of rpm and python
configurations.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicam.br>
Signed-off-by: Sage Weil <sage@newdream.net>
Josh Durgin [Wed, 25 Jan 2012 00:52:27 +0000 (16:52 -0800)]
librbd: don't infinite loop when header is too large
Since snapshots are currently stored at the end of the header, having
many snapshots made the header larger than the read size, resulting in
an infinite loop when the offset was not changed.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
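To illustrate the class of bug fixed here, a minimal sketch of a chunked header read that advances its offset on every pass. `read_full_header` and the `read_at(offset, length)` callback are hypothetical and do not mirror the librbd API.

```python
def read_full_header(read_at, chunk_size=4096):
    """Read a header of unknown size in fixed-size chunks.

    read_at(offset, length) is assumed to return up to `length` bytes
    starting at `offset`, and fewer (possibly none) at end of object.
    """
    data = bytearray()
    offset = 0
    while True:
        chunk = read_at(offset, chunk_size)
        data += chunk
        offset += len(chunk)  # without this, a large header re-reads chunk 0 forever
        if len(chunk) < chunk_size:
            break  # short read: the whole header has been consumed
    return bytes(data)
```

A header that is an exact multiple of `chunk_size` costs one extra, empty read before the loop ends.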
Sage Weil [Mon, 23 Jan 2012 18:21:04 +0000 (10:21 -0800)]
osd: ignore MInfoRec, MNotifyRec in WaitActingChange
We should ignore logs, infos, and notifies while we are waiting for the
map to change. Peering has reached a dead-end (we need acting to change)
and we will redo our work when that happens. That includes the replicas
resending notifies.
Fixes: #1958
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>