Greg Farnum [Wed, 1 Feb 2012 21:25:37 +0000 (13:25 -0800)]
osd: add check_ops_in_flight()
By default it warns on requests that are more than 30 seconds old,
using an exponential backoff of that interval.
Also add state name retrieval to OpRequest.
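
A minimal sketch of the warning policy described above, assuming each tracked op remembers a per-op multiplier that doubles after every warning. The names `TrackedOp` and `warn_multiplier` and the message text are hypothetical illustrations, not the actual OSD code:

```python
from dataclasses import dataclass


@dataclass
class TrackedOp:
    """Hypothetical stand-in for an in-flight request."""
    name: str
    received_at: float        # seconds on an arbitrary clock
    warn_multiplier: int = 1  # doubles each time we warn about this op


def check_ops_in_flight(ops, now, warn_age=30.0):
    """Return warning strings for ops older than warn_age * warn_multiplier."""
    warnings = []
    for op in ops:
        age = now - op.received_at
        if age >= warn_age * op.warn_multiplier:
            warnings.append(f"slow request {op.name}: {age:.0f}s old")
            op.warn_multiplier *= 2  # back off: next warning at twice the age
    return warnings
```

With the default of 30 seconds, a stuck op in this sketch is reported at roughly 30s, 60s, 120s, and so on, instead of on every check.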
Greg Farnum [Mon, 30 Jan 2012 22:50:28 +0000 (14:50 -0800)]
osd: "mark" OpRequests as they move through the system.
Right now these are just informational flags which can be read out. Later
they might extend to timing information, separate lists for more precise
control over latency warnings, etc.
Sage Weil [Tue, 31 Jan 2012 21:00:45 +0000 (13:00 -0800)]
qa: test_backfill.sh: take osd.0 down
Mark this down to
1- trigger the WaitActingChange vs osd down race, and
2- help trigger a divergent log when osd.2 is blackholed+restarted during
backfill. e.g.,
Sage Weil [Tue, 31 Jan 2012 17:53:32 +0000 (09:53 -0800)]
osd: restart peering if requesting acting osd goes down
If we request an acting set, we need to restart peering if one of the
requested nodes goes down. This prevents a deadlock where we get stuck
in WaitActingChange because we have [a,b], want [a,b,c], but c is down and
our up and acting don't actually change.
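
The condition above can be sketched as follows. This is only an illustration under assumed names (`needs_peering_restart`, plain lists and sets of OSD ids), not the PG state machine itself:

```python
def needs_peering_restart(old_acting, new_acting, want_acting, down_osds):
    """Decide whether a new map should restart peering.

    old_acting / new_acting: acting sets before and after the map change.
    want_acting: the acting set we asked for (hypothetical name).
    down_osds: OSDs marked down in the new map.
    """
    if old_acting != new_acting:
        return True  # the usual trigger: the acting set changed
    # The extra check: a requested member went down, so the request can
    # never be satisfied and waiting for an acting change would deadlock.
    return any(osd in down_osds for osd in want_acting)
```

For the case in the message, acting stays `["a", "b"]`, the wanted set is `["a", "b", "c"]`, and `c` is down, so the sketch returns `True` even though acting did not change.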
Sage Weil [Tue, 31 Jan 2012 15:25:04 +0000 (07:25 -0800)]
osd: fix divergent backfill targets
During peering, a previous backfill target may have a slightly newer
last_update than the other options, but it will not be chosen because it
is incomplete. That caused a failed assert during activate() (#1983).
To fix, we remove the bad assert, and then fix merge_log() so that the
replica/backfill target will trim its divergent entries when it gets the
activation MLogRec. We also fix the handling of MInfoRec, as that can
trigger an analogous condition.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 31 Jan 2012 01:39:23 +0000 (17:39 -0800)]
filestore: implement filestore_blackhole hook
If true, we'll drop any new transactions on the floor. Useful for
triggering failure conditions (e.g., prior to killing ceph-osd itself, to
ensure some operations don't reach the local disk).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
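
A toy illustration of what such a hook does, assuming a store object with a boolean `blackhole` switch. `ToyStore` and `queue_transaction` are hypothetical names for this sketch and do not mirror the FileStore interface:

```python
class ToyStore:
    """Hypothetical in-memory store with a blackhole switch."""

    def __init__(self, blackhole=False):
        self.blackhole = blackhole
        self.applied = []  # transactions that actually "reached disk"

    def queue_transaction(self, txn):
        """Apply txn unless blackholed; return True if it was applied."""
        if self.blackhole:
            return False  # dropped on the floor, nothing is persisted
        self.applied.append(txn)
        return True
```

Flipping `blackhole` to true before killing the process guarantees, in this sketch, that later transactions never land in `applied`.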
Sage Weil [Mon, 30 Jan 2012 22:27:24 +0000 (14:27 -0800)]
osd: disable clone overlap for push/pull
There is a bug in the push/pull code. Disable the recovery smarts by
default until we fix #2002.
There is currently a race (in the callers) where:
- an adjacent clone is missing
- we (calculate some clone overlap? and) start pulling
- we get adjacent clone
- we get push, calc a different overlap, and then get confused.
Sage Weil [Mon, 30 Jan 2012 04:54:18 +0000 (20:54 -0800)]
qa: test/rados-api/list fix warning
warning: test/rados-api/list.cc:43:156: converting ‘false’ to pointer type for argument 1 of ‘char testing::internal::IsNullLiteralHelper(testing::internal::Secret*)’ [-Wconversion-null]
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 30 Jan 2012 04:36:46 +0000 (20:36 -0800)]
test_ipaddr: reverse ASSERT_EQ order
Make these warnings go away:
warning: test/test_ipaddr.cc:217:156: converting ‘false’ to pointer type for argument 1 of ‘char testing::internal::IsNullLiteralHelper(testing::internal::Secret*)’ [-Wconversion-null]
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 29 Jan 2012 17:26:28 +0000 (09:26 -0800)]
mon: trim old auth states
These aren't exposed outside the monitor, so we really only keep them
around to assist in mon recovery. Give ourselves a healthy margin over
the max join drift for that.
Fixes: #2000
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 26 Jan 2012 00:37:34 +0000 (16:37 -0800)]
filejournal: assume gibberish flags imply none
Old journals didn't properly initialize the flags (oops). Assume that
any bits besides the first 2 imply no flags.
Make note that this hack needs to be removed after some time has passed,
but well before these new flags are used. Or, such use should be
accompanied by a full header format rev and incompatibility.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
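
The rule "any bits besides the first 2 imply no flags" could look like this. The mask constant and function name are assumptions made for the sketch:

```python
KNOWN_FLAGS_MASK = 0b11  # assumed: only the two lowest bits are defined


def sanitize_flags(raw_flags):
    """Treat a flags word with any unknown bit set as uninitialized."""
    if raw_flags & ~KNOWN_FLAGS_MASK:
        return 0  # gibberish from an old journal: assume no flags
    return raw_flags
```

This is also why the hack has to go before more flag bits are defined: a legitimate third flag would be indistinguishable from garbage.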
Sage Weil [Thu, 26 Jan 2012 00:36:17 +0000 (16:36 -0800)]
filejournal: include crc in entry header/footer
Use the unused flags field for this. Previously it was always 0, so this
lets us skip old entries on old journals and only worry about missing one
out of 2^32 corruptions. New journals get a flag that strictly enforces
the crc check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
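
A sketch of the validation logic this implies, assuming a per-journal `enforce_crc` flag. Here `zlib.crc32` is only a stand-in for whatever checksum the journal really uses, and the function name is hypothetical:

```python
import zlib


def entry_is_valid(payload: bytes, stored_crc: int, enforce_crc: bool) -> bool:
    """Check a journal entry's checksum.

    On old journals (enforce_crc False) a stored crc of 0 means the entry
    predates checksums, so it is accepted unverified. A new entry whose
    real crc happens to be 0 is the 1-in-2^32 case that goes unchecked.
    """
    if stored_crc == 0 and not enforce_crc:
        return True
    return zlib.crc32(payload) == stored_crc
```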
Sage Weil [Fri, 27 Jan 2012 21:21:39 +0000 (13:21 -0800)]
mon: mark pgs stale in pg_map if primary osd is down
This alerts the administrator when all OSDs for a PG have failed and the
monitor doesn't receive any further updates. Otherwise we may continue
to think a pg is active+clean when it is in fact offline.
Fixes: #1993
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
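
Roughly, the monitor-side check amounts to the following. The data structures (dicts keyed by pg id, a set of up OSD ids) are assumptions for this sketch rather than the real pg_map types:

```python
def mark_stale_pgs(pg_states, pg_primary, up_osds):
    """Add 'stale' to every pg whose primary OSD is not up.

    pg_states: pgid -> set of state names (mutated in place and returned).
    pg_primary: pgid -> primary OSD id.
    up_osds: set of OSD ids currently up.
    """
    for pgid, primary in pg_primary.items():
        if primary not in up_osds:
            pg_states[pgid].add("stale")
    return pg_states
```

A pg that last reported active+clean then shows up as stale as well, which is the signal that its state is no longer being refreshed.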
Sage Weil [Fri, 27 Jan 2012 18:41:50 +0000 (10:41 -0800)]
filestore: dump offending transaction on any error
Clean this code up to explicitly whitelist what is ok so that the flow is
less annoying to follow/maintain, and so that we dump the transaction
contents on whitelisted errors.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Sage Weil [Fri, 27 Jan 2012 18:39:49 +0000 (10:39 -0800)]
objecter: fix bounds checking on op reply demuxing
We can't assume that the size of out_ops (from the reply) matches the
op->out_* vectors from our request state. In particular, the out_ops might
be shorter than what we sent the OSD if the OSD was sloppy. Check them.
We can assume that op->ops and op->out_* all match; assert as much in
op_submit().
Fixes: #1986
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
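
The bounds check can be illustrated like this, assuming the reply is a list of per-op results and the request keeps a parallel list of optional callbacks. Both names are hypothetical stand-ins for `out_ops` and the `op->out_*` vectors:

```python
def demux_reply(out_ops, out_handlers):
    """Hand each reply op to its request-side handler.

    Only indexes positions that exist in BOTH lists, so a short (or
    overlong) reply cannot walk off the end of either one.
    Returns the number of positions processed.
    """
    n = min(len(out_ops), len(out_handlers))
    for i in range(n):
        handler = out_handlers[i]
        if handler is not None:  # not every op wants its output
            handler(out_ops[i])
    return n
```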
Sage Weil [Wed, 25 Jan 2012 20:38:06 +0000 (12:38 -0800)]
osd: remove num_kb from object_stat_sum_t stats
This is redundant--we can just use num_bytes. If we're worried about the
per-object overhead or rounding, we can factor in some overhead based on
num_objects.
And, the kb accounting has a bug (#1988).
Avoid changing the encoding at all for now. Next time the encoding changes
we'll drop the old field.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 25 Jan 2012 06:03:51 +0000 (22:03 -0800)]
osd: track obc for clone from log replay
We need to keep an in-memory obc to track the state of the in-flight io
to disk. This is analogous to when an object is pushed + written, and we
can share the same completion function.
Alexandre Oliva [Tue, 17 Jan 2012 19:22:17 +0000 (17:22 -0200)]
package *.py* files
Some post-install rpmbuild defaults byte-compile all packaged python
files, so don't bother removing the .pyc files, and package .py* to
get both .pyo and .pyc. It wastes a tiny little bit of space, but it
makes the spec file portable across a wider range of rpm and python
configurations.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicam.br>
Signed-off-by: Sage Weil <sage@newdream.net>
Josh Durgin [Wed, 25 Jan 2012 00:52:27 +0000 (16:52 -0800)]
librbd: don't infinite loop when header is too large
Since snapshots are currently stored at the end of the header, having
many snapshots made the header larger than the read size, resulting in
an infinite loop when the offset was not changed.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
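
A sketch of a chunked header read that cannot spin, assuming a `read_at(offset, length)` callable that returns at most `length` bytes. The function name and chunk size are illustrative, not the librbd code:

```python
def read_full_header(read_at, chunk_size=4096):
    """Read a header of unknown size in fixed-size chunks.

    read_at(offset, length) -> bytes is a hypothetical reader.
    """
    data = bytearray()
    offset = 0
    while True:
        chunk = read_at(offset, chunk_size)
        data.extend(chunk)
        offset += len(chunk)  # advancing the offset is what ends the loop
        if len(chunk) < chunk_size:
            break  # short read: we have reached the end of the header
    return bytes(data)
```

If `offset` were left at 0, every iteration would return the same full chunk and the loop would never see a short read, which is the failure mode the message describes.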
Sage Weil [Mon, 23 Jan 2012 18:21:04 +0000 (10:21 -0800)]
osd: ignore MInfoRec, MNotifyRec in WaitActingChange
We should ignore logs, infos, and notifies while we are waiting for the
map to change. Peering has reached a dead-end (we need acting to change)
and we will redo our work when that happens. That includes the replicas
resending notifies.
Fixes: #1958
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Thu, 19 Jan 2012 02:01:09 +0000 (18:01 -0800)]
osd: do not clobber log on backfill progress update
This is unnecessary and counterproductive, since the log is used to detect
dup ops. It's an artifact of an earlier backfill iteration that didn't
preserve the log on the backfill target.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Yehuda Sadeh [Fri, 20 Jan 2012 20:54:14 +0000 (12:54 -0800)]
rgw: read_user_buckets() fix redone
The problem with the original fix is that it wasn't atomic. Going back
to the original inefficient (though atomic) method. We should limit
the number of buckets per user anyway, and shouldn't get to a point
where this code is actually exercised.
Neil Horman [Wed, 18 Jan 2012 17:00:14 +0000 (12:00 -0500)]
Convert mount.ceph to use KEY_SPEC_PROCESS_KEYRING
Having mount.ceph use KEY_SPEC_USER_KEYRING to pass keys to the kernel has
several disadvantages:
1) It leaves the key sitting in the uid keyring, which is reachable from the
session keyring via a link (see keyctl list <root session keyring ref>). This
means it's accessible to other processes in the same session that don't need
access to it, even after the kernel is done with it.
2) The user keyring has some very counter-intuitive semantics as far as keyring
permissions go. The user keyring is accessed via a link from the session
keyring, which a process may not have permission to access in some situations.
For instance, if mount.ceph is executed via su without having started a new
session, mount.ceph will not have access to the uid keyring unless the calling
process (in this case su) has granted access permission. The result is an -EPERM
error when executing mount.ceph against a cephx-enabled server. If the same
command is attempted in a new root session (e.g. su - or su -l), the mount
command will work fine.
Switching the mount.ceph command to use KEY_SPEC_PROCESS_KEYRING solves both
of these problems. With this keyring, accessibility is guaranteed regardless of
the session specifics, because the key is added and accessed in the same process
context, both in user space and in the kernel. It also ensures that the key is
cleaned up automatically after the mount.ceph process exits, since there is no
longer a need for it (the kernel clones the key during the mount process and
releases it on unmount).
I've tested this on my local ceph cluster, and it works properly under both
su and su -l.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Josh Durgin <josh.durgin@dreamhost.com>