git.apps.os.sepia.ceph.com Git - ceph.git/log
Sage Weil [Wed, 14 Mar 2012 20:20:54 +0000 (13:20 -0700)]
run valgrind with cwd set to /tmp/cephtest/archive/coredump
This lets us capture the vgcore.* files, which always go to valgrind's
cwd.
Fixes: #1953
Josh Durgin [Fri, 16 Mar 2012 18:40:17 +0000 (11:40 -0700)]
suite: log results and coverage generation
Need to figure out where and when results emails are failing.
Josh Durgin [Thu, 15 Mar 2012 23:21:33 +0000 (16:21 -0700)]
results: make sure email is sent before anything else fails
Mark Nelson [Wed, 14 Mar 2012 20:32:23 +0000 (15:32 -0500)]
Merge branch 'master' of github.com:ceph/teuthology
Sage Weil [Tue, 13 Mar 2012 17:09:18 +0000 (10:09 -0700)]
gitbuilder: put flavor last
in case we refine the field later
Sage Weil [Tue, 13 Mar 2012 17:02:26 +0000 (10:02 -0700)]
Pull from new gitbuilder.ceph.com locations.
Simplifies the flavor stuff into a tuple of
<package,type,flavor,dist,arch>
where package is ceph, kernel, etc.
type is tarball, deb
flavor is basic, gcov, notcmalloc
arch is x86_64, i686 (uname -m)
dist is oneiric, etc. (lsb_release -s -c)
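As a rough illustration of the tuple described above, the five fields could be modelled as a named tuple. The name `BuildSpec` and the sample values are assumptions made for this sketch; they are not taken from the teuthology source.

```python
from collections import namedtuple

# Hypothetical container; the field order follows the commit message.
BuildSpec = namedtuple("BuildSpec", ["package", "type", "flavor", "dist", "arch"])

# Sample values drawn from the options listed in the commit message.
spec = BuildSpec(package="ceph", type="tarball", flavor="basic",
                 dist="oneiric", arch="x86_64")
```

Keeping the fields named makes it harder to mix up dist and arch when one of them is refined later.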
Mark Nelson [Mon, 12 Mar 2012 20:13:36 +0000 (15:13 -0500)]
Made the example better with multiple roles.
Mark Nelson [Mon, 12 Mar 2012 19:33:10 +0000 (14:33 -0500)]
Added some example yaml files and an example parallel execution task.
Sage Weil [Sun, 11 Mar 2012 03:15:21 +0000 (19:15 -0800)]
autotest: pull from github.com/ceph/autotest
Sage Weil [Sat, 10 Mar 2012 23:34:19 +0000 (15:34 -0800)]
workunit: include python2.7 path too
Samuel Just [Fri, 17 Feb 2012 00:10:45 +0000 (16:10 -0800)]
rados.py: include setattr and rmattr
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Mark Nelson [Wed, 7 Mar 2012 16:34:55 +0000 (08:34 -0800)]
lock: Improved logging when there aren't enough nodes available to lock-many.
Mark Nelson [Wed, 7 Mar 2012 17:02:39 +0000 (09:02 -0800)]
lock: Added a --locked flag to teuthology-lock.
Can be used to restrict searches based on lock status, e.g.
'teuthology-lock --list -a --locked false --status up' shows available nodes.
Sage Weil [Tue, 6 Mar 2012 17:34:38 +0000 (09:34 -0800)]
nuke: unmount osd data directories
This helps us avoid reboot to clean up osd data directories that are left
mounted.
Josh Durgin [Mon, 5 Mar 2012 18:28:35 +0000 (10:28 -0800)]
Use non-zero exit status if any tests failed
Fixes: #1989
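A minimal sketch of the idea, assuming results are collected as a mapping from test name to a pass/fail boolean. The function name `exit_status` and the data layout are hypothetical, not the actual change.

```python
def exit_status(results):
    """Return 0 if every test passed, 1 if any test failed."""
    return 0 if all(results.values()) else 1


# Illustrative data only: one passing and one failing test.
status = exit_status({"rados": True, "rbd": False})
```

A caller would then pass `status` to `sys.exit()` so that shell scripts and schedulers can detect the failure.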
Sage Weil [Fri, 2 Mar 2012 18:55:19 +0000 (10:55 -0800)]
github.com/NewDreamNetwork -> github.com/ceph
Josh Durgin [Wed, 29 Feb 2012 23:47:17 +0000 (15:47 -0800)]
dump_stuck: note required ceph configuration
Josh Durgin [Tue, 28 Feb 2012 21:55:46 +0000 (13:55 -0800)]
dump_stuck: verify that 'ceph health' mentions the right number of inactive/unclean/stale pgs
Sage Weil [Tue, 28 Feb 2012 17:50:29 +0000 (09:50 -0800)]
peer: ignore +scrubbing portion of pg state
It can cause the mon state and osd states to not match.
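One way to picture this, assuming PG states are '+'-separated strings such as 'active+clean+scrubbing': strip the scrubbing flag before comparing. The helper name `normalize_pg_state` is hypothetical and the real task may do this differently.

```python
def normalize_pg_state(state):
    """Drop the 'scrubbing' flag from a '+'-separated PG state string."""
    return "+".join(flag for flag in state.split("+") if flag != "scrubbing")


# The mon and the OSD may disagree only on the scrubbing flag.
mon_state = normalize_pg_state("active+clean+scrubbing")
osd_state = normalize_pg_state("active+clean")
```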
Sage Weil [Sun, 26 Feb 2012 05:05:00 +0000 (21:05 -0800)]
peer: wait for peering to complete, or block
We need to wait for peering to either complete, or block because it is
waiting for another PG. _Then_ look at all the PG states and compare the
mon values with what we get from querying the OSDs directly.
Josh Durgin [Fri, 24 Feb 2012 23:01:34 +0000 (15:01 -0800)]
peer: remove unused variable
Josh Durgin [Fri, 24 Feb 2012 22:55:49 +0000 (14:55 -0800)]
misc: always return a usable result from get_valgrind_args
Josh Durgin [Fri, 24 Feb 2012 22:55:23 +0000 (14:55 -0800)]
rgw: simplify valgrind args
Sage Weil [Fri, 24 Feb 2012 23:05:17 +0000 (15:05 -0800)]
add peer task
Force a pg to get stuck in 'down' state, verify we can query the peering
state, then start the OSD so it can recover.
Sage Weil [Fri, 24 Feb 2012 19:11:59 +0000 (11:11 -0800)]
lost_unfound: list missing/unfound for each pg and verify the unfound counts
This also tests the pg list_missing functionality.
Sage Weil [Fri, 24 Feb 2012 17:22:03 +0000 (09:22 -0800)]
ceph_manager: list_pg_missing
List missing objects for the given pgid.
Josh Durgin [Fri, 24 Feb 2012 20:04:58 +0000 (12:04 -0800)]
Whitespace and unnecessary formatting fixes
Josh Durgin [Fri, 24 Feb 2012 19:21:04 +0000 (11:21 -0800)]
ceph, ceph-fuse: simplify valgrind argument additions
Sage Weil [Wed, 22 Feb 2012 17:18:17 +0000 (09:18 -0800)]
refactor all valgrind users to use a get_valgrind_args() helper
This avoids much annoying, duplicated code.
Sage Weil [Wed, 22 Feb 2012 01:06:50 +0000 (17:06 -0800)]
ceph: always create valgrind logs dir
Other tasks use it too. It's more annoying to conditionally create it.
Sage Weil [Wed, 22 Feb 2012 00:10:37 +0000 (16:10 -0800)]
ceph: always try to process valgrind logs
Check for errors in valgrind logs even if there is no valgrind option in
the ceph task config stanza. Other tasks can run via valgrind (ceph-fuse,
rgw). If the logs aren't there, this is harmless.
Sage Weil [Wed, 22 Feb 2012 00:08:21 +0000 (16:08 -0800)]
rgw: add valgrind support
  tasks:
  - ceph:
  - rgw:
      client.a:
        valgrind: [--tool=memcheck]
Sage Weil [Tue, 21 Feb 2012 23:47:32 +0000 (15:47 -0800)]
rgw: accept dict
e.g.,
  tasks:
  ...
  - rgw:
      client.0:
      client.1:
Sage Weil [Fri, 24 Feb 2012 04:07:24 +0000 (20:07 -0800)]
lost_unfound: new mark_unfound_lost syntax
Josh Durgin [Fri, 24 Feb 2012 01:07:26 +0000 (17:07 -0800)]
dump_stuck: flush stats before waiting for recovery/clean
Josh Durgin [Tue, 21 Feb 2012 21:11:05 +0000 (13:11 -0800)]
Add a task for testing stuck pg visibility.
Josh Durgin [Tue, 21 Feb 2012 23:01:45 +0000 (15:01 -0800)]
Move duration calculation to an internal task
This excludes all generic start up costs, like waiting for locks,
rebooting into a new kernel, etc.
Josh Durgin [Tue, 21 Feb 2012 22:54:33 +0000 (14:54 -0800)]
Add necessary imports for s3 tasks, and keep them alphabetical.
Yehuda Sadeh [Tue, 21 Feb 2012 20:23:38 +0000 (12:23 -0800)]
s3roundtrip, s3readwrite: access key uses url safe chars
Signed-off-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Yehuda Sadeh [Tue, 21 Feb 2012 20:12:03 +0000 (12:12 -0800)]
rgw: access key uses url safe chars
Signed-off-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
Sage Weil [Mon, 20 Feb 2012 23:17:52 +0000 (15:17 -0800)]
ceph: valgrind trumps coverage when picking a flavor
valgrind will crash if we don't use notcmalloc; coverage will silently
fail to collect coverage info.
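The precedence can be sketched as below. The flavor names come from the gitbuilder commit elsewhere in this log; mapping coverage to the gcov flavor is an assumption, as is the function name `pick_flavor`.

```python
def pick_flavor(valgrind=False, coverage=False):
    """Choose a build flavor; valgrind takes precedence over coverage."""
    if valgrind:
        # valgrind crashes unless the build avoids tcmalloc.
        return "notcmalloc"
    if coverage:
        # Assumed mapping: coverage runs use the gcov flavor.
        return "gcov"
    return "basic"
```

With both options set, coverage data is silently lost, which is the lesser failure compared with a crash.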
Sage Weil [Mon, 20 Feb 2012 22:54:10 +0000 (14:54 -0800)]
ceph.conf: no lockdep by default
Sage Weil [Mon, 20 Feb 2012 21:38:06 +0000 (13:38 -0800)]
suite.results: include test duration in output
Sage Weil [Mon, 20 Feb 2012 15:12:53 +0000 (07:12 -0800)]
cfuse -> ceph-fuse
Sage Weil [Mon, 20 Feb 2012 15:04:45 +0000 (07:04 -0800)]
ceph: allow valgrind per-type (not just per-name)
Sage Weil [Mon, 20 Feb 2012 03:40:45 +0000 (19:40 -0800)]
lost_unfound: mark osds in when we revive them
so that we test what we meant to. It also lets us actually go clean at the
very end.
Sage Weil [Sat, 18 Feb 2012 22:44:53 +0000 (14:44 -0800)]
ceph_manager: ignore stale states when counting
also remove assumptions about ordering of states
Sage Weil [Sat, 18 Feb 2012 05:53:25 +0000 (21:53 -0800)]
wait_till_clean -> wait_for_clean and wait_for_recovery
Clean now also means the correct number of replicas, whereas recovered
means we have done all the work we can do given the replicas/osds we have.
For example, degraded and clean are now mutually exclusive.
Also move away from 'till'.
Sage Weil [Tue, 14 Feb 2012 23:24:11 +0000 (15:24 -0800)]
backfill: wait for clean before writing+blackholing
If we have straggler pgs and blackhole osd.1, we can deadlock because we
need info from that osd to repeer and continue. Make sure we're clean, and
then start the write + blackhole + kill test.
Sage Weil [Tue, 14 Feb 2012 23:23:19 +0000 (15:23 -0800)]
nuke: nuke testrados too
Slightly fewer nuke -r's
Sage Weil [Sun, 12 Feb 2012 22:36:11 +0000 (14:36 -0800)]
ceph_manager: mark in a bit more often than out
Otherwise we can get into cases where many/most nodes are out, and things
don't work as well. e.g., crush may start to fail.
Sage Weil [Sat, 11 Feb 2012 22:24:39 +0000 (14:24 -0800)]
ceph: use any fs, not just btrfs, on scratch devices
The
  btrfs: true
syntax is replaced with
  fs: btrfs
or ext4, xfs.
Sage Weil [Sat, 11 Feb 2012 22:20:41 +0000 (14:20 -0800)]
nuke: nuke testrados and rados processes, too
So that -r is needed slightly less often.
Sage Weil [Sat, 11 Feb 2012 22:20:18 +0000 (14:20 -0800)]
misc: make get_scratch_devices look for (almost) any disk that's not mounted
Sage Weil [Sat, 11 Feb 2012 22:19:49 +0000 (14:19 -0800)]
hammer.sh: assume path is set
Josh Durgin [Thu, 2 Feb 2012 17:29:03 +0000 (09:29 -0800)]
ceph: always add logger for daemons
The extra log function added redundant info and didn't allow different
levels.
Josh Durgin [Thu, 2 Feb 2012 17:27:11 +0000 (09:27 -0800)]
ceph: rename type parameter to type_
type is a built-in and shouldn't be aliased.
Josh Durgin [Thu, 2 Feb 2012 17:27:04 +0000 (09:27 -0800)]
ceph: use the correct comparison operator
is compares identity (i.e. address in cpython), not value.
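A short, generic illustration of the difference (not the code that was fixed):

```python
a = [1, 2, 3]
b = list(a)  # same contents, but a distinct object

values_equal = (a == b)   # compares values
same_object = (a is b)    # compares identity
```

Here `values_equal` is true while `same_object` is false, which is why `is` is the wrong operator for value comparisons.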
Josh Durgin [Thu, 2 Feb 2012 17:26:45 +0000 (09:26 -0800)]
ceph: sync before unmounting btrfs devices
There may still be writes in flight, since the osds may not have
shutdown cleanly. This should prevent EBUSY when unmounting.
Fixes: #1997
Josh Durgin [Thu, 2 Feb 2012 17:26:25 +0000 (09:26 -0800)]
ceph: delay raising exceptions until all daemons are stopped
If a daemon crashes, the exception is raised when we stop it. This
caused some daemons to continue running during cleanup, since the rest
of the daemons of the same type would not be shut down. Also log each
daemon that crashed, for easier debugging.
Fixes: #1744
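The pattern described above can be sketched as follows, assuming each daemon is represented by a callable that stops it and raises if the daemon had crashed. The names `stop_all`, `daemons`, and `log` are hypothetical.

```python
def stop_all(daemons, log=print):
    """Stop every daemon, then raise once if any of them had crashed.

    `daemons` maps a daemon name to a zero-argument stop callable.
    """
    failures = []
    for name, stop in daemons.items():
        try:
            stop()
        except Exception as exc:
            # Record and log the crash, but keep stopping the rest.
            log("daemon %s crashed: %s" % (name, exc))
            failures.append(name)
    if failures:
        raise RuntimeError("daemons crashed: " + ", ".join(failures))
```

Deferring the raise means cleanup always visits every daemon, and the log shows each crash individually.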
Sage Weil [Wed, 1 Feb 2012 00:25:53 +0000 (16:25 -0800)]
add backfill task
This does a basic test of backfill functionality, including a divergent
log on a backfill target (#1983).
Sage Weil [Wed, 1 Feb 2012 00:13:59 +0000 (16:13 -0800)]
ceph_manager: add manager.blackhole_kill_osd()
This will suspend disk writes for a couple seconds and then kill the
daemon. It helps us simulate a hardware failure.
Tommi Virtanen [Tue, 31 Jan 2012 16:05:36 +0000 (08:05 -0800)]
Allow user to disable lock checking.
The new plana hardware isn't in the old sepia lock database,
and the machine pools are risky to merge as nothing in the
software guarantees allocation from just one pool. This allows
us to hand-allocate machines temporarily.
Tommi Virtanen [Tue, 31 Jan 2012 15:59:26 +0000 (07:59 -0800)]
Allow user to provide flavor to use.
With this, you can use Ubuntu 11.10 machines with teuthology by saying::
  tasks:
  - ceph:
      flavor: oneiric
  ...
Josh Durgin [Fri, 27 Jan 2012 19:26:42 +0000 (11:26 -0800)]
Add admin socket task.
This simply gets the output of an admin socket command, makes sure
it's json, and runs a user-provided test script on it.
Samuel Just [Tue, 24 Jan 2012 19:28:38 +0000 (11:28 -0800)]
CephManager: base timeout on time since last change in active+clean
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Josh Durgin [Tue, 17 Jan 2012 23:35:19 +0000 (15:35 -0800)]
kernel: ignore connection problems while waiting for reboot
Sage Weil [Tue, 17 Jan 2012 17:24:54 +0000 (09:24 -0800)]
thrashosds: maxdead default to 0
This avoids any possibility of blocking peering.
Sage Weil [Tue, 17 Jan 2012 00:53:55 +0000 (16:53 -0800)]
task/rados: use new usage for radosmodel tool
Sage Weil [Mon, 16 Jan 2012 22:43:56 +0000 (14:43 -0800)]
thrashosds: fix action selection
I'm not sure what the old code was trying to do, but I'm pretty sure it
wasn't doing it correctly.. a .1 chance_down was killing an OSD for me
virtually every time.
Sage Weil [Mon, 16 Jan 2012 22:40:34 +0000 (14:40 -0800)]
thrashosds: make actions less nonsensical
Make marking OSD up/down and in/out totally orthogonal.
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Mon, 16 Jan 2012 21:18:49 +0000 (13:18 -0800)]
ls: include duration, less noise
Sage Weil [Mon, 16 Jan 2012 21:18:31 +0000 (13:18 -0800)]
hammer.sh: new -nuke syntax
Sage Weil [Mon, 16 Jan 2012 20:39:20 +0000 (12:39 -0800)]
include run duration in summary.yaml
Sage Weil [Mon, 16 Jan 2012 18:47:44 +0000 (10:47 -0800)]
ls: fix extraneous newline
Sage Weil [Mon, 16 Jan 2012 17:55:47 +0000 (09:55 -0800)]
ceph: ignore all leaks
unless/until we figure out where the DefinitelyLost records are coming
from.. at first glance they look bogus.
Sage Weil [Tue, 20 Dec 2011 22:10:22 +0000 (14:10 -0800)]
ceph: take single arg or list for valgrind args
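Accepting either form usually comes down to a small normalization step. A minimal sketch, with the helper name `as_arg_list` being an assumption rather than the actual implementation:

```python
def as_arg_list(value):
    """Normalize a single argument or a list of arguments to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)
```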
Sage Weil [Mon, 19 Dec 2011 22:12:39 +0000 (14:12 -0800)]
combined mon, osd, mds starter functions
Sage Weil [Fri, 23 Sep 2011 16:40:52 +0000 (09:40 -0700)]
rbd: default to all:
Sage Weil [Mon, 16 Jan 2012 06:48:33 +0000 (22:48 -0800)]
use local mirrors for (most) github urls
A cronjob on ceph.newdream.net updates these every 15 minutes. Sigh.
Sage Weil [Sat, 14 Jan 2012 06:08:33 +0000 (22:08 -0800)]
teuthology-ls: show pid, last line of output for running jobs
Sage Weil [Sat, 14 Jan 2012 05:56:37 +0000 (21:56 -0800)]
show host -> roles mapping on startup
Less guessing when manually inspecting an in-progress or hung run.
Sage Weil [Thu, 12 Jan 2012 23:08:11 +0000 (15:08 -0800)]
lost_unfound: make test work with backfill
If we backfill, we fail to peer instead of having every object show up as
'unfound'. Avoid that by preventing log trimming, so that we always do
log recovery for this test.
Tommi Virtanen [Fri, 13 Jan 2012 19:26:36 +0000 (11:26 -0800)]
Use yaml.safe_dump so unicode doesn't mess up the yaml files.
In general, yaml.dump is comparable to pickle, and my personal
coding standard says *never* use it. yaml.safe_dump is much nicer.
yaml.dump should have been named yaml.unsafe_dump, yaml.safe_dump
should have been named yaml.dump :(
Josh Durgin [Thu, 12 Jan 2012 22:48:36 +0000 (14:48 -0800)]
nuke: take config files from -t argument
teuthology-lock and teuthology-updatekeys both use -t for this already
Josh Durgin [Thu, 12 Jan 2012 20:57:22 +0000 (12:57 -0800)]
kernel: loop reconnecting in case we race with shutdown
Previously, if we reconnected before shutdown completed we asserted
that the kernel did not boot into the new version, when we just needed
to wait for the machine to reboot.
Sage Weil [Wed, 11 Jan 2012 14:59:41 +0000 (06:59 -0800)]
thrasher: don't mark down osds out; tell monitor same
Stopping ceph-osd doesn't make it out (immediately). Prevent monitor
from doing this after a delay too so we can keep our notion of what is
up/down/in/out accurate.
Sage Weil [Wed, 11 Jan 2012 00:21:00 +0000 (16:21 -0800)]
lost_unfound: typo
Sage Weil [Wed, 11 Jan 2012 00:20:50 +0000 (16:20 -0800)]
thrasher: adjust min_dead default
Make this 1, not 2. That's a bit more friendly. It doesn't strictly
matter, tho, since we revive osds before waiting for clean.
Sage Weil [Tue, 10 Jan 2012 21:57:55 +0000 (13:57 -0800)]
thrasher: add max_dead
Add max_dead, and revive osds prior to waiting for clean. Otherwise we
can leave too many OSDs down and the cluster will never go clean.
Sage Weil [Sun, 8 Jan 2012 23:14:18 +0000 (15:14 -0800)]
verify all osds start before checking health
Just checking health isn't good enough, since it races with OSD startup:
we can have a healthy cluster with 0 (or something else < total) OSDs.
Josh Durgin [Wed, 11 Jan 2012 00:04:09 +0000 (16:04 -0800)]
ceph: let the user running ceph-osd remove subvolumes
This will prevent EPERM when using the SNAP_DESTROY ioctl,
so the filestore will use btrfs snaps.
Josh Durgin [Tue, 10 Jan 2012 23:24:44 +0000 (15:24 -0800)]
syslog: ignore lockdep non-static key warning
It looks like this warning was made default in linux 3.2.
This will keep happening until #1922 is done.
Sage Weil [Sun, 8 Jan 2012 22:39:30 +0000 (14:39 -0800)]
run: put pid in archive dir
This will make it easy for teuthology-ls to show you the running process's
pid (if it's still running). Or for other utilities to kill + clean up
a hung teuthology run.
Sage Weil [Sat, 7 Jan 2012 01:21:38 +0000 (17:21 -0800)]
ceph_manager: a booting osd is no longer automatically marked in
as of ceph.git commit
96b7b0d83e5fe70a4efb4e284e18b4b40840bfec
Sage Weil [Fri, 6 Jan 2012 23:12:15 +0000 (15:12 -0800)]
mon_recovery: need n/2 + 1 monitors for quorum
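The arithmetic in the subject line, written out with integer division (the function name `quorum_size` is chosen for illustration):

```python
def quorum_size(num_monitors):
    """Smallest strict majority of num_monitors: n/2 + 1 with integer division."""
    return num_monitors // 2 + 1
```

For example, 3 monitors need 2 for quorum, while 4 monitors need 3.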
Sage Weil [Fri, 6 Jan 2012 21:36:54 +0000 (13:36 -0800)]
ceph: don't skip monitor ports
We can use the same port multiple times if they are on different hosts.
Josh Durgin [Fri, 6 Jan 2012 01:27:28 +0000 (17:27 -0800)]
suite: make email-on-success the default behavior
This way you can tell when a run is complete, instead of wondering if
it's stuck in the queue.
Josh Durgin [Tue, 3 Jan 2012 22:07:45 +0000 (14:07 -0800)]
rados: fix example config
Josh Durgin [Tue, 3 Jan 2012 20:25:14 +0000 (12:25 -0800)]
nuke-on-error: only unlock if this run locked the machines