git.apps.os.sepia.ceph.com Git - teuthology.git/log
teuthology.git
13 years agoAdd admin socket task.
Josh Durgin [Fri, 27 Jan 2012 19:26:42 +0000 (11:26 -0800)]
Add admin socket task.

This simply gets the output of an admin socket command, makes sure
it's json, and runs a user-provided test script on it.
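
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the task's actual code: it assumes the command output has already been captured as a string, and both `check_admin_socket_output` and the `test` callable are hypothetical stand-ins (the real task runs a user-provided script).

```python
import json


def check_admin_socket_output(raw_output, test):
    """Parse raw_output as JSON and hand the result to a caller-supplied test.

    Both this function name and the `test` callable are hypothetical; they
    stand in for the task's JSON check and the user-provided test script.
    """
    try:
        parsed = json.loads(raw_output)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        raise AssertionError("output is not valid JSON: %s" % exc)
    return test(parsed)
```

Failing early on invalid JSON keeps the user-supplied check from having to deal with malformed input.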

13 years agoCephManager: base timeout on time since last change in active+clean
Samuel Just [Tue, 24 Jan 2012 19:28:38 +0000 (11:28 -0800)]
CephManager: base timeout on time since last change in active+clean

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
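
One way to read "base timeout on time since last change" is that the clock restarts whenever the active+clean count moves, so a slow but progressing cluster is not failed. The sketch below illustrates that idea only; `wait_for_clean`, `get_num_active_clean`, and the injectable `clock`/`sleep` parameters are assumptions made for the example, not CephManager's interface.

```python
import time


def wait_for_clean(get_num_active_clean, total_pgs, timeout,
                   poll_interval=1.0, clock=time.time, sleep=time.sleep):
    """Wait until every PG is active+clean.

    The timeout is measured from the last time the active+clean count
    changed, not from the start of the wait. All names are illustrative.
    """
    last_count = get_num_active_clean()
    last_change = clock()
    while last_count < total_pgs:
        if clock() - last_change > timeout:
            raise RuntimeError(
                "active+clean count unchanged for more than %s seconds" % timeout)
        sleep(poll_interval)
        count = get_num_active_clean()
        if count != last_count:
            # Progress was made, so restart the timeout window.
            last_count = count
            last_change = clock()
    return last_count
```
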
13 years agokernel: ignore connection problems while waiting for reboot
Josh Durgin [Tue, 17 Jan 2012 23:35:19 +0000 (15:35 -0800)]
kernel: ignore connection problems while waiting for reboot

13 years agothrashosds: maxdead default to 0
Sage Weil [Tue, 17 Jan 2012 17:24:54 +0000 (09:24 -0800)]
thrashosds: maxdead default to 0

This avoids any possibility of blocking peering.

13 years agotask/rados: use new usage for radosmodel tool
Sage Weil [Tue, 17 Jan 2012 00:53:55 +0000 (16:53 -0800)]
task/rados: use new usage for radosmodel tool

13 years agothrashosds: fix action selection
Sage Weil [Mon, 16 Jan 2012 22:43:56 +0000 (14:43 -0800)]
thrashosds: fix action selection

I'm not sure what the old code was trying to do, but I'm pretty sure it
wasn't doing it correctly.. a .1 chance_down was killing an OSD for me
virtually every time.
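
For illustration, here is one way to pick an action so that a configured chance maps directly to how often the action is chosen. It is a sketch under assumptions, not the fix made in this commit: `choose_action` and the (name, weight) pairs are hypothetical.

```python
import random


def choose_action(actions, rng=random):
    """Pick one action with probability proportional to its weight.

    `actions` is a list of (name, weight) pairs. The names and weights are
    hypothetical and are not taken from the thrashosds task.
    """
    weighted = [(name, weight) for name, weight in actions if weight > 0]
    if not weighted:
        raise ValueError("no action has a positive weight")
    total = sum(weight for _, weight in weighted)
    pick = rng.uniform(0, total)
    for name, weight in weighted:
        if pick < weight:
            return name
        pick -= weight
    return weighted[-1][0]  # guard against floating-point round-off
```

With weights of 0.1 and 0.9, the first action should come up in roughly one draw out of ten.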

13 years agothrashosds: make actions less nonsensical
Sage Weil [Mon, 16 Jan 2012 22:40:34 +0000 (14:40 -0800)]
thrashosds: make actions less nonsensical

Make marking OSD up/down and in/out totally orthogonal.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agols: include duration, less noise
Sage Weil [Mon, 16 Jan 2012 21:18:49 +0000 (13:18 -0800)]
ls: include duration, less noise

13 years agohammer.sh: new -nuke syntax
Sage Weil [Mon, 16 Jan 2012 21:18:31 +0000 (13:18 -0800)]
hammer.sh: new -nuke syntax

13 years agoinclude run duration in summary.yaml
Sage Weil [Mon, 16 Jan 2012 20:39:20 +0000 (12:39 -0800)]
include run duration in summary.yaml

13 years agols: fix extraneous newline
Sage Weil [Mon, 16 Jan 2012 18:47:44 +0000 (10:47 -0800)]
ls: fix extraneous newline

13 years agoceph: ignore all leaks
Sage Weil [Mon, 16 Jan 2012 17:55:47 +0000 (09:55 -0800)]
ceph: ignore all leaks

unless/until we figure out where the DefinitelyLost records are coming
from.. at first glance they look bogus.

13 years agoceph: take single arg or list for valgrind args
Sage Weil [Tue, 20 Dec 2011 22:10:22 +0000 (14:10 -0800)]
ceph: take single arg or list for valgrind args

13 years agocombined mon, osd, mds starter functions
Sage Weil [Mon, 19 Dec 2011 22:12:39 +0000 (14:12 -0800)]
combined mon, osd, mds starter functions

13 years agorbd: default to all:
Sage Weil [Fri, 23 Sep 2011 16:40:52 +0000 (09:40 -0700)]
rbd: default to all:

13 years agouse local mirrors for (most) github urls
Sage Weil [Mon, 16 Jan 2012 06:48:33 +0000 (22:48 -0800)]
use local mirrors for (most) github urls

A cronjob on ceph.newdream.net updates these every 15 minutes.  Sigh.

13 years agoteuthology-ls: show pid, last line of output for running jobs
Sage Weil [Sat, 14 Jan 2012 06:08:33 +0000 (22:08 -0800)]
teuthology-ls: show pid, last line of output for running jobs

13 years agoshow host -> roles mapping on startup
Sage Weil [Sat, 14 Jan 2012 05:56:37 +0000 (21:56 -0800)]
show host -> roles mapping on startup

Less guessing when manually inspecting an in-progress or hung run.

13 years agolost_unfound: make test work with backfill
Sage Weil [Thu, 12 Jan 2012 23:08:11 +0000 (15:08 -0800)]
lost_unfound: make test work with backfill

If we backfill, we fail to peer instead of having every object show up as
'unfound'.  Avoid that by preventing log trimming, so that we always do
log recovery for this test.

13 years agoUse yaml.safe_dump so unicode doesn't mess up the yaml files.
Tommi Virtanen [Fri, 13 Jan 2012 19:26:36 +0000 (11:26 -0800)]
Use yaml.safe_dump so unicode doesn't mess up the yaml files.

In general, yaml.dump is comparable to pickle, and my personal
coding standard says *never* use it. yaml.safe_dump is much nicer.
yaml.dump should have been named yaml.unsafe_dump, yaml.safe_dump
should have been named yaml.dump :(

13 years agonuke: take config files from -t argument
Josh Durgin [Thu, 12 Jan 2012 22:48:36 +0000 (14:48 -0800)]
nuke: take config files from -t argument

teuthology-lock and teuthology-updatekeys both use -t for this already

13 years agokernel: loop reconnecting in case we race with shutdown
Josh Durgin [Thu, 12 Jan 2012 20:57:22 +0000 (12:57 -0800)]
kernel: loop reconnecting in case we race with shutdown

Previously, if we reconnected before shutdown completed we asserted
that the kernel did not boot into the new version, when we just needed
to wait for the machine to reboot.

13 years agothrasher: don't mark down osds out; tell monitor same
Sage Weil [Wed, 11 Jan 2012 14:59:41 +0000 (06:59 -0800)]
thrasher: don't mark down osds out; tell monitor same

Stopping ceph-osd doesn't make it out (immediately).  Prevent monitor
from doing this after a delay too so we can keep our notion of what is
up/down/in/out accurate.

13 years agolost_unfound: typo
Sage Weil [Wed, 11 Jan 2012 00:21:00 +0000 (16:21 -0800)]
lost_unfound: typo

13 years agothrasher: adjust min_dead default
Sage Weil [Wed, 11 Jan 2012 00:20:50 +0000 (16:20 -0800)]
thrasher: adjust min_dead default

Make this 1, not 2.  That's a bit more friendly.  It doesn't strictly
matter, tho, since we revive osds before waiting for clean.

13 years agothrasher: add max_dead
Sage Weil [Tue, 10 Jan 2012 21:57:55 +0000 (13:57 -0800)]
thrasher: add max_dead

Add max_dead, and revive osds prior to waiting for clean.  Otherwise we
can leave too many OSDs down and the cluster will never go clean.

13 years agoverify all osds start before checking health
Sage Weil [Sun, 8 Jan 2012 23:14:18 +0000 (15:14 -0800)]
verify all osds start before checking health

Just checking health isn't good enough, since it races with OSD startup:
we can have a healthy cluster with 0 (or something else < total) OSDs.

13 years agoceph: let the user running ceph-osd remove subvolumes
Josh Durgin [Wed, 11 Jan 2012 00:04:09 +0000 (16:04 -0800)]
ceph: let the user running ceph-osd remove subvolumes

This will prevent EPERM when using the SNAP_DESTROY ioctl,
so the filestore will use btrfs snaps.

13 years agosyslog: ignore lockdep non-static key warning
Josh Durgin [Tue, 10 Jan 2012 23:24:44 +0000 (15:24 -0800)]
syslog: ignore lockdep non-static key warning

It looks like this warning was made default in linux 3.2.
This will keep happening until #1922 is done.

13 years agorun: put pid in archive dir
Sage Weil [Sun, 8 Jan 2012 22:39:30 +0000 (14:39 -0800)]
run: put pid in archive dir

This will make it easy for teuthology-ls to show you the running process's
pid (if it's still running).  Or for other utilities to kill + clean up
a hung teuthology run.

13 years agoceph_manager: a booting osd is no longer automatically marked in
Sage Weil [Sat, 7 Jan 2012 01:21:38 +0000 (17:21 -0800)]
ceph_manager: a booting osd is no longer automatically marked in

as of ceph.git commit 96b7b0d83e5fe70a4efb4e284e18b4b40840bfec

13 years agomon_recovery: need n/2 + 1 monitors for quorum
Sage Weil [Fri, 6 Jan 2012 23:12:15 +0000 (15:12 -0800)]
mon_recovery: need n/2 + 1 monitors for quorum
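
The arithmetic in the subject line, written out as a small sketch (the function name `quorum_size` is an assumption made for the example):

```python
def quorum_size(num_monitors):
    """Smallest number of monitors that forms a strict majority."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1
```

With 3 monitors this gives 2, and with 4 it gives 3, so an even-sized cluster tolerates no more failures than the odd-sized one below it.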

13 years agoceph: don't skip monitor ports
Sage Weil [Fri, 6 Jan 2012 21:36:54 +0000 (13:36 -0800)]
ceph: don't skip monitor ports

We can use the same port multiple times if they are on different hosts.

13 years agosuite: make email-on-success the default behavior
Josh Durgin [Fri, 6 Jan 2012 01:27:28 +0000 (17:27 -0800)]
suite: make email-on-success the default behavior

This way you can tell when a run is complete, instead of wondering if
it's stuck in the queue.

13 years agorados: fix example config
Josh Durgin [Tue, 3 Jan 2012 22:07:45 +0000 (14:07 -0800)]
rados: fix example config

13 years agonuke-on-error: only unlock if this run locked the machines
Josh Durgin [Tue, 3 Jan 2012 20:25:14 +0000 (12:25 -0800)]
nuke-on-error: only unlock if this run locked the machines

13 years agoRemove unused mon.0 variables.
Josh Durgin [Tue, 3 Jan 2012 20:05:17 +0000 (12:05 -0800)]
Remove unused mon.0 variables.

13 years agorados: use testrados instead of testsnaps and testreadwrite
Josh Durgin [Sat, 31 Dec 2011 03:27:27 +0000 (19:27 -0800)]
rados: use testrados instead of testsnaps and testreadwrite

13 years agorados: remove unused variable
Josh Durgin [Fri, 30 Dec 2011 21:53:45 +0000 (13:53 -0800)]
rados: remove unused variable

13 years agorados: clean up argument construction
Josh Durgin [Fri, 30 Dec 2011 21:48:58 +0000 (13:48 -0800)]
rados: clean up argument construction

Only the client id varies, so it can be done outside the loop. Also
handle coredumps and coverage, and use LD_LIBRARY_PATH instead of
LD_PRELOAD.

13 years agorados: fix references to testrados
Josh Durgin [Fri, 30 Dec 2011 20:54:55 +0000 (12:54 -0800)]
rados: fix references to testrados

13 years agorados: fix documentation format
Josh Durgin [Fri, 30 Dec 2011 20:50:59 +0000 (12:50 -0800)]
rados: fix documentation format

13 years agomisc: simplify reconnect logic
Josh Durgin [Fri, 30 Dec 2011 20:23:28 +0000 (12:23 -0800)]
misc: simplify reconnect logic

Ignore all errors until the timeout expires so we don't have to worry
about whitelisting them.
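
A retry loop in that spirit might look like the sketch below. It is not the code from this commit: `reconnect`, `connect`, and the injectable `clock`/`sleep` parameters are hypothetical names chosen for the example.

```python
import time


def reconnect(connect, timeout, interval=1.0, clock=time.time, sleep=time.sleep):
    """Call connect() until it succeeds or `timeout` seconds have passed.

    Every exception raised before the deadline is ignored; once the
    deadline has passed, the most recent one is re-raised.
    """
    deadline = clock() + timeout
    while True:
        try:
            return connect()
        except Exception:
            if clock() >= deadline:
                raise
            sleep(interval)
```

Catching everything avoids maintaining a list of expected errors, at the cost of hiding unexpected ones until the deadline.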

13 years agoteuthology rgw-admin: annotated test cases for inventory
Mark Kampe [Thu, 29 Dec 2011 21:09:08 +0000 (13:09 -0800)]
teuthology rgw-admin: annotated test cases for inventory
   this is not a nose suite, so I simply added test case
   descriptions in csv format, and put a file to extract
   them at the top of the file.
Signed-off-by: Mark Kampe <mark.kampe@dreamhost.com>
13 years agosyslog checking: forgot a pipe
Josh Durgin [Sat, 17 Dec 2011 02:09:09 +0000 (18:09 -0800)]
syslog checking: forgot a pipe

13 years agorountrip: add task
Yehuda Sadeh [Thu, 15 Dec 2011 21:24:53 +0000 (13:24 -0800)]
rountrip: add task

13 years agoreadwrite: fix task with default conf
Yehuda Sadeh [Thu, 15 Dec 2011 20:39:39 +0000 (12:39 -0800)]
readwrite: fix task with default conf

13 years agoreadwrite: fix conf, task runs
Yehuda Sadeh [Thu, 15 Dec 2011 01:14:30 +0000 (17:14 -0800)]
readwrite: fix conf, task runs

13 years agoreadwrite: add readwrite task
Yehuda Sadeh [Thu, 15 Dec 2011 00:12:01 +0000 (16:12 -0800)]
readwrite: add readwrite task

still not really running, but at least getting configured

13 years agocoverage: use locally stored build instead of downloading from a gitbuilder
Josh Durgin [Wed, 14 Dec 2011 00:16:09 +0000 (16:16 -0800)]
coverage: use locally stored build instead of downloading from a gitbuilder

13 years agoIgnore lockdep being turned off for now.
Josh Durgin [Tue, 13 Dec 2011 00:29:37 +0000 (16:29 -0800)]
Ignore lockdep being turned off for now.

Some machines are hitting this udev issue:
http://marc.info/?l=linux-kernel&m=132033587908426&w=2 and lockdep is
turned off after the first warning.

13 years agocoverage: don't generate html reports for each test
Josh Durgin [Fri, 9 Dec 2011 01:47:14 +0000 (17:47 -0800)]
coverage: don't generate html reports for each test

These can always be generated from the lcov files later, right now they just waste space.

13 years agosyslog: ignore 'task blocked' warnings
Josh Durgin [Fri, 9 Dec 2011 01:17:47 +0000 (17:17 -0800)]
syslog: ignore 'task blocked' warnings

These will happen under heavy load (usually on the osd).

13 years agointernal: check syslog for errors
Josh Durgin [Wed, 7 Dec 2011 23:20:33 +0000 (15:20 -0800)]
internal: check syslog for errors

This should catch lockdep warnings and mark tests with them as failed.

13 years agoworkunit: set client id and secretfile env vars
Josh Durgin [Wed, 7 Dec 2011 00:16:38 +0000 (16:16 -0800)]
workunit: set client id and secretfile env vars

These are used by the kernel rbd workunit to know how to map images.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
13 years agoRename "testrados" and "testswift" tasks to not begin with "test".
Tommi Virtanen [Mon, 5 Dec 2011 18:07:25 +0000 (10:07 -0800)]
Rename "testrados" and "testswift" tasks to not begin with "test".

Anything "test*" looks like a unit test, and shouldn't be used for
actual code.

13 years agoFix unit tests for SSH keep-alive setting.
Tommi Virtanen [Mon, 5 Dec 2011 17:55:02 +0000 (09:55 -0800)]
Fix unit tests for SSH keep-alive setting.

Commit 6e3e0d7cdcb5ba70f938f0850a8828aca2753ab5 failed to pass
unit tests.

13 years agoHandle interactive-on-error also when error is from contextmanager exit.
Tommi Virtanen [Thu, 1 Dec 2011 01:07:26 +0000 (17:07 -0800)]
Handle interactive-on-error also when error is from contextmanager exit.

Closes: http://tracker.newdream.net/issues/1745
13 years agoProperly handle case where first error is inside a context manager __exit__.
Tommi Virtanen [Tue, 22 Nov 2011 00:00:19 +0000 (16:00 -0800)]
Properly handle case where first error is inside a context manager __exit__.

Closes: http://tracker.newdream.net/issues/1743
13 years agonuke: don't specify full path
Sage Weil [Sun, 20 Nov 2011 04:56:26 +0000 (20:56 -0800)]
nuke: don't specify full path

/tmp/cephtest/binary may have been removed; kill stray daemons by name
only.  we really don't care about false positives here!

13 years agoceph_manager: %
Sage Weil [Thu, 17 Nov 2011 21:52:17 +0000 (13:52 -0800)]
ceph_manager: %

13 years agoSave summary after nuking machines.
Josh Durgin [Fri, 18 Nov 2011 21:53:51 +0000 (13:53 -0800)]
Save summary after nuking machines.

This way you can tell when tests are entirely finished running.

13 years agoAdd an example overrides file for running regression tests.
Josh Durgin [Fri, 18 Nov 2011 20:22:18 +0000 (12:22 -0800)]
Add an example overrides file for running regression tests.

13 years agosuite: put common config before facets
Josh Durgin [Fri, 18 Nov 2011 01:26:21 +0000 (17:26 -0800)]
suite: put common config before facets

This lets you add tasks to the beginning of a run, like the chef task.

13 years agosuite: schedule a list of collections for running instead of a single suite directory
Josh Durgin [Fri, 18 Nov 2011 01:14:05 +0000 (17:14 -0800)]
suite: schedule a list of collections for running instead of a single suite directory

13 years agotestswift: fix config
Yehuda Sadeh [Fri, 18 Nov 2011 00:53:21 +0000 (16:53 -0800)]
testswift: fix config

13 years agoClean up C++isms.
Tommi Virtanen [Fri, 18 Nov 2011 01:00:44 +0000 (17:00 -0800)]
Clean up C++isms.

13 years agoAdd a task for easily running chef-solo on all the nodes.
Tommi Virtanen [Fri, 18 Nov 2011 00:49:47 +0000 (16:49 -0800)]
Add a task for easily running chef-solo on all the nodes.

13 years agoceph_manager: fix logging
Sage Weil [Thu, 17 Nov 2011 21:46:02 +0000 (13:46 -0800)]
ceph_manager: fix logging

13 years agoceph: deep merge overrides, so e.g. log whitelists can be overridden
Josh Durgin [Thu, 17 Nov 2011 21:07:03 +0000 (13:07 -0800)]
ceph: deep merge overrides, so e.g. log whitelists can be overridden
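
To show why a deep merge matters for nested settings, here is a minimal sketch. Its semantics are an assumption made for illustration (dicts merged key by key, lists concatenated, anything else replaced) and may differ from teuthology's own `deep_merge`.

```python
def deep_merge(base, overrides):
    """Recursively merge `overrides` into `base` and return a new value.

    Illustrative semantics, not necessarily those of teuthology's helper:
    nested dicts are merged key by key, lists are concatenated, and any
    other value in `overrides` replaces the one in `base`.
    """
    if isinstance(base, dict) and isinstance(overrides, dict):
        merged = dict(base)
        for key, value in overrides.items():
            merged[key] = deep_merge(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(overrides, list):
        return base + overrides
    return overrides
```

A shallow `dict.update` would replace the whole nested section, dropping every setting the override did not repeat.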

13 years agomisc: move deep_merge out of the MergeConfig class - it's generic
Josh Durgin [Thu, 17 Nov 2011 21:06:36 +0000 (13:06 -0800)]
misc: move deep_merge out of the MergeConfig class - it's generic

13 years agoSave config after locking nodes, so targets are included.
Josh Durgin [Thu, 17 Nov 2011 19:57:07 +0000 (11:57 -0800)]
Save config after locking nodes, so targets are included.

13 years agofilestore_idempotent: remove unused import
Josh Durgin [Thu, 17 Nov 2011 19:18:24 +0000 (11:18 -0800)]
filestore_idempotent: remove unused import

13 years agomon_recovery: remove unused code and import
Josh Durgin [Thu, 17 Nov 2011 19:15:47 +0000 (11:15 -0800)]
mon_recovery: remove unused code and import

13 years agothrashosds: timeout for every clean check, not just the last one
Josh Durgin [Thu, 17 Nov 2011 19:11:33 +0000 (11:11 -0800)]
thrashosds: timeout for every clean check, not just the last one

13 years agoceph_manager: add a default timeout of 5 minutes for mon quorum
Josh Durgin [Thu, 17 Nov 2011 19:05:12 +0000 (11:05 -0800)]
ceph_manager: add a default timeout of 5 minutes for mon quorum

13 years agoceph_manager: log mon quorum status so the logs show progress (or lack thereof)
Josh Durgin [Thu, 17 Nov 2011 18:45:19 +0000 (10:45 -0800)]
ceph_manager: log mon quorum status so the logs show progress (or lack thereof)

13 years agorgw: add swift task
Yehuda Sadeh [Thu, 17 Nov 2011 00:00:01 +0000 (16:00 -0800)]
rgw: add swift task

still not completely working (for some reason it skips all the tests)

13 years agofilestore_idempotent.py: simple task to test non-idempotent osd ops
Sage Weil [Fri, 11 Nov 2011 05:35:11 +0000 (21:35 -0800)]
filestore_idempotent.py: simple task to test non-idempotent osd ops

Write some non-idempotent events to the osd.  Simulate a failure.  Verify
the result is correct on replay.

This must be preceded by the ceph task just so that we get the binaries
installed.  Should clean this up later if/when the installation gets
factored out of ceph.py.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agomisc: allow >1 monitor per role in get_mon_names()
Sage Weil [Thu, 10 Nov 2011 22:13:24 +0000 (14:13 -0800)]
misc: allow >1 monitor per role in get_mon_names()

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoadd hammer.sh
Sage Weil [Wed, 9 Nov 2011 21:37:02 +0000 (13:37 -0800)]
add hammer.sh

simple script to repeat a test until it fails.  can probably do something much more sophisticated
here, but this works.

13 years agonuke: increase reboot timeout
Josh Durgin [Wed, 9 Nov 2011 18:39:56 +0000 (10:39 -0800)]
nuke: increase reboot timeout

Some sepia nodes are very slow to reboot.

13 years agomon_recovery: add task to test monitor cluster failure recovery
Sage Weil [Wed, 9 Nov 2011 06:06:43 +0000 (22:06 -0800)]
mon_recovery: add task to test monitor cluster failure recovery

Some simple tests to start with.  We still need some sort of mon cluster
thrashing.

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoceph_manager: manipulate monitors
Sage Weil [Wed, 9 Nov 2011 06:02:58 +0000 (22:02 -0800)]
ceph_manager: manipulate monitors

13 years agoceph: keep ceph.conf at ctx.ceph.conf
Sage Weil [Wed, 9 Nov 2011 06:00:32 +0000 (22:00 -0800)]
ceph: keep ceph.conf at ctx.ceph.conf

Signed-off-by: Sage Weil <sage@newdream.net>
13 years agoRemove unused imports and variable.
Josh Durgin [Wed, 9 Nov 2011 00:06:33 +0000 (16:06 -0800)]
Remove unused imports and variable.

13 years agoAdd nuke-on-error option.
Josh Durgin [Wed, 9 Nov 2011 00:01:39 +0000 (16:01 -0800)]
Add nuke-on-error option.

This lets automated jobs nuke and unlock machines after failed
tests. Each machine is nuked individually, so one down machine won't
keep others from being nuked and unlocked.
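
The per-machine isolation described above can be sketched like this. The names are hypothetical: `nuke_one` stands for whatever nukes and unlocks a single machine and raises on failure.

```python
def nuke_all(machines, nuke_one):
    """Nuke each machine separately and collect failures instead of stopping.

    `nuke_one` is a hypothetical callable that nukes and unlocks a single
    machine and raises an exception if it cannot.
    """
    failures = {}
    for machine in machines:
        try:
            nuke_one(machine)
        except Exception as exc:
            # Record the failure and keep going with the remaining machines.
            failures[machine] = exc
    return failures
```
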

13 years agoFix leftover orchestra import clause.
Tommi Virtanen [Mon, 7 Nov 2011 21:05:14 +0000 (13:05 -0800)]
Fix leftover orchestra import clause.

This seems to be a leftover from
a2372fce12b6bd1818e155d1d8ed5134dbd8fd4a,
no idea how it stayed hidden this long.

13 years agoceph_manager: log ceph -s output so progress is visible in the logs
Josh Durgin [Thu, 3 Nov 2011 20:27:44 +0000 (13:27 -0700)]
ceph_manager: log ceph -s output so progress is visible in the logs

13 years agoKeep each ssh connection alive.
Josh Durgin [Thu, 3 Nov 2011 20:08:39 +0000 (13:08 -0700)]
Keep each ssh connection alive.

With long-running jobs like thrashing, ssh connections were timing
out.

13 years agoconnection: allow the caller to specify whether keep-alive should be used
Josh Durgin [Thu, 3 Nov 2011 20:07:21 +0000 (13:07 -0700)]
connection: allow the caller to specify whether keep-alive should be used

13 years agolocker: fix race in locking
Josh Durgin [Thu, 3 Nov 2011 18:26:45 +0000 (11:26 -0700)]
locker: fix race in locking

The isolation level is lower than I thought. This made it possible for
two clients to think they both locked the same machines, since the
update would still be modifying each row to change the locked_since
time.
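
One common way to make a lock update tell the winner from the loser is to put the "not already locked" condition in the `WHERE` clause and check how many rows changed. The sketch below uses `sqlite3` purely for illustration; the lock server's actual database, schema, and fix are not shown in this log, and the `machine` table and its columns are made up.

```python
import sqlite3


def try_lock(conn, name, owner):
    """Lock one machine; return True only if this call changed the row.

    Because of the `locked = 0` condition, a second caller matches no
    rows, so rowcount distinguishes the two. Schema and names are made up.
    """
    cur = conn.execute(
        "UPDATE machine SET locked = 1, locked_by = ? "
        "WHERE name = ? AND locked = 0",
        (owner, name),
    )
    conn.commit()
    return cur.rowcount == 1
```
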

13 years agotestrados: set CEPH_CLIENT_ID without a ;
Samuel Just [Wed, 2 Nov 2011 18:33:37 +0000 (11:33 -0700)]
testrados: set CEPH_CLIENT_ID without a ;

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
13 years agotestrados: specify CEPH_CONF directly
Samuel Just [Mon, 31 Oct 2011 21:26:41 +0000 (14:26 -0700)]
testrados: specify CEPH_CONF directly

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agorgw: add user suspend/enable test
Yehuda Sadeh [Thu, 27 Oct 2011 19:11:28 +0000 (12:11 -0700)]
rgw: add user suspend/enable test

14 years agorgw: log-to-stderr is now a binary flag
Yehuda Sadeh [Thu, 27 Oct 2011 18:32:12 +0000 (11:32 -0700)]
rgw: log-to-stderr is now a binary flag

14 years agotestrados: rename testsnaps to testrados and make snap testing optional
Samuel Just [Mon, 24 Oct 2011 21:23:48 +0000 (14:23 -0700)]
testrados: rename testsnaps to testrados and make snap testing optional

Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
14 years agoworkunit: set PYTHONPATH so we can test python bindings
Josh Durgin [Mon, 24 Oct 2011 20:52:29 +0000 (13:52 -0700)]
workunit: set PYTHONPATH so we can test python bindings

14 years agoceph.conf: python parser doens't like ; comments
Sage Weil [Sun, 23 Oct 2011 17:30:27 +0000 (10:30 -0700)]
ceph.conf: python parser doens't like ; comments

14 years agoceph.conf: more frequent osd scrubbing; remove old cruft
Sage Weil [Sun, 23 Oct 2011 05:16:39 +0000 (22:16 -0700)]
ceph.conf: more frequent osd scrubbing; remove old cruft