Sam Lang [Thu, 27 Dec 2012 23:33:07 +0000 (17:33 -0600)]
task/pexec: Add barrier capability
This patch adds a barrier capability between
parallel exec tasks, so that all tasks wait for each
other and then perform the step that follows the
barrier at the same time.
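A minimal sketch of the barrier idea, not the pexec implementation itself. It assumes Python 3 and uses threading.Barrier in place of whatever primitive the task actually uses; run_with_barrier and worker are hypothetical names.

```python
import threading


def run_with_barrier(num_tasks):
    """Run num_tasks workers that all wait at a barrier before step two."""
    barrier = threading.Barrier(num_tasks)
    events = []  # shared log of (phase, task_id) tuples
    lock = threading.Lock()

    def worker(task_id):
        with lock:
            events.append(("before", task_id))
        # Blocks until every worker has reached this point.
        barrier.wait()
        with lock:
            events.append(("after", task_id))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_tasks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

No worker records its "after" step until every worker has recorded its "before" step, which is the ordering guarantee the barrier provides.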
Sam Lang [Fri, 14 Dec 2012 17:30:15 +0000 (07:30 -1000)]
task/pexec: More fixes for all case, exec on hosts
We don't want to do an exec per role, but per-host. We
were already doing an exec per host, but the names were confusing.
This fixes the names up and removes the role parameters.
Joe Buck [Thu, 6 Dec 2012 22:19:55 +0000 (14:19 -0800)]
Adding a Hadoop task.
This task configures and starts a Hadoop cluster.
It does not run any jobs; that must be done after
this task runs.
Can run on either Ceph or HDFS.
Joe Buck [Thu, 6 Dec 2012 22:18:41 +0000 (14:18 -0800)]
New ssh task that adds keys for node -> node ssh.
This generates a new keypair, pushes it to all nodes
in the context, and adds all hosts to all other hosts'
.ssh/authorized_keys files.
Cleans up all keys and authorized_keys entries
afterwards.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Josh Durgin [Tue, 20 Nov 2012 22:01:03 +0000 (14:01 -0800)]
xfstests: run in parallel on multiple machines
xfstests itself still seems to have some global dependencies that
make it hard to run more than one instance per node, so keep
the one client per node restriction.
Name the image after the client using it, and only run the
nested context managers once, so this task can work with
more than one client.
Samuel Just [Fri, 9 Nov 2012 00:22:40 +0000 (16:22 -0800)]
Add divergent_priors test
Tests scenario where merge_old_entry encounters a divergent
entry where the prior_version is prior to log_tail. This
is a problem since it will go into the missing set, but won't
be re-added to the missing set during read_log() if the node
restarts prior to recovering the object.
Sam Lang [Thu, 8 Nov 2012 14:55:36 +0000 (08:55 -0600)]
workunit: Move cleanup to separate run
Removing the scratch dir in the remote run command
at the end of the script invocation removes it as soon
as the first script finishes. Since the scratch dir may
be shared across workunit clients, we want to wait to
remove it until all the workunit scripts have
completed.
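A rough sketch of the ordering described above, not the workunit task's code. It runs everything locally with threads and a temporary directory; run_workunits and run_one are hypothetical names, and the real task runs scripts on remote hosts.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_workunits(client_names):
    """Run one stand-in script per client, then clean up in a separate step."""
    scratch = tempfile.mkdtemp(prefix="scratch-")

    def run_one(name):
        # Stand-in for a workunit script writing into the shared scratch dir.
        with open(os.path.join(scratch, name + ".out"), "w") as f:
            f.write("done\n")
        # The shared dir must still exist whenever any script finishes.
        return os.path.isdir(scratch)

    try:
        with ThreadPoolExecutor(max_workers=len(client_names)) as pool:
            results = list(pool.map(run_one, client_names))
    finally:
        # Separate cleanup step: runs only after every script has completed.
        shutil.rmtree(scratch)
    return scratch, results
```

Keeping the removal outside the per-script function is the point: no script can delete the directory out from under another one.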
Samuel Just [Wed, 7 Nov 2012 20:36:37 +0000 (12:36 -0800)]
ceph_manager: add test_min_size action
Thrasher can now test min_size with configurable frequency by
taking down all but one osd, waiting, killing that osd, bringing
back the others, and verifying that the cluster goes clean.
Alex Elder [Thu, 1 Nov 2012 18:32:56 +0000 (13:32 -0500)]
rbd task: support xfstests repeat count
This adds the ability to use the new repeat count argument to the
run_xfstests.sh script. By default, the test suite will be run
once. If a count is specified, the script will execute the suite
that many times but will perform the setup (building the
tests, etc.) only once.
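The setup-once, run-many behaviour can be sketched as follows. This is an illustration in Python rather than the run_xfstests.sh shell logic; run_suite, setup and run_tests are hypothetical names.

```python
def run_suite(setup, run_tests, count=1):
    """Call setup() once, then run_tests() count times; return the results."""
    if count < 1:
        raise ValueError("count must be at least 1")
    setup()  # e.g. building the tests; done only once
    return [run_tests() for _ in range(count)]
```

With the default count of 1 this behaves like a single plain run of the suite.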
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Allow scheduled jobs to use different teuthology branches
teuthology-[schedule|suite] get a parameter to specify the branch,
to put the job in a branch-specific queue. Workers running that
branch of teuthology can pull jobs from that queue.
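A small sketch of branch-specific queues, under stated assumptions: in-memory queues stand in for whatever job queue the scheduler actually uses, and BranchQueues, schedule, next_job and the default branch name "master" are all hypothetical.

```python
import queue
from collections import defaultdict


class BranchQueues:
    """Keep one FIFO job queue per teuthology branch."""

    def __init__(self):
        self._queues = defaultdict(queue.Queue)

    def schedule(self, job, branch="master"):
        # Jobs land in the queue named after the requested branch.
        self._queues[branch].put(job)

    def next_job(self, branch):
        # A worker only pulls from the queue for the branch it is running.
        try:
            return self._queues[branch].get_nowait()
        except queue.Empty:
            return None
```

A worker running one branch never sees jobs scheduled for another, which is the isolation the commit describes.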
Tommi Virtanen [Tue, 11 Sep 2012 18:11:39 +0000 (11:11 -0700)]
Don't lose tracebacks of exceptions raised in a greenlet.
Exception objects don't contain the traceback of where they were
raised from (to avoid cyclic data structures wrecking gc and causing
mem leaks), so the singular "raise obj" form creates a new traceback
from the current execution location, thus losing the original location
of the error.
Gevent explicitly wants to throw away the traceback, to release any
objects the greenlet may still be referring to, closing files,
releasing locks etc. In this case, we think it's safe, so stash the
exception info away in a holder object, and resurrect it on the other
side of the results queue.
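A sketch of the holder pattern, not the teuthology code. It assumes Python 3 and uses a thread in place of a greenlet; ExceptionHolder, worker, collect and failing_step are hypothetical names. Note that Python 3 exception objects do carry a ``__traceback__`` attribute, unlike the situation described above, so here the holder mainly illustrates the technique.

```python
import queue
import sys
import threading
import traceback


class ExceptionHolder:
    """Carries exception info across the results queue."""

    def __init__(self, exc_info):
        self.exc_info = exc_info


def worker(results, func):
    try:
        results.put(func())
    except Exception:
        # Stash (type, value, traceback) instead of the bare exception.
        results.put(ExceptionHolder(sys.exc_info()))


def collect(results):
    item = results.get()
    if isinstance(item, ExceptionHolder):
        _, exc, tb = item.exc_info
        # Re-raise with the original traceback attached.
        raise exc.with_traceback(tb)
    return item


def failing_step():
    raise ValueError("boom")
```

The traceback seen by the caller of collect() still includes the frame where the error was first raised.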
Tommi Virtanen [Mon, 13 Aug 2012 23:10:05 +0000 (16:10 -0700)]
Disable asynchronous DNS lookups.
Especially on older hosts, we keep triggering errors::
    ServerNotFoundError: Unable to find the server at
    teuthology.front.sepia.ceph.com: [Errno 3] name does not exist
That comes from libevent's evdns via gevent.dns and httplib2. The rate
of these errors is low enough that they seem to be timeouts, or
perhaps something more arbitrary. Busy looping on DNS resolution calls
has never triggered them, so far.
With ``monkey.patch_all(dns=False)``, the teuthology process will
block as a whole whenever doing DNS resolution. This will hopefully be
rare enough that it won't matter.
The only real "fix" seems to be upgrading libraries and hoping for the
best; this commit can be reverted after that is done.