Sandon Van Ness [Fri, 8 Feb 2013 00:34:14 +0000 (16:34 -0800)]
Merge to include --machine-type and changes to --summary
Added the ability to support multiple types of machines with
--machine-type added to teuthology-lock when used with --lock-many
or --machine-type with teuthology --lock (automated tests). It
defaults to 'plana' and the 'vps' type is currently unused but
should be in the future.
Also updated teutholoy-lock --summary to be machine type aware
and sort things in a nice output.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sandon Van Ness [Tue, 5 Feb 2013 20:53:08 +0000 (12:53 -0800)]
Added support for multiple types of machines.
Added the ability to support multiple types of machines with
--machine-type added to teuthology-lock when used with --lock-many
or --machine-type with teuthology --lock (automated tests). It
defaults to 'plana' and the 'vps' type is currently unused but
should be in the future.
Signed-off-by: Sandon Van Ness <sandon@van-ness.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Josh Durgin [Wed, 6 Feb 2013 07:31:37 +0000 (23:31 -0800)]
nuke: don't try unmount if we're rebooting everything anyway
This can cause issues when unmount hangs. Our automatic runs reboot
everything unconditionally, so this caused a bunch of unecessary hangs
when an fs was accidentally rendered un-unmountable.
Sam Lang [Tue, 5 Feb 2013 22:20:52 +0000 (16:20 -0600)]
misc: Close connections on reboot
When nodes are rebooted, the connections remain open
even after calling reconnect and setting up new ssh
sessions to the rebooted nodes. This causes ECONNRESET
errors to show up in the teuthology output.
Close the existing connections before trying to reconnect.
Sam Lang [Tue, 5 Feb 2013 16:38:48 +0000 (10:38 -0600)]
task/ceph_manager: Fix NoneType config issue
kill_mon is getting a config set to None, which blows
up now due to the check for powercycle. Initialize
the config to an empty dict if we don't get anything
on init. This is the error showing up in teuthology:
2013-02-04T15:04:16.595 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x1fcafd0>
Traceback (most recent call last):
File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 45, in run_tasks
suppress = manager.__exit__(*exc_info)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/var/lib/teuthworker/teuthology-master/teuthology/task/mon_thrash.py", line 142, in task
thrash_proc.do_join()
File "/var/lib/teuthworker/teuthology-master/teuthology/task/mon_thrash.py", line 69, in do_join
self.thread.get()
File "/var/lib/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
raise self._exception
AttributeError: 'NoneType' object has no attribute 'get'
Sage Weil [Sun, 3 Feb 2013 05:01:08 +0000 (21:01 -0800)]
ceph_manager: use get() for self.config powercycle checks
I think this is what is going on...
Traceback (most recent call last):
File "/var/lib/teuthworker/teuthology-master/teuthology/contextutil.py", line 27, in nested
yield vars
File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph.py", line 1158, in task
yield
File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 25, in run_tasks
manager = _run_one_task(taskname, ctx=ctx, config=config)
File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 14, in _run_one_task
return fn(**kwargs)
File "/var/lib/teuthworker/teuthology-master/teuthology/task/dump_stuck.py", line 93, in task
manager.kill_osd(id_)
File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph_manager.py", line 665, in kill_osd
if 'powercycle' in self.config and self.config['powercycle']:
TypeError: argument of type 'NoneType' is not iterable
Sam Lang [Fri, 1 Feb 2013 17:45:04 +0000 (11:45 -0600)]
nuke: Fix cleanup of test dir
Nuke used to remove /tmp/cephtest, now it tries to
remove the test dir, which it may not have the name
for. Instead of removing the test dir, we just
remove the base directory for all test directories,
which may or may not be /tmp/cephtest.
Sam Lang [Fri, 1 Feb 2013 13:46:04 +0000 (07:46 -0600)]
run.py: Fix argument parsing for --name
With the addition of the --name argument to the
teuthology program (run.py), jobs were failing
because --name was being treated as a non-arg
option, even though the name was being supplied
by the workers. Fix that and give it a metavar.
Sam Lang [Wed, 23 Jan 2013 02:27:41 +0000 (20:27 -0600)]
Assign devices to osds using the device wwn
Linux doesn't guarantee device names (/dev/sdb, etc.)
are always mapped to the same disk. Instead of assigning
nominal devices to osds, we map devices by their wwn
(/dev/disk/by-id/wwn-*) to an osd (both data and journal).
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sam Lang [Wed, 23 Jan 2013 20:37:39 +0000 (14:37 -0600)]
Replace /tmp/cephtest/ with configurable path
Teuthology uses /tmp/cephtest/ as the scratch test directory for
a run. This patch replaces /tmp/cephtest/ everywhere with a
per-run directory: {basedir}/{rundir} where {basedir} is a directory
configured in .teuthology.yaml (/tmp/cephtest if not specified),
and {rundir} is the name of the run, as given in --name. If no name
is specified, {user}-{timestamp} is used.
To get the old behavior (/tmp/cephtest), set test_path: /tmp/cephtest
in .teuthology.yaml.
This change was modivated by #3782, which requires a test dir that
survives across reboots, but also resolves #3767.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sam Lang [Thu, 31 Jan 2013 13:56:56 +0000 (07:56 -0600)]
Scripts to use pyflakes to check python syntax.
pyflakes runs a basic syntax checker against python code.
The added check-syntax.sh script and Makefile run pyflakes
on the python code within the teuthology directory reporting
any syntax errors that are found.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
task: mon_clock_skew_check: use absolute value when comparing mon_skew
The monitors may report either positive or negative clock skews, and by
not using an absolute value we were constantly ignoring reported negative
clock skews.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
task: mon_clock_skew_check: mark as ran once if an expected skew was found
... even if we didn't get a clean/finished result from the monitors
This ought to significantly cut the waiting time if something else (or
someone else) is leaving the leader hanging thus unable to finish a given
timecheck round.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
task: mon_clock_skew_check: increase timeout and kick it off only on stop
We were kicking-off the timeout as soon as we started; it's better however
to kick if off only when we are told to stop (as long as 'at-least-once'
is true).
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Sandon Van Ness [Wed, 23 Jan 2013 22:08:53 +0000 (14:08 -0800)]
Use ceph git repo instead of github.
This code change is so that instead of pulling the tarball of github
which can be unreliable at times it instead uses the ceph repo mirror
and serves as the same function. Now it is using git archive and no
longer uses wget. Because of this less tar-fu is needed to extract
the necessary files as it can be done directly through git archive.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>
Joe Buck [Mon, 14 Jan 2013 20:09:56 +0000 (12:09 -0800)]
test: create /tmp/cephtest/mnt.{id}
The workunit task assumes that a mount exists
at /tmp/cephtest/mnt.{id}
This patch creates the path if it doesn't
exist, enabling workunits to run in the absense
of kclient or ceph-fuse tasks.
Signed-off-by: Joe Buck <jbbuck@gmail.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>
task: mon_clock_skew_check: add option to run at least one timecheck
at-least-once Runs at least once, even if we are told to stop.
(default: True)
at-least-once-timeout If we were told to stop but we are attempting to
run at least once, timeout after this many
seconds. (default: 300)
Fixes: #3854 Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Alex Elder [Fri, 18 Jan 2013 18:47:34 +0000 (12:47 -0600)]
rbd.py: update scratch and test image sizes
Test 167 was failing due to running out of space on the scratch
file system. The test reserves 21MB in a file, and repeats 50
times. It required just over 1GB, so I bumped the default size
for the testing device to 1200 MB. I increased the test device
size as well.
This resolves http://tracker.newdream.net/issues/3864.
Loic Dachary [Wed, 16 Jan 2013 10:48:05 +0000 (11:48 +0100)]
When running teuthology with targets provisionned on OpenStack and kvm, the disks will show under /dev/vda, /dev/vdb etc. Add them to the list of devices to inspect and use for tests.
Josh Durgin [Tue, 15 Jan 2013 01:55:43 +0000 (17:55 -0800)]
Add cram task
This runs cram tests, which are an easy way to test output
stays consistent. We already use cram for basic cli tests with no cluster,
and now we can use it for whole system tests too.
Sage Weil [Mon, 14 Jan 2013 06:52:00 +0000 (22:52 -0800)]
ceph.conf: separate replicas across osds
ceph.git master now separates across crush hosts without this setting.
For teuthology clusters, we don't want that (unless the tests specifies
otherwise).
Sam Lang [Thu, 27 Dec 2012 23:33:07 +0000 (17:33 -0600)]
task/pexec: Add barrier capability
This patch adds the ability to barrier between
parallel exec tasks so that all tasks will perform
the following step (after the barrier) at the same
time.
Sam Lang [Fri, 14 Dec 2012 17:30:15 +0000 (07:30 -1000)]
task/pexec: More fixes for all case, exec on hosts
We don't want to do an exec per role, but per-host. We
were already doing an exec per host, but the names were confusing.
This fixes the names up and removes the role parameters.
task: mon_clock_skew_check.py: Check for clock skews on the monitors
Will run for as long as teuthology runs. By default, fails if any clock
skews higher than 0.05 seconds are detected, but will only fail when the
teuthology run finishes and after reporting a list of all the detected
skews.
Accepted options:
interval amount of seconds to wait in-between checks. (default: 30.0)
max-skew maximum skew, in seconds, that is considered tolerable
before issuing a warning. (default: 0.05)
expect-skew 'true' or 'false', to indicate whether to expect a skew
during the run or not. If 'true', the test will fail if no
skew is found, and succeed if a skew is indeed found; if
'false', it's the other way around. (default: false)
never-fail Don't fail the run if a skew is detected and we weren't
expecting it, or if no skew is detected and we were
expecting it. (default: False)
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Joe Buck [Thu, 6 Dec 2012 22:19:55 +0000 (14:19 -0800)]
Adding a Hadoop task.
This task configures and starts a Hadoop cluster.
It does not run any jobs, that must be done after
this task runs.
Can run on either Ceph or HDFS.
Joe Buck [Thu, 6 Dec 2012 22:18:41 +0000 (14:18 -0800)]
New ssh task that adds keys for node -> node ssh.
This generates a new keypair, pushes it to all nodes
in the context and adds all hosts to all other hosts
.ssh/authorized_keys file.
Cleans up all keys and authorized_keys entries
afterwards.
Signed-off-by: Joe Buck <jbbuck@gmail.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>