Dan Mick [Mon, 6 Mar 2017 23:41:36 +0000 (15:41 -0800)]
downburst: always log output and error, check returncode for failure
The more info the better; always log everything about the downburst
execution to the teuthology log. Detect command failure by checking
for returncode != 0 rather than for the presence of stderr output,
since logging always goes to stderr.
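The distinction matters because a tool that logs to stderr produces stderr output on every run, successful or not. A minimal sketch of the idea (the function name and logging calls are illustrative, not the actual provisioning code):

```python
import subprocess
import sys

def run_downburst(args):
    """Run a downburst command, always logging both output streams,
    and judge success by exit status alone."""
    proc = subprocess.run(args, capture_output=True, text=True)
    # downburst logs to stderr even on success, so stderr output is
    # not an error signal; only a non-zero returncode is.
    print("downburst stdout: %r" % proc.stdout)
    print("downburst stderr: %r" % proc.stderr)
    return proc.returncode == 0
```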
Dan Mick [Mon, 6 Mar 2017 23:40:05 +0000 (15:40 -0800)]
provision: invoke downburst with -v and --logfile
The verbose output isn't voluminous enough to be a problem, and can be
helpful when tracking down weirdness. Also, log to a private file in
case downburst hangs mid-operation, to avoid having to do any
select() madness in teuthology.
Zack Cerza [Wed, 1 Mar 2017 23:20:35 +0000 (16:20 -0700)]
cloud.openstack: Also retry on BaseHTTPError
We attach volumes immediately after creating them; sometimes they are
still momentarily in the 'creating' state, causing the attach call to
throw a BaseHTTPError. When that happens, simply retry the request
instead of failing node creation, which would start the entire cycle
all over again.
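The shape of the fix can be sketched as a bounded retry loop; the exception class here is a stand-in for libcloud's, and attach_with_retry is an illustrative name, not the actual code:

```python
import time

class BaseHTTPError(Exception):
    """Stand-in for libcloud.common.exceptions.BaseHTTPError."""

def attach_with_retry(attach_fn, attempts=5, delay=1.0):
    """Retry an attach call that can fail with BaseHTTPError while
    the freshly created volume is still in the 'creating' state."""
    for attempt in range(attempts):
        try:
            return attach_fn()
        except BaseHTTPError:
            if attempt == attempts - 1:
                raise  # give up; let node creation fail as before
            time.sleep(delay)
```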
Zack Cerza [Fri, 10 Feb 2017 17:25:53 +0000 (10:25 -0700)]
cloud.openstack: Cache authentication tokens
Constantly causing Keystone to regenerate auth tokens was the cause of
our hitting rate limits during testing. This will let us reuse auth
tokens - including across processes - to avoid hitting those limits.
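Sharing tokens across processes implies persisting them somewhere all processes can see, e.g. a small JSON file keyed with an expiry. A hedged sketch of that pattern (get_token, the file format, and the (token, expiry) shape are assumptions for illustration, not the actual cache implementation):

```python
import json
import os
import time

def get_token(issue_fn, cache_path):
    """Return a cached Keystone token if it is still valid; otherwise
    call issue_fn() for a fresh (token, expiry) pair and persist it so
    other processes can reuse it instead of hitting Keystone again."""
    try:
        with open(cache_path) as f:
            cached = json.load(f)
        if cached['expires'] > time.time():
            return cached['token']
    except (OSError, ValueError, KeyError):
        pass  # no cache yet, or it is unreadable/incomplete
    token, expires = issue_fn()
    with open(cache_path, 'w') as f:
        json.dump({'token': token, 'expires': expires}, f)
    return token
```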
Zack Cerza [Mon, 6 Feb 2017 21:16:13 +0000 (14:16 -0700)]
cloud: Retry failed requests in libcloud
It's common to see "429 Rate limit exceeded", at least with OVH. When
we encounter the exception associated with that error, back off and
retry for an interval before eventually giving up.
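"Back off and retry for an interval" typically means exponential backoff with a time budget rather than a fixed attempt count. A sketch of that logic (the exception class is a stand-in for libcloud's rate-limit exception, and the function is illustrative):

```python
import time

class RateLimitReachedError(Exception):
    """Stand-in for libcloud's exception for HTTP 429 responses."""

def request_with_backoff(request_fn, max_wait=60, initial_delay=1.0):
    """Retry request_fn with exponential backoff on rate-limit errors,
    giving up once roughly max_wait seconds have been spent waiting."""
    delay, waited = initial_delay, 0.0
    while True:
        try:
            return request_fn()
        except RateLimitReachedError:
            if waited >= max_wait:
                raise  # budget exhausted: propagate the failure
            time.sleep(delay)
            waited += delay
            delay *= 2
```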
Zack Cerza [Thu, 8 Dec 2016 03:30:57 +0000 (20:30 -0700)]
Add libcloud backend
Initially this supports OpenStack but will grow to support other methods
of cloud-like deployment. Some assumptions are made regarding supporting
infrastructure (FIXME document these)
Ilya Dryomov [Fri, 17 Feb 2017 11:56:20 +0000 (12:56 +0100)]
run: allow using alternate suite repo
Do the same thing we do for the ceph repo to make ceph.git commit
1f82b9b9446d ("qa/tasks/workunit: use the suite repo for cloning
workunit") work for scheduled jobs.
Ilya Dryomov [Tue, 7 Feb 2017 09:55:45 +0000 (10:55 +0100)]
console: force existing connections into spy mode if !readonly
If someone watching the console didn't think of using "console -s", we
end up power cycling the node in an attempt to get the login prompt.
This is futile -- if the watcher is still there after the node comes
back up, our connection will get dropped to spy mode again.
Use -f to temporarily force existing connections into spy mode when we
attach to save a power cycle.
Ilya Dryomov [Fri, 3 Feb 2017 09:59:43 +0000 (10:59 +0100)]
nuke: improve stale_kernel_mount() check
Commit 7db9e8b76fd5 ("nuke: bring stale kernel client handling back")
resurrected the check that was removed in commit 1d47a121b385 ("Fix
nuke, redo some cleanup functions"). It isn't sufficient though -- for
example, if a workunit already issued a umount, /etc/mtab won't have
a '^/dev/rbd' entry.
debugfs is enabled and mounted on all distros we care about.
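Since each live kernel client registers itself under debugfs, the check can look there instead of at /etc/mtab. A sketch of that idea (the function name and injectable listdir are illustrative; only the /sys/kernel/debug/ceph path comes from how the kernel client actually exposes itself):

```python
import os

DEBUGFS_CEPH = '/sys/kernel/debug/ceph'

def stale_kernel_client(listdir=os.listdir):
    """Detect a lingering kernel ceph client via debugfs.  This works
    even when /etc/mtab has no '^/dev/rbd' entry because a workunit
    already issued the umount."""
    try:
        return len(listdir(DEBUGFS_CEPH)) > 0
    except OSError:
        return False  # debugfs dir absent: no kernel client
```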
Ilya Dryomov [Wed, 1 Feb 2017 19:37:49 +0000 (20:37 +0100)]
nuke: drop remove_kernel_mounts()
Calling remove_kernel_mounts() after reboot() is pretty useless. Also,
as explained in the previous commit, there isn't much we can do in the
krbd case, so just drop it.
Ilya Dryomov [Wed, 1 Feb 2017 19:37:49 +0000 (20:37 +0100)]
nuke: bring stale kernel client handling back
Commit 1d47a121b385 ("Fix nuke, redo some cleanup functions") broke
stale kernel client map/mount handling by dropping reboot arguments.
While for kcephfs we can use 'umount -f' to avoid sync (it used to not
work, but is mostly fixed now, I believe), currently there is nothing
we can do for a local filesystem mounted on top of krbd.
Dan Mick [Fri, 13 Jan 2017 22:49:05 +0000 (14:49 -0800)]
Use clock by default (instead of clock.check)
We're seeing clocks desynchronized. My theory is that this might
be because ntp can take five minutes or more to actually sync the
clocks, and clock.check doesn't do any setting of the clocks, just
reporting. clock, OTOH, stops ntpd and does an ntpdate, and then
restarts ntpd, which should kickstart it with a much-closer-to-correct
time.
Zack Cerza [Mon, 16 Jan 2017 23:16:41 +0000 (16:16 -0700)]
worker: Create job archive directories
... not just run archive directories. This is to resolve a race
condition between the job creating its archive directory and the worker
symlinking its log into that directory.
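Races like this are usually closed by making directory creation idempotent, so whichever process gets there first, the other's call is a harmless no-op. A minimal sketch (the function name and path layout are assumptions, not the worker's actual code):

```python
import os

def ensure_job_archive(run_archive, job_id):
    """Create the job's archive directory up front.  exist_ok makes
    the call safe regardless of whether the worker or the job wins
    the race to create it first."""
    path = os.path.join(run_archive, str(job_id))
    os.makedirs(path, exist_ok=True)
    return path
```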
Nathan Cutler [Tue, 27 Dec 2016 10:43:15 +0000 (11:43 +0100)]
nuke: Use pkill -KILL to unconditionally wipe out hadoop processes
In CentOS 7, the command "ps -ef | grep 'java.*hadoop' | grep -v grep | awk
'{print $2}' | xargs kill -9" produces undesirable output when no matching
processes exist.
Note: the "-f" option to pkill mimics the semantics of "ps -ef". For
example, "ps -ef | grep 'java.*hadoop'" will match a process whose
command line is "sh java343hadoop", but pkill will only match that
process when given the -f option.
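The difference between the two matching modes can be modeled in a few lines; this is an illustrative model of pkill's behavior, not code from the change:

```python
import re

def pkill_would_match(pattern, comm, cmdline, full=False):
    """Model pkill's matching rules: without -f only the process name
    (comm) is tested; with -f the whole command line is tested, which
    mirrors what 'ps -ef | grep' sees."""
    return re.search(pattern, cmdline if full else comm) is not None
```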
Dan Mick [Tue, 20 Dec 2016 03:44:52 +0000 (19:44 -0800)]
spawn_sol_log: use sys.executable to find bin/python
An activated virtualenv sets PATH so that /usr/bin/env finds the
virtualenv's Python. A binary run directly from virtualenv/bin doesn't
modify PATH. Use sys.executable to handle both invocation methods.
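The pattern is simply to spawn the child with the interpreter that is running the current process; spawn_helper below is an illustrative name, not the actual function:

```python
import subprocess
import sys

def spawn_helper(py_args):
    """Start a helper with the interpreter running this process.
    sys.executable resolves correctly both when a virtualenv is
    activated (PATH points at venv/bin) and when a venv/bin script
    is executed directly (PATH is untouched)."""
    return subprocess.Popen([sys.executable] + list(py_args))
```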
Fixes: http://tracker.ceph.com/issues/17986
Signed-off-by: Dan Mick <dan.mick@redhat.com>