Zack Cerza [Fri, 17 Mar 2017 20:27:51 +0000 (14:27 -0600)]
add_remotes: Correctly map remotes to roles
We used to use the 'targets' object to map remotes to roles. This
worked fine before multi-OS locking, but broke down because of the
unordered nature of dicts.
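A minimal sketch of the idea, assuming roles are a list of role groups and
the locked hosts are available as an explicitly ordered list. The function
name and hostnames are hypothetical and are not teuthology's actual API; the
point is only that the pairing is positional rather than dependent on dict
iteration order.

```python
def map_remotes_to_roles(roles, hostnames):
    """Pair each role group with a hostname by position.

    Hypothetical helper: 'hostnames' must already be in the same order
    as 'roles', so nothing depends on dict iteration order.
    """
    if len(roles) != len(hostnames):
        raise ValueError("need exactly one host per role group")
    return {host: list(group) for group, host in zip(roles, hostnames)}


roles = [['osd.0', 'mon.a'], ['client.0']]
hostnames = ['ubuntu-node', 'centos-node']  # made-up names
mapping = map_remotes_to_roles(roles, hostnames)
```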
Zack Cerza [Tue, 14 Mar 2017 19:13:13 +0000 (13:13 -0600)]
Allow locking nodes with mixed OSes
Instead of either not specifying an OS type/version and just getting
what's available - or requesting the same OS type/version for all nodes
in the job, allow requesting arbitrary mixes of OSes.
The path forward is going to be replacing things like:
roles:
- ['osd.0', 'mon.a']
- ['osd.1', 'osd.2']
- ['client.0']
Zack Cerza [Fri, 10 Mar 2017 16:41:46 +0000 (09:41 -0700)]
orchestra.run: Notice when short-lived procs exit
If a command exits immediately, there is a race between greenlet
completion (which flushes the ChannelFile buffers) and the call to
exit_status_ready(). Waiting for 0.1s on the greenlets removes the race
condition.
Dan Mick [Mon, 6 Mar 2017 23:41:36 +0000 (15:41 -0800)]
downburst: always log output and error, check returncode for failure
The more info the better; always log everything about the downburst
execution to the teuthology log. Check for command failure by
checking for returncode != 0 rather than "presence of stderr", since
logging always happens to stderr.
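A rough sketch of the check described above, using only the standard
library. The helper name is hypothetical and the child commands are
stand-ins for downburst; this is an illustration, not the code from the
commit.

```python
import logging
import subprocess
import sys

log = logging.getLogger(__name__)


def run_and_check(args):
    """Run a command, log everything it printed, and judge success
    by exit status alone (stderr output is not treated as failure)."""
    proc = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if out:
        log.info("stdout: %s", out.decode(errors='replace'))
    if err:
        log.info("stderr: %s", err.decode(errors='replace'))
    return proc.returncode == 0


# A child that writes to stderr but exits 0 still counts as success.
ok_despite_stderr = run_and_check(
    [sys.executable, '-c', 'import sys; sys.stderr.write("log line")'])
# A child that exits non-zero counts as failure.
failed = run_and_check([sys.executable, '-c', 'import sys; sys.exit(3)'])
```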
Dan Mick [Mon, 6 Mar 2017 23:40:05 +0000 (15:40 -0800)]
provision: invoke downburst with -v and --logfile
Verbose output isn't verbose enough to matter, and can be helpful
tracking down weirdness. Also, log to private file in case
downburst hangs mid-operation, to avoid having to do any
select() madness in teuthology.
Zack Cerza [Wed, 1 Mar 2017 23:20:35 +0000 (16:20 -0700)]
cloud.openstack: Also retry on BaseHTTPError
We attach volumes immediately after creating them; sometimes they are
still momentarily in the 'creating' state, causing the attach call to
throw a BaseHTTPError. When that happens, simply retry the request
instead of failing node creation, starting the entire cycle all over
again.
Zack Cerza [Fri, 10 Feb 2017 17:25:53 +0000 (10:25 -0700)]
cloud.openstack: Cache authentication tokens
Constantly causing Keystone to regenerate auth tokens was the cause of
our hitting rate limits during testing. This will let us reuse auth
tokens - including across processes - to avoid hitting those limits.
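One way to share a token across processes is a small file-backed cache.
The sketch below assumes a JSON file holding the token and its expiry time;
the file layout, function name, and 'fetch' callback are all hypothetical
and are not taken from the commit.

```python
import json
import os
import tempfile
import time


def get_token(cache_path, fetch, now=time.time):
    """Return a cached token if it has not expired; otherwise call
    fetch() (which returns (token, expires_epoch)) and cache the result."""
    try:
        with open(cache_path) as f:
            cached = json.load(f)
        if cached['expires'] > now():
            return cached['token']
    except (OSError, ValueError, KeyError):
        pass  # missing, unreadable, or malformed cache: fetch a new token
    token, expires = fetch()
    # Write to a per-process temp file, then rename over the cache file,
    # so other processes never read a half-written file.
    tmp = '%s.%d.tmp' % (cache_path, os.getpid())
    with open(tmp, 'w') as f:
        json.dump({'token': token, 'expires': expires}, f)
    os.replace(tmp, cache_path)
    return token


calls = []


def fake_fetch():
    # Stand-in for a real authentication request.
    calls.append(1)
    return 'token-%d' % len(calls), time.time() + 3600


cache_path = os.path.join(tempfile.mkdtemp(), 'token.json')
first = get_token(cache_path, fake_fetch)
second = get_token(cache_path, fake_fetch)  # served from the cache file
```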
Zack Cerza [Mon, 6 Feb 2017 21:16:13 +0000 (14:16 -0700)]
cloud: Retry failed requests in libcloud
It's common to see "429 Rate limit exceeded", at least with OVH. When we
encounter the exception associated with that error, back off and
retry for an interval before eventually giving up.
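A generic sketch of back-off-and-retry with an overall time limit. The
exception class, function name, and delay values are assumptions made for
illustration; the actual commit hooks into libcloud rather than wrapping
calls like this.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception raised on HTTP 429."""


def retry_with_backoff(func, retry_on=(RateLimitError,),
                       initial_delay=1.0, max_delay=30.0, timeout=300.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds, doubling the wait after each
    failure, and re-raise once the next wait would exceed the timeout."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            return func()
        except retry_on:
            if clock() + delay > deadline:
                raise  # give up: out of time
            sleep(delay)
            delay = min(delay * 2, max_delay)


attempts = []


def flaky():
    # Fails twice with a rate-limit error, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("429 Rate limit exceeded")
    return "ok"


delays = []
# Record the requested delays instead of actually sleeping.
result = retry_with_backoff(flaky, sleep=delays.append)
```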
Zack Cerza [Thu, 8 Dec 2016 03:30:57 +0000 (20:30 -0700)]
Add libcloud backend
Initially this supports OpenStack but will grow to support other methods
of cloud-like deployment. Some assumptions are made regarding supporting
infrastructure (FIXME document these)
Ilya Dryomov [Fri, 17 Feb 2017 11:56:20 +0000 (12:56 +0100)]
run: allow using alternate suite repo
Do the same thing we do for ceph repo to make ceph.git commit
1f82b9b9446d ("qa/tasks/workunit: use the suite repo for cloning
workunit") work for scheduled jobs.
Ilya Dryomov [Tue, 7 Feb 2017 09:55:45 +0000 (10:55 +0100)]
console: force existing connections into spy mode if !readonly
If someone watching the console didn't think of using "console -s", we
end up power cycling the node in an attempt to get the login prompt.
This is futile -- if the watcher is still there after the node comes
back up, our connection will get dropped to spy mode again.
Use -f to temporarily force existing connections into spy mode when we
attach, which saves a power cycle.
Ilya Dryomov [Fri, 3 Feb 2017 09:59:43 +0000 (10:59 +0100)]
nuke: improve stale_kernel_mount() check
Commit 7db9e8b76fd5 ("nuke: bring stale kernel client handling back")
resurrected the check that was removed in commit 1d47a121b385 ("Fix
nuke, redo some cleanup functions"). It isn't sufficient though -- for
example, if a workunit already issued a umount, /etc/mtab won't have
a '^/dev/rbd' entry.
debugfs is enabled and mounted on all distros we care about.
Ilya Dryomov [Wed, 1 Feb 2017 19:37:49 +0000 (20:37 +0100)]
nuke: drop remove_kernel_mounts()
Calling remove_kernel_mounts() after reboot() is pretty useless. Also,
as explained in the previous commit, there isn't much we can do in the
krbd case, so just drop it.
Ilya Dryomov [Wed, 1 Feb 2017 19:37:49 +0000 (20:37 +0100)]
nuke: bring stale kernel client handling back
Commit 1d47a121b385 ("Fix nuke, redo some cleanup functions") broke
stale kernel client map/mount handling by dropping reboot arguments.
While for kcephfs we can use 'umount -f' to avoid sync (it used to not
work, but is mostly fixed now, I believe), currently there is nothing
we can do for a local filesystem mounted on top of krbd.
Dan Mick [Fri, 13 Jan 2017 22:49:05 +0000 (14:49 -0800)]
Use clock by default (instead of clock.check)
We're seeing clocks desynchronized. My theory is that this might
be because ntp can take five minutes or more to actually sync the
clocks, and clock.check doesn't do any setting of the clocks, just
reporting. clock, OTOH, stops ntpd and does an ntpdate, and then
restarts ntpd, which should kickstart it with a much-closer-to-correct
time.
Zack Cerza [Mon, 16 Jan 2017 23:16:41 +0000 (16:16 -0700)]
worker: Create job archive directories
... not just run archive directories. This is to resolve a race
condition between the job creating its archive directory and the worker
symlinking its log into that directory.
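A small sketch of the ordering that avoids the race: make sure the job's
directory exists before creating the symlink inside it. The helper names,
directory layout, and 'worker.log' link name are assumptions for
illustration, not the worker's actual code.

```python
import os
import tempfile


def ensure_job_archive(archive_base, run_name, job_id):
    """Create the per-job archive directory (and any missing parents).
    Safe to call even if the job has already created it."""
    job_dir = os.path.join(archive_base, run_name, str(job_id))
    os.makedirs(job_dir, exist_ok=True)
    return job_dir


def link_worker_log(worker_log, job_dir):
    """Symlink the worker's log into the job directory, if not present."""
    link = os.path.join(job_dir, 'worker.log')
    if not os.path.lexists(link):
        os.symlink(worker_log, link)
    return link


base = tempfile.mkdtemp()
job_dir = ensure_job_archive(base, 'example-run', 42)
job_dir_again = ensure_job_archive(base, 'example-run', 42)  # no error
link = link_worker_log(os.path.join(base, 'worker.example.log'), job_dir)
```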