I'm seeing failed tasks (and nuke) leak machines. It looks like the
'... reboot -f -n' command is raising an exception. We should ignore
that exception and go on to wait for the machine to restart.

For example:

http://qa-proxy.ceph.com/teuthology/sage-2013-12-08_19:25:06-rados:thrash-wip-tier-foo-basic-plana/136321/teuthology.log

Signed-off-by: Sage Weil <sage@inktank.com>
     nodes = {}
     for remote in remotes:
         log.info('rebooting %s', remote.name)
-        proc = remote.run( # note use of -n to force a no-sync reboot
-            args=[
-                'sync',
-                run.Raw('&'),
-                'sleep', '5',
-                run.Raw(';'),
-                'sudo', 'reboot', '-f', '-n'
-                ],
-            wait=False
-            )
+        try:
+            proc = remote.run( # note use of -n to force a no-sync reboot
+                args=[
+                    'sync',
+                    run.Raw('&'),
+                    'sleep', '5',
+                    run.Raw(';'),
+                    'sudo', 'reboot', '-f', '-n'
+                    ],
+                wait=False
+                )
+        except Exception:
+            log.exception('ignoring exception during reboot command')
         nodes[remote] = proc
         # we just ignore these procs because reboot -f doesn't actually
         # send anything back to the ssh client!
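
One possible follow-up, sketched here rather than taken from this patch: if remote.run() raises, proc is never assigned on that iteration, so nodes[remote] = proc either fails with a NameError (first remote) or silently reuses the previous remote's proc. A minimal sketch that resets proc to None before the try might look like the following. FakeRemote and start_reboots are hypothetical names used only for illustration, and plain strings stand in for the run.Raw('&') and run.Raw(';') wrappers used in the real code.

```python
import logging

log = logging.getLogger(__name__)


class FakeRemote:
    """Hypothetical stand-in for a teuthology remote (not the real class)."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def run(self, args, wait=True):
        # Simulate the ssh connection dying while the reboot is issued.
        if self.fail:
            raise RuntimeError('connection dropped by %s' % self.name)
        return 'proc-for-%s' % self.name


def start_reboots(remotes):
    """Issue the reboot on every remote, tolerating per-remote failures."""
    nodes = {}
    for remote in remotes:
        log.info('rebooting %s', remote.name)
        # Reset on every iteration so a failure never leaves proc unbound
        # or pointing at the previous remote's process.
        proc = None
        try:
            proc = remote.run(
                # The real code wraps '&' and ';' in run.Raw(); plain
                # strings are used here because FakeRemote ignores args.
                args=['sync', '&', 'sleep', '5', ';',
                      'sudo', 'reboot', '-f', '-n'],
                wait=False,
            )
        except Exception:
            log.exception('ignoring exception during reboot command')
        nodes[remote] = proc
    return nodes
```

With this shape, every remote still ends up in nodes, so the later wait-for-restart step (assumed here to iterate over the keys of nodes) would cover machines whose reboot command raised.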