David Galloway [Fri, 9 Mar 2018 21:51:25 +0000 (16:51 -0500)]
cobbler: Have rc.local output go to console
Usually if something goes wrong during the rc.local run, the machine
won't be reachable over the network for debugging. Additionally, since we
now reimage every machine before each job, it's impossible to debug why
rc.local failed for a particular job. This sends rc.local output to the
tty specified in kernel_options so we can see it in the `$hostname_reimage` run logs.
Signed-off-by: David Galloway <dgallowa@redhat.com>
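A minimal sketch of the redirection idea in the commit above. The actual snippet takes the tty from Cobbler's kernel_options; this sketch instead assumes the device can be read from the last `console=` argument of the kernel command line. The `console_tty` helper is hypothetical and not taken from the repository.

```shell
#!/bin/sh
# Hypothetical helper: print the device named by the last console= argument
# of a kernel command line, e.g. "console=ttyS1,115200" -> "/dev/ttyS1".
# Returns non-zero and prints nothing if no console= argument is present.
console_tty() {
    tty=""
    for arg in $1; do
        case "$arg" in
            console=*)
                tty="${arg#console=}"   # drop the "console=" prefix
                tty="${tty%%,*}"        # drop options such as ",115200"
                ;;
        esac
    done
    [ -n "$tty" ] && printf '/dev/%s\n' "$tty"
}

# In rc.local one could then send all further output to that device
# (an assumption about usage, not the repository's code):
#   dev=$(console_tty "$(cat /proc/cmdline)") && exec > "$dev" 2>&1
```

With output going to the serial console, whatever rc.local prints shows up in the same log that captures the reimage run.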
David Galloway [Fri, 9 Mar 2018 19:58:08 +0000 (14:58 -0500)]
cobbler: Write exact /etc/default/grub
This fixes console output on Xenial and later. Prior to this, the
Plymouth boot screen would get loaded and "[37mUbuntu 16.04[-1;-1f[33m.
[37m. [37m. [37m." would get repeated to the console until the login
prompt shows up.
Writing our own file instead of finding and replacing variables makes
sure the settings are exactly what we want.
This snippet is only used on Debian-based distros. The default Cobbler
snippet is used on RPM-based distros.
Signed-off-by: David Galloway <dgallowa@redhat.com>
David Galloway [Tue, 20 Mar 2018 15:22:53 +0000 (11:22 -0400)]
cobbler: Change method used to ping Cobbler host in rc.local
I've observed a *very* occasional race condition where dhclient
completes but the host can't ping Cobbler. Instead of timing out
waiting for a single ping packet to return, we'll retry the ping up to
$attempts times and then give up.
I'll paste an example of the race condition observed in the PR notes.
Signed-off-by: David Galloway <dgallowa@redhat.com>
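The retry approach described in the commit above could look roughly like this. It is a sketch, not the rc.local code: the `retry` helper, the delay argument, and the host name in the usage comment are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between tries. Returns 0 on the first success, 1 if every try fails.
# Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Only sleep if another try is coming.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Possible use in rc.local (host name and counts are placeholders):
#   retry 10 1 ping -c 1 -W 1 cobbler.example.com || exit 1
```

Retrying several short pings tolerates a brief gap between dhclient finishing and the network actually passing traffic, where a single long timeout would not.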
David Galloway [Mon, 26 Feb 2018 18:56:58 +0000 (13:56 -0500)]
pcp: Disable role for now
With the addition of RHEL to Sepia, teuthology will be running
cephlab.yml on unregistered RHEL testnodes. Since the PCP playbook gets run
before the testnodes playbook, RHEL systems in Sepia won't be registered
to our Satellite yet and PCP installation fails.
We're not currently using PCP so we can disable the role and save some
time and headache.
Signed-off-by: David Galloway <dgallowa@redhat.com>
David Galloway [Thu, 1 Feb 2018 20:42:36 +0000 (15:42 -0500)]
nameserver: Let records tasks coexist with DDNS
It takes about 3 minutes for ansible to compile all the zone files.
That was causing nsupdate/DDNS to overwrite any new records we wanted to
add or change before named could be reloaded.
This PR:
- Writes zone files to a temporary location
- Dumps pending DDNS changes into zone files
- Freezes DDNS zone files from updates
- Moves temporary zone files into place all at once
- Unfreezes DDNS zone files
This results in a window of about 3 seconds during which DDNS updates
will be refused, which isn't great, but we can at least update records
while OVH jobs are running now.
Signed-off-by: David Galloway <dgallowa@redhat.com>
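The sequence listed in the commit above can be sketched with `rndc`. This is an illustration under assumptions, not the role's tasks: it assumes BIND 9.9 or later (for `rndc sync`), the `swap_zone_files` helper and the overridable `RNDC` variable are hypothetical, and how the staged files pick up the dynamic records is outside the sketch.

```shell
#!/bin/sh
# RNDC can be overridden (e.g. RNDC=echo) for a dry run.
RNDC=${RNDC:-rndc}

# Hypothetical helper. Usage: swap_zone_files STAGING_DIR LIVE_DIR ZONE...
swap_zone_files() {
    staging=$1
    live=$2
    shift 2
    status=0

    # Flush pending dynamic updates to disk, then refuse new ones.
    for zone in "$@"; do
        if ! $RNDC sync -clean "$zone" || ! $RNDC freeze "$zone"; then
            status=1
            break
        fi
    done

    # Move every pre-built zone file into place in one short step.
    if [ "$status" -eq 0 ]; then
        mv "$staging"/* "$live"/ || status=1
    fi

    # Always thaw, even after a failure, so DDNS is not left frozen.
    for zone in "$@"; do
        $RNDC thaw "$zone" || status=1
    done
    return "$status"
}
```

Because the slow step (compiling zone files) happens in the staging directory, the zones are frozen only for the duration of the move.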
David Galloway [Fri, 19 Jan 2018 20:31:03 +0000 (15:31 -0500)]
cobbler: Use MAC address specified in ansible inventory instead of eth0
I concede. Name it whatever you want, RHEL.
This will allow the OS to use "predictable naming" during anaconda
and after firstboot, preventing NIC names from switching as seen in
http://tracker.ceph.com/issues/22732 and http://tracker.ceph.com/issues/22643
Signed-off-by: David Galloway <dgallowa@redhat.com>
David Galloway [Wed, 10 Jan 2018 21:17:55 +0000 (16:17 -0500)]
cobbler: Remove DHCP config for NICs if ifup fails in rc.local
An issue was discovered where rc.local bails if a testnode has multiple
NICs cabled but not every NIC has a DHCP reservation. For example,
some of the magnas have a second NIC cabled, but to a tagged port on
the switch so it can pass traffic via multiple VLANs.
Fixes: http://tracker.ceph.com/issues/22651
Signed-off-by: David Galloway <dgallowa@redhat.com>
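One way to picture the behavior described in the commit above: try to bring each NIC up, and drop the config for any NIC that fails. This is a sketch only; the `ifup_or_forget` helper, the optional command override, and the example path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: try to bring up one NIC. If that fails, delete its
# config file so the NIC is not configured for DHCP on later boots.
# Usage: ifup_or_forget NIC CONFIG_FILE [IFUP_COMMAND]
ifup_or_forget() {
    nic=$1
    cfg=$2
    ifup_cmd=${3:-ifup}     # overridable so the logic can be dry-run
    if "$ifup_cmd" "$nic"; then
        return 0
    fi
    echo "ifup $nic failed; removing $cfg" >&2
    rm -f "$cfg"
}

# Possible use (path is a placeholder for a RHEL-style layout):
#   ifup_or_forget eth1 /etc/sysconfig/network-scripts/ifcfg-eth1
```

The helper returns success after removing the file, so a NIC without a DHCP reservation no longer aborts the rest of the script.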
David Galloway [Wed, 15 Nov 2017 15:53:04 +0000 (10:53 -0500)]
vmhost: Allow task to fail but ignore errors
With a recent update to ansible, the changed task would never return a
'failed' result with `failed_when` set. `ignore_errors` is what we want
so the task fails but the playbook proceeds.
Signed-off-by: David Galloway <dgallowa@redhat.com>
David Galloway [Fri, 29 Sep 2017 22:50:20 +0000 (18:50 -0400)]
testnode: Avoid job failure due to fs loop
Example:
failed: [smithi205.front.sepia.ceph.com] (item=/var/run/) => {"changed": false, "cmd": "find /var/run/ -name \"*ceph*\"", "delta": "0:00:00.004976", "end": "2017-09-29 18:42:23.050412", "failed": true, "failed_when_result": true, "item": "/var/run/", "rc": 1, "start": "2017-09-29 18:42:23.045436", "stderr": "find: File system loop detected; '/var/run/rpc_pipefs/gssd' is part of the same file system loop as '/var/run/rpc_pipefs'.", "stderr_lines": ["find: File system loop detected; '/var/run/rpc_pipefs/gssd' is part of the same file system loop as '/var/run/rpc_pipefs'."], "stdout": "", "stdout_lines": []}
Signed-off-by: David Galloway <dgallowa@redhat.com>
David Galloway [Fri, 6 Oct 2017 17:29:04 +0000 (13:29 -0400)]
nameserver: Double max amount of concurrent connections
I observed an unintentional DoS on ns1.front last night right as most of
the nightly scheduled jobs started up. Lots of "nf_conntrack: table
full, dropping packet" messages in the syslog.
Doubling it should be safe.
Signed-off-by: David Galloway <dgallowa@redhat.com>
David Galloway [Thu, 5 Oct 2017 20:13:31 +0000 (16:13 -0400)]
testnode: Shuffle tasks around to make sure packages install first
I moved lvm.conf in https://github.com/ceph/ceph-cm-ansible/pull/342
because I wanted all the disk management tasks clustered together. I
failed to take into account the fact that the lvm2 package might not be
installed yet (like on OVH nodes).
Signed-off-by: David Galloway <dgallowa@redhat.com>