teuthology/misc: make Valgrind's early exit configurable.
This commit is a follow-up to a98eb3e1405c8ca8f6933eb0356c03955e4e2e83
where Valgrind has been configured to exit on first-seen error as it
was (wrongly!) assumed that all components are green when it comes
to the Valgrind verification.
This assumption turned out to be wrong for RGW, which accumulated a few
issues while the Valgrind checks were knocked out as a side effect of
the python3 transition [1]. As a consequence, multiple problems piled
up, so a mechanism to disable the early exit, e.g. to compile a list of
these issues, looks desirable.
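A minimal sketch of how such a switch might look. The helper name valgrind_args, its exit_on_first_error parameter and the exit code 42 are assumptions made for illustration, not the actual teuthology code; --exit-on-first-error and --error-exitcode are Valgrind options, and the former needs a nonzero --error-exitcode to be set.

```python
def valgrind_args(tool="memcheck", exit_on_first_error=True):
    """Build a Valgrind command prefix (hypothetical helper, not teuthology's API)."""
    args = ["valgrind", "--tool={}".format(tool)]
    if exit_on_first_error:
        # Valgrind requires a nonzero --error-exitcode when exiting on the
        # first error; 42 is an arbitrary value chosen for this sketch.
        args += ["--exit-on-first-error=yes", "--error-exitcode=42"]
    return args
```

Calling valgrind_args(exit_on_first_error=False) lets a component that is not yet clean run to completion so that all of its errors can be collected.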
Josh Durgin [Sun, 17 Jan 2021 02:31:33 +0000 (21:31 -0500)]
worker, run: use exact commits for teuthology and qa suite
This ensures we use the same version across all jobs in a run.
We already have suite_sha1 set by older versions of teuthology, but
for folks who haven't updated their suite command, and thus don't set
teuthology_sha1 in the job config, look up the sha1 in the worker.
Josh Durgin [Sun, 17 Jan 2021 02:21:02 +0000 (21:21 -0500)]
repo_utils: allow fetching a specific sha1 to per-commit directories
Using a checkout of a single branch used by potentially many
workers/teuthology processes can result in errors when one job
updates the local branch while another job is reading it.
This causes issues particularly easily when using non-master
teuthology branches, and with the teuthology-dispatcher.
This also allows us to guarantee we're using the same
version across an entire run, even if e.g. the master
qa suite is updated between jobs.
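The idea of per-commit directories can be sketched as follows. The function name checkout_path and the directory naming scheme are hypothetical; the real layout used by repo_utils may differ.

```python
import os


def checkout_path(base_dir, repo_name, branch, commit=None):
    """Return the local directory for a checkout (hypothetical layout).

    With a commit, every sha1 gets its own directory, so a job reading one
    checkout is never disturbed by another job updating the branch.
    """
    suffix = commit if commit else branch
    return os.path.join(base_dir, "{}_{}".format(repo_name, suffix))
```

Two jobs of the same run that pass the same sha1 resolve to the same directory, while a later update of the branch lands in a different one.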
Josh Durgin [Sun, 17 Jan 2021 02:04:08 +0000 (21:04 -0500)]
repo_utils: allow checking out a specific commit
Since we're cloning a particular branch with git clone --shallow,
assume we're still passed a branch that contains the commit. Otherwise
we'd waste time and space cloning all the branches in the repo.
Assume that this is only used for checking out a particular sha1 once,
to avoid repetitive work.
There are two use cases for these utilities:
1) on the user's machine when they're scheduling a suite - there it
will make sense to maintain a single checkout of a branch e.g. teuthology master
2) on the queue consumer side - here it's best if we use the same commit for an
entire run, so checking out by sha1 makes more sense
Josh Durgin [Wed, 13 Jan 2021 03:33:38 +0000 (22:33 -0500)]
exceptions: only use one of label or command for fingerprint
Commands like those running workunits include the ceph sha1 being
tested, so they're not useful for grouping. This also lets us group
together other tests if we like, for example to map tests with small
differences in configuration to the same fingerprint for sentry.
Also use the plain command: it's already a string at this point,
so there's no reason to add spaces between its characters.
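A minimal sketch of the grouping rule described above. The function name and signature are hypothetical, and keeping the exit status in the key is an assumption carried over from the earlier commit rather than something this message confirms.

```python
def fingerprint(label=None, command="", exitstatus=None):
    """Return a grouping key for sentry (hypothetical stand-in, not the real method)."""
    # Prefer the label when one is given; otherwise use the plain command string.
    key = label if label else command
    return [key, "exit status {}".format(exitstatus)]


# Joining a string with spaces iterates over its characters, which is the
# problem the last paragraph of the commit message refers to.
spaced = " ".join("ls")
```

With a label, workunit commands that differ only in the ceph sha1 they embed map to the same key.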
Josh Durgin [Mon, 4 Jan 2021 15:43:54 +0000 (10:43 -0500)]
exceptions: group CommandFailedErrors in sentry more finely
By default sentry uses the stack trace / error type / rough error
message, which ends up with many failures from different workunits
grouped together. Include the actual command run, the exit status, and
the optional label to group these more accurately. This will group
failures of the same workunit together, for example.
Kyr Shatskyy [Fri, 11 Dec 2020 17:48:16 +0000 (18:48 +0100)]
scripts: add wait script for watching run
When using teuthology-suite with the --wait option, it is
sometimes useful to separate scheduling the suite from
waiting for the run. For example, when using tools like
Jenkins we might want to schedule a suite, report
that it was scheduled successfully, and start waiting only in
the next steps.
* misc: add an optional write_to argument to misc.pull_directory()
so the caller can optionally specify the function to write
to local file.
* task/internal: add a global option "log-compress-min-size" which
defaults to "128MB". If the size of a file pulled from a remote
host is greater than or equal to the specified size, it is
compressed with gzip and given the ".gz" extension before being
stored in the archive directory.
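The size check can be sketched like this. The function store_file is hypothetical and works on local files only; reading "128MB" as 128 * 1024 * 1024 bytes is also an assumption.

```python
import gzip
import os
import shutil

MIN_SIZE = 128 * 1024 * 1024  # assumed interpretation of "128MB"


def store_file(src, archive_dir, min_size=MIN_SIZE):
    """Copy src into archive_dir, gzipping it when it is at least min_size bytes."""
    dest = os.path.join(archive_dir, os.path.basename(src))
    if os.path.getsize(src) >= min_size:
        dest += ".gz"
        with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        shutil.copyfile(src, dest)
    return dest
```

Files below the threshold are stored unchanged, so small logs stay directly readable in the archive.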
Kyr Shatskyy [Mon, 30 Nov 2020 15:27:10 +0000 (16:27 +0100)]
orchestra: introduce quiet mode for remote.run
Applied changes:
- Add quiet option to remote.run and subsidiary function calls
- Logging commands now directed to DEBUG instead of INFO logger
This is useful when we want to suppress logs for certain commands, such as
those reading binary files or writing useless data to stdout/stderr, or
those dumping sensitive information.
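One way to picture the quiet option is below. The logger name and both helpers are hypothetical, and tying the DEBUG level to quiet=True is an assumption about how the two listed changes relate.

```python
import logging

log = logging.getLogger("example.orchestra.run")  # hypothetical logger name


def log_level_for(quiet=False):
    """Pick the level used to log a command line (assumed behaviour)."""
    return logging.DEBUG if quiet else logging.INFO


def log_command(command, quiet=False):
    # With quiet=True the command only shows up when DEBUG logging is enabled.
    log.log(log_level_for(quiet), "Running: %s", command)
```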
Out of the box, the CentOS 8 ssh daemon creates this file,
/etc/ssh/ssh_host_ecdsa_key.pub
containing a key of type "ecdsa-sha2-nistp256", which was
not recognized by the existing teuthology logic.
Use logic in paramiko.hostkeys to recognize the new key types.
Dan Mick [Mon, 2 Nov 2020 22:11:26 +0000 (22:11 +0000)]
Check shaman not only for repo but for build complete
The package repo might be done, but the build must be totally
complete for the container to be present. For a specific build
(in global variables, for now), demand that the build be complete
before claiming packages exist. This allows suite jobs to
fail or --newest to continue searching when the container isn't
done.
Jason Dillaman [Tue, 6 Oct 2020 20:33:43 +0000 (16:33 -0400)]
task/install/rpm: fix issues with upgrading packages under DNF
yum is an alias for dnf on modern RHEL-like systems but it's not
100% compatible. The "install" command does not upgrade packages
unless you give it a specific version to install. Additionally,
the code was incorrectly inserting two blank arguments that resulted
in "yum install" failing.
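Both problems can be illustrated with a small sketch. The helper build_install_args and its parameters are hypothetical and do not mirror the actual task/install/rpm code.

```python
def build_install_args(packages, version=None, extra_args=()):
    """Build a yum install command line (hypothetical helper)."""
    # Drop empty strings so that no blank argument reaches yum.
    args = ["sudo", "yum", "-y"] + [a for a in extra_args if a]
    args.append("install")
    if version:
        # Naming an explicit version makes "install" move an already
        # installed package to that version.
        packages = ["{}-{}".format(p, version) for p in packages]
    return args + list(packages)
```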
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
task/ssh_keys: use remote.write_file instead of misc.create_file
When given data, misc.create_file creates the file, changes its
permissions, and only then appends the data using
misc.append_lines_to_file. This sequence fails if the file
is created without write permissions:
2020-09-05T10:00:32.003 INFO:teuthology.task.ssh_keys:pushing keys to smithi086.front.sepia.ceph.com for ubuntu
2020-09-05T10:00:32.003 INFO:teuthology.orchestra.run.smithi086:> rm -f -- /home/ubuntu/.ssh/id_rsa
2020-09-05T10:00:32.046 INFO:teuthology.orchestra.run.smithi086:> touch /home/ubuntu/.ssh/id_rsa && chmod 500 -- /home/ubuntu/.ssh/id_rsa
2020-09-05T10:00:32.095 INFO:teuthology.orchestra.run.smithi086:> set -ex
2020-09-05T10:00:32.096 INFO:teuthology.orchestra.run.smithi086:> dd of=/home/ubuntu/.ssh/id_rsa conv=notrunc oflag=append
2020-09-05T10:00:32.140 INFO:teuthology.orchestra.run.smithi086.stderr:+ dd of=/home/ubuntu/.ssh/id_rsa conv=notrunc oflag=append
2020-09-05T10:00:32.142 INFO:teuthology.orchestra.run.smithi086.stderr:dd: failed to open '/home/ubuntu/.ssh/id_rsa': Permission denied
2020-09-05T10:00:32.142 DEBUG:teuthology.orchestra.run:got remote process result: 1
However, we now have the more advanced remote.write_file function,
which does not have this issue and, moreover, creates the file
with the provided data in a single hop without trying to
download the file locally.
yaml.safe_load reads the opened fail_log file
and moves the offset to the end of the stream.
In case of an error, however, we need to move the offset
back to the beginning of the file stream so that we can read
the data again.
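The rewind can be sketched as follows. To keep the example free of third-party dependencies, json.load stands in for yaml.safe_load (both consume the stream), and the function name load_or_raw is hypothetical.

```python
import io
import json


def load_or_raw(stream):
    """Parse the stream, falling back to its raw text if parsing fails."""
    try:
        # json.load is a stand-in for yaml.safe_load in this sketch.
        return json.load(stream)
    except ValueError:
        # The failed parse left the offset at the end of the stream;
        # rewind before reading the data again.
        stream.seek(0)
        return stream.read()
```

Without the seek(0) call, the second read would return an empty string.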
Deprecate misc.write_file() and misc.sudo_write_file()
in favor of the orchestra.remote package methods.
The misc methods now simply call the
remote's ones.
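A minimal sketch of such a deprecated wrapper. FakeRemote is a hypothetical stand-in for a remote object, and the exact signatures of the real misc and remote methods are not reproduced here.

```python
import warnings


class FakeRemote:
    """Hypothetical stand-in for a remote host object."""

    def __init__(self):
        self.files = {}

    def write_file(self, path, data):
        self.files[path] = data


def write_file(remote, path, data):
    """Deprecated wrapper that delegates to the remote's method."""
    warnings.warn(
        "write_file() is deprecated, call remote.write_file() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    remote.write_file(path, data)
```

Existing callers keep working, while the warning points them at the replacement.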
Xiubo Li [Thu, 27 Aug 2020 05:40:52 +0000 (01:40 -0400)]
rpm: retry installing the package if the mirror server is busy
When installing some packages, the mirror server may fail with a
503 code, which means it is temporarily unavailable, and
we should retry later. But the yum tool just skips that mirror and
tries other mirrors, which may not contain the packages.
The cephfs suites may fire many test cases
at the same time, and each test case may use several nodes
to install tens of packages at the same time. This may overload
the mirror server.
We need a safe method to retry the installation.
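A retry loop of this kind might look like the sketch below. MirrorBusy, install_with_retry and the linear back-off are assumptions for illustration; how the actual change detects the 503 response is not shown in this message.

```python
import time


class MirrorBusy(Exception):
    """Hypothetical error raised when the mirror answers with HTTP 503."""


def install_with_retry(install, attempts=5, delay=1.0, sleep=time.sleep):
    """Call install() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return install()
        except MirrorBusy:
            if attempt == attempts:
                raise
            # Wait a little longer after each failure to give the
            # overloaded mirror time to recover.
            sleep(delay * attempt)
```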
Fixes: https://tracker.ceph.com/issues/47166
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Patrick Donnelly [Wed, 26 Aug 2020 14:29:27 +0000 (07:29 -0700)]
task/install: skip package removal by default
Every teuthology job cleans up its package install after completion.
This was necessary cleanup when we didn't reimage boxes before the use
of FOG as we wanted a "clean" slate for the next job to acquire the
machine. Now this is just unnecessary work which takes up valuable
machine time. For one job I looked at, this takes about 2 minutes.
We should still test that there are no unexpected issues with removing
the packages but this can be delegated to a small subset of smoke tests.
That will be posted in another PR to ceph.git.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Kyr Shatskyy [Tue, 25 Aug 2020 07:40:03 +0000 (09:40 +0200)]
kill: find targets for killing a job
In some circumstances teuthology-kill does not unlock the nodes
when it is used to kill just a single job from a run. For example,
while a job is in the middle of locking many targets, its job_info
does not include the targets yet. If someone kills
the job at that point, the targets remain locked.
This commit addresses the issue.