Zack Cerza [Thu, 20 Jan 2022 21:58:48 +0000 (14:58 -0700)]
update-requirements.sh: Write intermediate file
The previous method was only writing the .txt file; when debugging
dependency issues you'll often want to see the intermediate. I don't
see a reason to not always write it.
Kyr Shatskyy [Tue, 1 Feb 2022 16:08:27 +0000 (17:08 +0100)]
kill: get machine type from paddles for the run
When calling kill for a run that is still in the beanstalkd
queue, no job directories have been created in the archive,
so there is no way to find out which machine type to use.
teuthology-kill then reports that you must manually provide
the machine type for the run, which is often boring and quite
inconvenient.
Instead of torturing the user to recall a tube for the run name,
we just ask paddles whether it has anything logged and use the
machine type from what it returns.
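A minimal sketch of the idea, not the actual teuthology code: given the job records paddles returns for a run, pick the machine type they agree on. The function name and the `machine_type` key are assumptions made for illustration.

```python
def machine_type_from_jobs(jobs):
    """Return the one machine type shared by all jobs, or None.

    `jobs` is assumed to be a list of dicts, each possibly carrying a
    'machine_type' key (hypothetical shape of the paddles response).
    """
    types = {job.get("machine_type") for job in jobs if job.get("machine_type")}
    if len(types) == 1:
        return types.pop()
    # Nothing logged, or conflicting answers: the caller must ask the user.
    return None
```

Returning None for an empty or ambiguous answer leaves room to fall back to the old behavior of asking for the machine type explicitly.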
Kamoltat [Tue, 25 Jan 2022 01:02:56 +0000 (01:02 +0000)]
docs/docker-compose: add ansible inventory to README
Added instructions that help users
set up ansible inventory files after they have
built their local teuthology container. This
part is needed when ansible tasks are executed
while jobs are running.
Patrick Donnelly [Thu, 20 Jan 2022 15:11:48 +0000 (10:11 -0500)]
teuthology: convert deprecated method name
Avoiding this warning:
/home/runner/work/teuthology/teuthology/teuthology/task/install/__init__.py:285: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Dan Mick [Tue, 16 Nov 2021 01:37:23 +0000 (17:37 -0800)]
test_misc.py: fix bad assumption about LogRecord fields
The test was using LogRecord's asctime attribute to calculate a time
difference between two log entries. Although the attribute is documented
with no caveat, others have run into the problem that it does not exist
on logging.LogRecord unless a formatter with a format string referencing
{asctime} has been used. Since there's a 'created' time that's more
appropriate for this test anyway, use that instead.
This commit enables updating pytest, because pytest's logging init
code has changed: https://github.com/pytest-dev/pytest/discussions/9324
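The asctime caveat can be reproduced with the standard library alone. This is an illustrative sketch, not the test from test_misc.py; the `make_record` helper and the record contents are made up for the example.

```python
import logging


def make_record(message):
    # Build a LogRecord directly, with no formatter involved.
    return logging.LogRecord(
        name="example", level=logging.INFO, pathname="example.py",
        lineno=1, msg=message, args=None, exc_info=None,
    )


first = make_record("start")
second = make_record("end")

# 'created' is set when the record is constructed (seconds since the epoch),
# so it is always available for computing a time difference.
elapsed = second.created - first.created

# 'asctime' only appears after a formatter whose format string references it
# has formatted the record.
has_asctime_before = hasattr(first, "asctime")
logging.Formatter("%(asctime)s %(message)s").format(first)
has_asctime_after = hasattr(first, "asctime")
```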
Dan Mick [Tue, 9 Nov 2021 19:49:09 +0000 (11:49 -0800)]
teuthology/packaging.py: fix build_complete: search for requested arch
The workaround from https://github.com/ceph/teuthology/pull/1649 was
necessary because my original algorithm was faulty: when searching
through all the builds for a ref/sha1, one must match the arch
requested by the call to build_complete (in the Builder object);
that arch's presence in the shaman api/search result is not enough
of a match, as it can contain multiple arches in multiple states
of build success. Only a failure *on the requested arch* should be
considered a "requested build not complete".
(note: this will still currently fail a request for a build whose
repo is complete but container build failed, as "build complete"
currently conflates those two statuses. Teuthology does not
contain the information whether a build is being requested for
packages, containers, or both.)
Also add testing for build_complete().
Fixes: https://tracker.ceph.com/issues/53205
Signed-off-by: Dan Mick <dmick@redhat.com>
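The matching rule described above could be sketched as follows. This is an assumption-laden illustration, not the code in teuthology/packaging.py: the function name, the `arch` and `status` keys, and the "completed" value are all hypothetical stand-ins for whatever the shaman response actually contains.

```python
def requested_build_complete(builds, requested_arch):
    """Decide completeness from the build entry for the requested arch only.

    `builds` is assumed to be a list of dicts with hypothetical keys
    'arch' and 'status'; entries for other arches are ignored.
    """
    for build in builds:
        if build.get("arch") == requested_arch:
            return build.get("status") == "completed"
    # No entry at all for the requested arch: treat as not complete.
    return False
```

The point of the sketch is that a failed build on some other arch never influences the answer for the requested one.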
Sage Weil [Thu, 28 Oct 2021 13:58:39 +0000 (08:58 -0500)]
task/internal/__init__: print core file output before splitting
Debugging this failure:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/task/internal/__init__.py", line 398, in archive
    fetch_binaries_for_coredumps(path, rem)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/task/internal/__init__.py", line 320, in fetch_binaries_for_coredumps
    dump_program = dump_out.split("from '")[1].split(' ')[0]
IndexError: list index out of range
...on output that should look like this:
./remote/smithi084/coredump/1635398181.133353.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/bin/podman stop ceph-462d7c58-37ab-11ec-8c28-001a4aab830c-node-exporter-smithi', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/bin/podman', platform: 'x86_64'
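For reference, here is one way the program name could be extracted without raising IndexError when the "from '" marker is missing. This is a hypothetical alternative for illustration only; the commit itself just prints the output before splitting, and the function name below is made up.

```python
import re


def program_from_file_output(dump_out):
    """Return the program path following "from '", or None if it is absent."""
    # Capture everything after "from '" up to the first space or quote.
    match = re.search(r"from '([^' ]+)", dump_out)
    return match.group(1) if match else None
```

On output of the expected shape this returns the first word after the marker, which is what the split-based expression produces; on unexpected output it returns None rather than failing.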
Zack Cerza [Wed, 20 Oct 2021 18:51:30 +0000 (12:51 -0600)]
supervisor: Don't unlock nodes w/ bad description
Very rarely, we enter a situation where nodes get used by two jobs
simultaneously. We can break this cycle if jobs refuse to unlock a
node that is locked by a different job.
This will not entirely prevent the problem, but it will keep it from
perpetuating itself.
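The guard might look roughly like this. It is a sketch under stated assumptions: the function name, the shape of the node status dicts, and the `description` key are hypothetical and not taken from the supervisor code.

```python
def nodes_safe_to_unlock(node_statuses, job_description):
    """Return the names of nodes whose lock description matches this job.

    `node_statuses` is assumed to map node name -> dict with a
    'description' key (hypothetical shape).
    """
    safe = []
    for name, status in node_statuses.items():
        if status.get("description") == job_description:
            safe.append(name)
        # Otherwise the node is locked by a different job: leave it alone.
    return safe
```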
gitbuilder has been replaced by the shaman project and is no longer used.
TestShamanProject already covers the same functional testing,
so there is no need to maintain tests for obsolete code.
s/basic/default: we no longer need the mapping of basic to default.
In the end, basic is substituted with default when trying to do a URI search anyway,
so we can remove this mangling and use the default flavor directly;
we no longer use the 'basic' flavor to determine the kernel flavor.
This addresses the comment:
FIXME: ceph flavor and kernel flavor are separate things
Remove the basic -> default (flavor) mapping and update s/basic/default in docs.
Sage Weil [Thu, 7 Oct 2021 15:03:22 +0000 (10:03 -0500)]
tasks/kernel: add 'hwe' config flag for ubuntu distro hwe kernel
The hwe kernel supports nvme_loop, but the non-hwe kernel does not.
I don't want to futz with the 'distro' moniker (although that is another
valid approach) because only some tests need hwe, and I can imagine a
situation where we want to run tests on both kernels. Note that this
flag doesn't rule out adding support for something like
'-k distro-hwe' later.
In unlock_one, we currently have a retry mechanism that is only triggered on a particular exception. With this change, we retry the request to unlock no matter what the cause of failure.
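A generic sketch of retrying on any failure, as opposed to one particular exception type. This is not the unlock_one implementation; the helper name, the attempt count, and the delay are assumptions chosen for the example.

```python
import time


def retry_call(func, attempts=3, delay=0.0):
    """Call func() up to `attempts` times, retrying on any Exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # deliberately broad: retry whatever the cause
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```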
The function `try_push_job_info()` does not
update the `job_info` dictionary properly. We
want to update `job_info` with `extra_info`;
however, lines 498 and 499 assign
`job_info` a copy of `extra_info` and then update
`job_info` with `job_config`, which is incorrect.
Instead, we should assign `job_info`
a copy of `job_config` and update `job_info` with
`extra_info`.
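The difference between the two orderings is easy to see with plain dictionaries. The values below are invented for illustration; only the names `job_info`, `job_config`, and `extra_info` come from the message above.

```python
import copy

job_config = {"job_id": "1", "status": "running"}  # made-up example values
extra_info = {"status": "dead"}

# Incorrect order: job_config is applied last, so it overwrites extra_info.
wrong_job_info = copy.deepcopy(extra_info)
wrong_job_info.update(job_config)

# Intended order: start from job_config, then let extra_info override it.
job_info = copy.deepcopy(job_config)
job_info.update(extra_info)
```

With `dict.update`, keys from the argument win, so whichever dictionary is applied last determines the final value of any shared key.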
The previous behavior was causing machines to get nuked before any
attempt to fetch logs. If a machine took longer than 60s to become
available, collecting logs would fail. Since we also nuke after this
step, don't bother here.