teuthology.git
2 years agolock/query: make robust against paddles errors 1642/head
Josh Durgin [Tue, 20 Apr 2021 05:49:43 +0000 (01:49 -0400)]
lock/query: make robust against paddles errors

Retry paddles requests, and for get_status() return an empty dict
rather than None so callers behave.

get_status() failing in particular has caused the dispatcher and jobs
to fail several times over the past few weeks. With this change, we
should be able to run multiple paddles workers again, since all the
common callers will retry on error.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
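
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual lock/query code: the function name `get_status_with_retry`, its parameters, and the `fetch` callable standing in for the paddles request are all assumptions.

```python
import time


def get_status_with_retry(fetch, attempts=3, delay=0.0):
    """Call fetch() up to `attempts` times; fall back to {} if all fail."""
    for attempt in range(attempts):
        try:
            status = fetch()
        except Exception:
            status = None  # treat any request error as "no answer yet"
        if status is not None:
            return status
        if attempt + 1 < attempts:
            time.sleep(delay)
    # An empty dict lets callers keep using status.get(...) safely.
    return {}


def always_failing_fetch():
    # Hypothetical stand-in for a paddles request that keeps erroring out.
    raise IOError("paddles unavailable")


status = get_status_with_retry(always_failing_fetch)
locked = status.get("locked", False)  # works; would raise on None
```

Returning `{}` rather than `None` means a caller that only does dictionary lookups degrades to "no information" instead of crashing.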
3 years agoMerge pull request #1641 from tchaikov/always-be-happy
Kefu Chai [Mon, 12 Apr 2021 12:29:33 +0000 (20:29 +0800)]
Merge pull request #1641 from tchaikov/always-be-happy

task/internal: do not fail the script if systemd-sysusers core file not found

Reviewed-by: Sage Weil <sage@redhat.com>
3 years agotask/internal: do not fail the script if systemd-sysusers core file not found 1641/head
Kefu Chai [Mon, 12 Apr 2021 05:16:10 +0000 (13:16 +0800)]
task/internal: do not fail the script if systemd-sysusers core file not found

In 79f373c1769ea4f9d744cf33c5b0a0e026922d0f, we started to filter out
the systemd-sysusers core files, but the script fails if no such file
is found, like:

2021-04-12T02:58:51.065 ERROR:teuthology.run_tasks:Manager failed: internal.coredump
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/run_tasks.py", line 176, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/task/internal/__init__.py", line 479, in coredump
    wait=False,
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/run.py", line 479, in wait
    proc.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi165 with status 1: "sudo sysctl -w kernel.core_pattern=core && sudo bash -c 'for f in `find /home/ubuntu/cephtest/archive/coredump
-type f`; do file $f | grep -q systemd-sysusers && rm $f ; done' && rmdir --ignore-fail-on-non-empty -- /home/ubuntu/cephtest/archive/coredump"

In this change, we ensure that the script never fails by adding `|| true`.

Signed-off-by: Kefu Chai <kchai@redhat.com>
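
The failure mode can be reproduced locally with a small sketch. It drives `sh` from Python and assumes `sh` and `grep` are available; the file names `a` and `b`, the `echo` stand-ins, and the exact placement of `|| true` are illustrative assumptions, not the actual script.

```python
import subprocess

# Nothing matches the pattern, so `grep -q ... && ...` leaves the loop
# with a non-zero status.
LOOP_BODY = "echo $f | grep -q systemd-sysusers && echo would-remove $f"


def run(script):
    # Run a shell snippet and capture its exit status and output.
    return subprocess.run(["sh", "-c", script], capture_output=True, text=True)


# Without the guard, the loop's status is that of the last failed grep,
# so the command chained with && never runs.
unguarded = run(f"for f in a b; do {LOOP_BODY}; done && echo cleanup")

# With `|| true`, every iteration succeeds and the chain continues.
guarded = run(f"for f in a b; do {LOOP_BODY} || true; done && echo cleanup")
```

The exit status of a `for` loop is that of the last command it ran, which is why a non-matching file is enough to fail the whole `&&` chain.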
3 years agotask/internal: split embedded shell into lines
Kefu Chai [Mon, 12 Apr 2021 05:15:24 +0000 (13:15 +0800)]
task/internal: split embedded shell into lines

for better readability

Signed-off-by: Kefu Chai <kchai@redhat.com>
3 years agoMerge PR #1634 into master
Patrick Donnelly [Wed, 31 Mar 2021 18:06:08 +0000 (11:06 -0700)]
Merge PR #1634 into master

* refs/pull/1634/head:
orchestra/remote: extend mktemp() to accept data

Reviewed-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
3 years agoorchestra/remote: extend mktemp() to accept data 1634/head
Rishabh Dave [Fri, 26 Mar 2021 09:26:11 +0000 (14:56 +0530)]
orchestra/remote: extend mktemp() to accept data

Extend remote.Remote.mktemp() to accept data as a parameter and write
the data to the temporary file after it is created.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
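
As a rough local analogue of the idea (the real method operates on a remote host; this sketch uses the standard `tempfile` module instead, and the signature shown is an assumption):

```python
import os
import tempfile


def mktemp(suffix="", data=None):
    """Create a temporary file, optionally write `data` to it, return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w") as f:
        if data is not None:
            f.write(data)
    return path


path = mktemp(suffix=".conf", data="debug ms = 1\n")
with open(path) as f:
    contents = f.read()
os.unlink(path)  # clean up the example file
```

Accepting the data directly saves callers a separate write step after creating the file.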
3 years agoMerge pull request #1636 from ideepika/fix-interactive-error
Josh Durgin [Mon, 29 Mar 2021 21:58:24 +0000 (14:58 -0700)]
Merge pull request #1636 from ideepika/fix-interactive-error

check ctx.archive is present or not in yaml config

Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
3 years agocheck ctx.archive is present or not in yaml config 1636/head
Deepika Upadhyay [Mon, 29 Mar 2021 14:46:51 +0000 (20:16 +0530)]
check ctx.archive is present or not in yaml config

This is specifically for interactive-on-error mode, where we usually do
not specify archive_path; without this check, the run fails.

Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
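
A guard of this kind might look like the sketch below. The helper name `archive_subdir` and the use of `SimpleNamespace` as a stand-in for the run context are assumptions made for illustration only.

```python
import posixpath
from types import SimpleNamespace


def archive_subdir(ctx, name):
    """Return a path under ctx.archive, or None when no archive is configured."""
    archive = getattr(ctx, "archive", None)
    if archive is None:
        return None  # e.g. interactive-on-error runs without archive_path
    return posixpath.join(archive, name)


with_archive = archive_subdir(SimpleNamespace(archive="/tmp/run"), "remote")
without_archive = archive_subdir(SimpleNamespace(), "remote")
```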
3 years agoMerge pull request #1633 from jdurgin/wip-retry-paddles-writes
Josh Durgin [Thu, 25 Mar 2021 17:05:33 +0000 (10:05 -0700)]
Merge pull request #1633 from jdurgin/wip-retry-paddles-writes

report, lock.ops: retry write requests to paddles

Reviewed-by: Neha Ojha <nojha@redhat.com>
3 years agoreport, lock.ops: retry write requests to paddles wip-retry-paddles-writes 1633/head
Josh Durgin [Sun, 21 Mar 2021 22:28:52 +0000 (18:28 -0400)]
report, lock.ops: retry write requests to paddles

For more contended cases of updating job status and machine keys,
where we've seen 500 errors from DB conflicts, use random intervals
for the retries.

This is the teuthology half of fixing:
https://tracker.ceph.com/issues/49864

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
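
Retrying with random intervals can be sketched as below. The function name, the default bounds, and the injectable `sleep` argument are assumptions for illustration; they are not taken from report or lock.ops.

```python
import random
import time


def retry_with_jitter(operation, attempts=5, min_wait=1.0, max_wait=5.0,
                      sleep=time.sleep):
    """Retry `operation`, waiting a random interval between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt + 1 == attempts:
                raise  # out of attempts: surface the last error
            # Random waits keep competing writers from retrying in lockstep.
            sleep(random.uniform(min_wait, max_wait))


# Hypothetical write that fails twice (e.g. with a 500) and then succeeds.
outcomes = iter([RuntimeError("500"), RuntimeError("500"), "updated"])


def flaky_write():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


waits = []
result = retry_with_jitter(flaky_write, sleep=waits.append)
```

Passing `waits.append` as `sleep` records the chosen intervals instead of actually pausing, which keeps the example fast.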
3 years agoMerge pull request #1632 from ceph/revert-nuke
Sage Weil [Sun, 21 Mar 2021 18:16:35 +0000 (13:16 -0500)]
Merge pull request #1632 from ceph/revert-nuke

Revert "Merge pull request #1631 from jdurgin/wip-nuke-poweroff"

3 years agoRevert "Merge pull request #1631 from jdurgin/wip-nuke-poweroff" 1632/head
Sage Weil [Sun, 21 Mar 2021 16:39:13 +0000 (11:39 -0500)]
Revert "Merge pull request #1631 from jdurgin/wip-nuke-poweroff"

This reverts commit c48eb744081d22bc82d7d099d4edb67ae02551e0, reversing
changes made to b96569170f15eae4604f361990ea65737b28dff1.

This is causing log gzipping to fail because the logs already exist as .gz files.
My guess is that the logs are left over from a previous run, but I'm not sure how
that would happen.

In any case, the merge of this PR corresponds exactly to when we started seeing
the log gzip failures.

Signed-off-by: Sage Weil <sage@newdream.net>
3 years agoMerge pull request #1631 from jdurgin/wip-nuke-poweroff
Josh Durgin [Fri, 19 Mar 2021 22:50:18 +0000 (15:50 -0700)]
Merge pull request #1631 from jdurgin/wip-nuke-poweroff

nuke: don't power-off machines when not rebooting

Reviewed-by: Neha Ojha <nojha@redhat.com>
3 years agonuke: don't power-off machines when not rebooting 1631/head
Josh Durgin [Fri, 19 Mar 2021 21:01:20 +0000 (21:01 +0000)]
nuke: don't power-off machines when not rebooting

This ensures jobs that time out can still have their logs gathered.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1628 from ceph/ignore-systemd-sysusers-core
Josh Durgin [Sat, 13 Mar 2021 03:30:42 +0000 (19:30 -0800)]
Merge pull request #1628 from ceph/ignore-systemd-sysusers-core

task/internal: ignore systemd-sysusers core file

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
3 years agotask/internal: ignore systemd-sysusers core file 1628/head
Sage Weil [Fri, 12 Mar 2021 17:58:47 +0000 (11:58 -0600)]
task/internal: ignore systemd-sysusers core file

This is related to dnsmasq.  When installing the kubic podman 3.0.1
packages,

  Running scriptlet: dnsmasq-2.79-13.el8_3.1.x86_64 14/16
/var/tmp/rpm-tmp.6MFp00: line 5:  9079 Segmentation fault      (core dumped) systemd-sysusers -  &> /dev/null <<SYSTEMD_INLINE_EOF
u dnsmasq - "Dnsmasq DHCP and DNS server" /var/lib/dnsmasq
SYSTEMD_INLINE_EOF

  Installing       : dnsmasq-2.79-13.el8_3.1.x86_64 14/16
warning: group dnsmasq does not exist - using root
warning: group dnsmasq does not exist - using root
warning: group dnsmasq does not exist - using root

  Running scriptlet: dnsmasq-2.79-13.el8_3.1.x86_64 14/16
/var/tmp/rpm-tmp.pfCGxn: line 3:  9089 Segmentation fault      (core dumped) systemd-sysusers &> /dev/null

  Installing       : podman-3.0.1-2.el8.3.2.x86_64 15/16
  Installing       : podman-plugins-3.0.1-2.el8.3.2.x86_64 16/16
  Running scriptlet: container-selinux-2:2.145.0-1.el8.noarch 16/16
  Running scriptlet: podman-plugins-3.0.1-2.el8.3.2.x86_64 16/16
/var/tmp/rpm-tmp.bFfmjl: line 6: 11098 Segmentation fault      (core dumped) /usr/bin/systemd-sysusers
warning: %triggerin(systemd-239-18.el8.x86_64) scriptlet failed, exit status 139

Error in <unknown> scriptlet in rpm package podman-plugins
  Verifying        : dnsmasq-2.79-13.el8_3.1.x86_64 1/16

Nothing to do with us.

Signed-off-by: Sage Weil <sage@newdream.net>
3 years agoMerge pull request #1573 from smithfarm/wip-45570
kyr [Fri, 12 Mar 2021 09:20:23 +0000 (10:20 +0100)]
Merge pull request #1573 from smithfarm/wip-45570

orchestra/console: raise RuntimeError when fail to power on

3 years agoMerge pull request #1627 from ceph/wip-debug-levels
Josh Durgin [Thu, 11 Mar 2021 16:48:34 +0000 (08:48 -0800)]
Merge pull request #1627 from ceph/wip-debug-levels

suite/placeholder.py: lower osd specific debug levels

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
3 years agosuite/placeholder.py: lower osd specific debug levels 1627/head
Neha Ojha [Wed, 10 Mar 2021 23:33:55 +0000 (23:33 +0000)]
suite/placeholder.py: lower osd specific debug levels

Signed-off-by: Neha Ojha <nojha@redhat.com>
3 years agoMerge pull request #1620 from ceph/wip-badone-ceph-ansible-tracker-49485
Brad Hubbard [Tue, 9 Mar 2021 22:20:21 +0000 (08:20 +1000)]
Merge pull request #1620 from ceph/wip-badone-ceph-ansible-tracker-49485

ceph_ansible: Satisfy 'six' dependency

Reviewed-by: Yuri Weinstein <yweins@redhat.com>
3 years agoselinux: fix typo
Sage Weil [Sat, 27 Feb 2021 20:13:30 +0000 (14:13 -0600)]
selinux: fix typo

Signed-off-by: Sage Weil <sage@newdream.net>
3 years agoMerge pull request #1622 from ceph/ignore-selinux-sssd
Sage Weil [Sat, 27 Feb 2021 17:39:02 +0000 (11:39 -0600)]
Merge pull request #1622 from ceph/ignore-selinux-sssd

selinux: ignore issues with sssd

3 years agoselinux: ignore issues with sssd 1622/head
Sage Weil [Sat, 27 Feb 2021 15:26:36 +0000 (09:26 -0600)]
selinux: ignore issues with sssd

['type=AVC msg=audit(1614438637.552:5615): avc: denied { read } for pid=876 comm="sssd" name="resolv.conf" dev="sda1" ino=265261 scontext=system_u:system_r:sssd_t:s0 tcontext=unconfined_u:object_r:admin_home_t:s0 tclass=file permissive=1']

(currently seen on rhel 8.3)

Signed-off-by: Sage Weil <sage@newdream.net>
3 years agoMerge pull request #1621 from kshtsk/wip-math-gcd
kyr [Fri, 26 Feb 2021 22:49:30 +0000 (23:49 +0100)]
Merge pull request #1621 from kshtsk/wip-math-gcd

suite/matrix: use math.gcd instead of fractions.gcd

3 years agorequirements: use ansible 2.9 1621/head
Kyr Shatskyy [Fri, 26 Feb 2021 14:13:59 +0000 (15:13 +0100)]
requirements: use ansible 2.9

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agorequirements: bump up cffi to 1.14.5
Kyr Shatskyy [Fri, 26 Feb 2021 10:20:28 +0000 (11:20 +0100)]
requirements: bump up cffi to 1.14.5

Needed to run on Big Sur with python3.9 from brew; addresses a
build error for the cffi wheel:

    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -DUSE__THREAD -DHAVE_SYNC_SYNCHRONIZE -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/ffi -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/ffi -I/opt/homebrew/include -I/opt/homebrew/opt/openssl@1.1/include -I/opt/homebrew/opt/sqlite/include -I/opt/homebrew/opt/tcl-tk/include -I/Users/kyr/kshtsk/teuthology/virtualenv/include -I/opt/homebrew/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/include/python3.9 -c c/_cffi_backend.c -o build/temp.macosx-11-arm64-3.9/c/_cffi_backend.o
    c/_cffi_backend.c:6185:5: warning: 'PyEval_InitThreads' is deprecated [-Wdeprecated-declarations]
        PyEval_InitThreads();
        ^
    /opt/homebrew/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/include/python3.9/ceval.h:130:1: note: 'PyEval_InitThreads' has been explicitly marked deprecated here
    Py_DEPRECATED(3.9) PyAPI_FUNC(void) PyEval_InitThreads(void);
    ^
    /opt/homebrew/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/include/python3.9/pyport.h:508:54: note: expanded from macro 'Py_DEPRECATED'
    #define Py_DEPRECATED(VERSION_UNUSED) __attribute__((__deprecated__))
                                                         ^
    c/_cffi_backend.c:6245:9: error: implicit declaration of function 'ffi_prep_closure' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        if (ffi_prep_closure(closure, &cif_descr->cif,
            ^
    1 warning and 1 error generated.
    error: command '/usr/bin/clang' failed with exit code 1

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agorequirements.in: stick ansible version to 2.8 version
Kyr Shatskyy [Fri, 26 Feb 2021 13:13:31 +0000 (14:13 +0100)]
requirements.in: stick ansible version to 2.8 version

Since we are not ready for ansible 3 from ceph-cm-ansible point of view:

  2021-02-26T12:45:17.668 INFO:teuthology.task.ansible.out:ERROR! couldn't resolve module/action 'firewalld'. This often indicates a misspelling, missing collection, or incorrect module path.

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agorequirements.in: stick pytest to 3.7.1 version
Kyr Shatskyy [Fri, 26 Feb 2021 11:57:29 +0000 (12:57 +0100)]
requirements.in: stick pytest to 3.7.1 version

Until someone fixes the unit tests.

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agosuite/matrix: latest py3 deprecates fractions.gcd
Kyr Shatskyy [Fri, 26 Feb 2021 10:19:46 +0000 (11:19 +0100)]
suite/matrix: latest py3 deprecates fractions.gcd

Signed-off-by: Kyrylo Shatskyy <kyr@top.local>
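
For context, `fractions.gcd` was deprecated in Python 3.5 and removed in 3.9, while `math.gcd` has been available since 3.5. A minimal sketch of the replacement follows; the `lcm` helper is only an illustrative assumption about how the gcd might be used.

```python
from math import gcd  # replaces the removed fractions.gcd


def lcm(a, b):
    """Least common multiple, built on math.gcd."""
    return a * b // gcd(a, b)


common_divisor = gcd(12, 18)
common_multiple = lcm(4, 6)
```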
3 years agoceph_ansible: Satisfy 'six' dependency 1620/head
Brad Hubbard [Thu, 25 Feb 2021 08:38:31 +0000 (18:38 +1000)]
ceph_ansible: Satisfy 'six' dependency

Fixes: https://tracker.ceph.com/issues/49485
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
3 years agoMerge pull request #1618 from ceph/valgrind-soname 1.1.0
Sage Weil [Thu, 18 Feb 2021 21:58:10 +0000 (15:58 -0600)]
Merge pull request #1618 from ceph/valgrind-soname

misc: make valgrind behave with tcmalloc

3 years agomisc: make valgrind behave with tcmalloc 1618/head
Sage Weil [Thu, 18 Feb 2021 16:23:14 +0000 (10:23 -0600)]
misc: make valgrind behave with tcmalloc

Signed-off-by: Sage Weil <sage@newdream.net>
3 years agoMerge pull request #1617 from ceph/no-fsid-for-state
Sage Weil [Thu, 18 Feb 2021 14:43:46 +0000 (08:43 -0600)]
Merge pull request #1617 from ceph/no-fsid-for-state

orchestra/daemon/state: do not pass fsid property to run() later

3 years agoorchestra/daemon/state: do not pass fsid property to run() later 1617/head
Sage Weil [Wed, 17 Feb 2021 18:47:45 +0000 (13:47 -0500)]
orchestra/daemon/state: do not pass fsid property to run() later

Signed-off-by: Sage Weil <sage@newdream.net>
3 years agoMerge pull request #1616 from ceph/ignore-signal-exceptions
Sage Weil [Wed, 17 Feb 2021 15:59:14 +0000 (09:59 -0600)]
Merge pull request #1616 from ceph/ignore-signal-exceptions

orchestra/daemon/cephadmunit: ignore exception when sending signal

3 years agoorchestra/daemon/cephadmunit: ignore exception when sending signal 1616/head
Sage Weil [Wed, 17 Feb 2021 03:27:32 +0000 (21:27 -0600)]
orchestra/daemon/cephadmunit: ignore exception when sending signal

The osd thrashing is sending lots of signals (sighup) and can easily race with
a daemon shutting down entirely.

This makes us match the behavior of the original state.py signal() method.

Signed-off-by: Sage Weil <sage@newdream.net>
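
The race can be illustrated with a local sketch. The real code signals daemons on remote hosts; here `os.kill` is the default and the `kill` argument is injectable so the "process already gone" case can be simulated safely. All names are assumptions.

```python
import os
import signal


def send_signal(pid, sig, kill=os.kill):
    """Send `sig` to `pid`; report False instead of raising if it is gone."""
    try:
        kill(pid, sig)
    except ProcessLookupError:
        # The daemon exited between the decision to signal it and the call.
        return False
    return True


def kill_vanished(pid, sig):
    # Hypothetical stand-in for os.kill() on a process that has exited.
    raise ProcessLookupError(pid)


delivered = send_signal(12345, signal.SIGHUP, kill=kill_vanished)
```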
3 years agoMerge pull request #1615 from jdurgin/wip-debug-ms
Josh Durgin [Tue, 16 Feb 2021 01:38:24 +0000 (17:38 -0800)]
Merge pull request #1615 from jdurgin/wip-debug-ms

suite: lower debug_ms for osd back to 1

Reviewed-by: Neha Ojha <nojha@redhat.com>
3 years agosuite: lower debug_ms for osd back to 1 1615/head
Josh Durgin [Tue, 16 Feb 2021 00:15:45 +0000 (19:15 -0500)]
suite: lower debug_ms for osd back to 1

This was increased for some mgr issues in
044384be450a557f56a2b39bf7d0e71e69d45cd3, but isn't helping much now
and is filling up disks for long-running tests.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1614 from jdurgin/wip-nuke-tests
Josh Durgin [Sat, 13 Feb 2021 00:29:38 +0000 (16:29 -0800)]
Merge pull request #1614 from jdurgin/wip-nuke-tests

nuke: fix no_reboot only being present in the cli and add unit tests

Reviewed-by: Neha Ojha <nojha@redhat.com>
3 years agotest_nuke: add unit tests for internal nuke options 1614/head
Josh Durgin [Fri, 12 Feb 2021 22:54:17 +0000 (22:54 +0000)]
test_nuke: add unit tests for internal nuke options

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agonuke: only use no_reboot on the cli
Josh Durgin [Fri, 12 Feb 2021 22:53:38 +0000 (22:53 +0000)]
nuke: only use no_reboot on the cli

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1613 from jdurgin/wip-nuke-keep-logs
Josh Durgin [Fri, 12 Feb 2021 18:29:47 +0000 (10:29 -0800)]
Merge pull request #1613 from jdurgin/wip-nuke-keep-logs

nuke: only use keep_logs from the cli

Reviewed-by: Neha Ojha <nojha@redhat.com>
3 years agonuke: only use keep_logs from the cli 1613/head
Josh Durgin [Thu, 11 Feb 2021 22:59:52 +0000 (22:59 +0000)]
nuke: only use keep_logs from the cli

nuke() is called outside of the cli with a ctx that does not include
all the cli args. Use a default parameter for the functions instead of ctx.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
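
The difference between reading an option from ctx and taking it as a defaulted parameter can be sketched as follows. The function `nuke_steps` and the step strings are hypothetical; only the `keep_logs` name comes from the commit.

```python
from types import SimpleNamespace


def nuke_steps(ctx, keep_logs=False):
    """List cleanup steps; keep_logs is a parameter, not a ctx attribute."""
    steps = ["kill leftover processes on " + ", ".join(ctx.machines)]
    if not keep_logs:
        steps.append("remove logs")
    return steps


# A ctx built outside the cli has no keep_logs attribute at all.
internal_ctx = SimpleNamespace(machines=["smithi165"])
default_steps = nuke_steps(internal_ctx)
cli_steps = nuke_steps(internal_ctx, keep_logs=True)
```

Reading `ctx.keep_logs` here would raise `AttributeError` for `internal_ctx`, whereas the default parameter gives non-cli callers the old behaviour.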
3 years agoMerge pull request #1612 from ceph/nicer-ls
Sage Weil [Thu, 11 Feb 2021 22:43:28 +0000 (16:43 -0600)]
Merge pull request #1612 from ceph/nicer-ls

ls: nicer ls output

3 years agols: nicer ls output 1612/head
Sage Weil [Wed, 10 Feb 2021 22:22:53 +0000 (22:22 +0000)]
ls: nicer ls output

- no error when teuthology.log is missing (provisioning)
- leave off pid

Signed-off-by: Sage Weil <sage@redhat.com>
3 years agoMerge pull request #1611 from ceph/dependabot/pip/cryptography-3.3.2
kyr [Thu, 11 Feb 2021 14:15:57 +0000 (15:15 +0100)]
Merge pull request #1611 from ceph/dependabot/pip/cryptography-3.3.2

build(deps): bump cryptography from 3.2 to 3.3.2

3 years agobuild(deps): bump cryptography from 3.2 to 3.3.2 1611/head
dependabot[bot] [Thu, 11 Feb 2021 14:09:41 +0000 (14:09 +0000)]
build(deps): bump cryptography from 3.2 to 3.3.2

Bumps [cryptography](https://github.com/pyca/cryptography) from 3.2 to 3.3.2.
- [Release notes](https://github.com/pyca/cryptography/releases)
- [Changelog](https://github.com/pyca/cryptography/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/3.2...3.3.2)

Signed-off-by: dependabot[bot] <support@github.com>
3 years agoMerge pull request #1609 from ceph/dependabot/pip/httplib2-0.19.0
kyr [Thu, 11 Feb 2021 14:07:51 +0000 (15:07 +0100)]
Merge pull request #1609 from ceph/dependabot/pip/httplib2-0.19.0

build(deps): bump httplib2 from 0.18.0 to 0.19.0

3 years agoMerge pull request #1610 from jdurgin/wip-supervisor-timeouts
Josh Durgin [Tue, 9 Feb 2021 22:10:00 +0000 (14:10 -0800)]
Merge pull request #1610 from jdurgin/wip-supervisor-timeouts

supervisor: improve error handling for dead jobs

Reviewed-by: Andrew Schoen <aschoen@redhat.com>
3 years agosupervisor: send paddles the reason a job is marked dead 1610/head
Josh Durgin [Tue, 9 Feb 2021 21:33:34 +0000 (21:33 +0000)]
supervisor: send paddles the reason a job is marked dead

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agosupervisor: kill processes before gathering logs
Josh Durgin [Tue, 9 Feb 2021 21:16:46 +0000 (21:16 +0000)]
supervisor: kill processes before gathering logs

When we hit the max job timeout, we need to stop the test programs
before collecting logs or else we run into errors like 'file size
changed while zipping' trying to compress them, and we can't save them
or stop the job.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agonuke: allow not rebooting again
Josh Durgin [Tue, 9 Feb 2021 19:24:02 +0000 (19:24 +0000)]
nuke: allow not rebooting again

The default behavior was changed to always reboot in
1d47a121b385e2656e9314e9d63faf68a8e865e4 but the --reboot-all option
remained. Keep the original option around for compatibility with
existing scripts.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agonuke: add option to preserve logs on remote machines
Josh Durgin [Tue, 9 Feb 2021 18:54:28 +0000 (18:54 +0000)]
nuke: add option to preserve logs on remote machines

This will be helpful for killing jobs that hit the max_job_timeout
while still being able to collect logs from them.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agobuild(deps): bump httplib2 from 0.18.0 to 0.19.0 1609/head
dependabot[bot] [Mon, 8 Feb 2021 20:52:33 +0000 (20:52 +0000)]
build(deps): bump httplib2 from 0.18.0 to 0.19.0

Bumps [httplib2](https://github.com/httplib2/httplib2) from 0.18.0 to 0.19.0.
- [Release notes](https://github.com/httplib2/httplib2/releases)
- [Changelog](https://github.com/httplib2/httplib2/blob/master/CHANGELOG)
- [Commits](https://github.com/httplib2/httplib2/compare/v0.18.0...v0.19.0)

Signed-off-by: dependabot[bot] <support@github.com>
3 years agoMerge pull request #1608 from kshtsk/fix-docs
David Galloway [Fri, 5 Feb 2021 17:37:38 +0000 (12:37 -0500)]
Merge pull request #1608 from kshtsk/fix-docs

readme: fix teuthology docs link at docs.ceph.com

3 years agoMerge pull request #1601 from sebastian-philipp/prio-add-job-count
kyr [Fri, 5 Feb 2021 17:19:24 +0000 (18:19 +0100)]
Merge pull request #1601 from sebastian-philipp/prio-add-job-count

teuthology-suite: Add job count to priority error msg.

3 years agoreadme: fix teuthology docs link at docs.ceph.com 1608/head
Kyr Shatskyy [Fri, 5 Feb 2021 17:15:49 +0000 (18:15 +0100)]
readme: fix teuthology docs link at docs.ceph.com

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agoMerge pull request #1607 from kshtsk/ver-1.1.0
kyr [Fri, 5 Feb 2021 16:19:02 +0000 (17:19 +0100)]
Merge pull request #1607 from kshtsk/ver-1.1.0

version: increase version to 1.1.0 since we have dispatcher

3 years agoMerge pull request #1606 from kshtsk/supervisor-log
kyr [Fri, 5 Feb 2021 16:11:44 +0000 (17:11 +0100)]
Merge pull request #1606 from kshtsk/supervisor-log

dispatcher: add .log extension for supervisor log

3 years agoversion: increase version to 1.1.0 since we have dispatcher 1607/head
Kyr Shatskyy [Fri, 5 Feb 2021 16:09:40 +0000 (17:09 +0100)]
version: increase version to 1.1.0 since we have dispatcher

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agodispatcher: add .log extension for supervisor log 1606/head
Kyr Shatskyy [Fri, 5 Feb 2021 16:04:52 +0000 (17:04 +0100)]
dispatcher: add .log extension for supervisor log

It would be great to have an extension for easy log identification.

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agoMerge pull request #1605 from jdurgin/wip-supervisor-connect-error
Dan Mick [Thu, 4 Feb 2021 23:09:08 +0000 (15:09 -0800)]
Merge pull request #1605 from jdurgin/wip-supervisor-connect-error

dispatcher/supervisor: always unlock machines and save status

3 years agodispatcher/supervisor: always unlock machines and save status 1605/head
Josh Durgin [Thu, 4 Feb 2021 22:56:53 +0000 (17:56 -0500)]
dispatcher/supervisor: always unlock machines and save status

If we can't connect to the machines anymore, we still need to clean
up.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1604 from jdurgin/wip-dispatcher-commit-bug
Josh Durgin [Tue, 2 Feb 2021 20:04:42 +0000 (12:04 -0800)]
Merge pull request #1604 from jdurgin/wip-dispatcher-commit-bug

dispatcher/repo_utils: handle missing commits better

Reviewed-by: David Galloway <dgallowa@redhat.com>
3 years agodispatcher: keep operating if preparing a job fails 1604/head
Josh Durgin [Tue, 2 Feb 2021 19:57:34 +0000 (14:57 -0500)]
dispatcher: keep operating if preparing a job fails

prep_job() handles updating the job status already.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agorepo_utils: clone entire branch if commit is specified
Josh Durgin [Tue, 2 Feb 2021 19:48:47 +0000 (14:48 -0500)]
repo_utils: clone entire branch if commit is specified

If the commit is not the head of the branch, we need more history to be
able to check it out.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
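
One way to picture the change is as a choice between two clone invocations. The sketch below only builds the command lists and does not run git; `--depth`, `--single-branch`, `--branch` and `-C` are standard git options, but the helper itself and its exact choice of flags are assumptions, not the repo_utils implementation.

```python
def clone_commands(url, branch, dest, commit=None):
    """Build git commands: shallow clone for a branch tip, full branch for a commit."""
    clone = ["git", "clone", "--branch", branch]
    if commit is None:
        clone += ["--depth", "1"]  # the tip of the branch is enough
    else:
        clone += ["--single-branch"]  # whole branch history, but only this branch
    clone += [url, dest]
    commands = [clone]
    if commit is not None:
        commands.append(["git", "-C", dest, "checkout", commit])
    return commands


tip_only = clone_commands("https://example.com/repo.git", "master", "/tmp/repo")
pinned = clone_commands("https://example.com/repo.git", "master", "/tmp/repo",
                        commit="85d61eae4759f46ce21e9a37cd816a7a1a66c9d5")
```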
3 years agoworker: handle CommitNotFoundErrors
Josh Durgin [Tue, 2 Feb 2021 19:19:07 +0000 (14:19 -0500)]
worker: handle CommitNotFoundErrors

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1603 from jdurgin/wip-dispatcher-bug
Josh Durgin [Tue, 2 Feb 2021 16:52:22 +0000 (08:52 -0800)]
Merge pull request #1603 from jdurgin/wip-dispatcher-bug

dispatcher: allow empty os_type for fake config

Reviewed-by: David Galloway <dgallowa@redhat.com>
3 years agodispatcher: allow empty os_type for fake config 1603/head
Josh Durgin [Tue, 2 Feb 2021 15:06:11 +0000 (10:06 -0500)]
dispatcher: allow empty os_type for fake config

This is the same default as reimaging uses,
though it's not too important in the supervisor.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoteuthology-suite: Add job count to priority error msg. 1601/head
Sebastian Wagner [Tue, 26 Jan 2021 12:02:35 +0000 (13:02 +0100)]
teuthology-suite: Add job count to priority error msg.

Don't let users guess the job count.

Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
3 years agoMerge pull request #1546 from ShraddhaAg/add-minimal-dispatcher
Josh Durgin [Thu, 28 Jan 2021 01:28:30 +0000 (17:28 -0800)]
Merge pull request #1546 from ShraddhaAg/add-minimal-dispatcher

Add teuthology-dispatcher

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1599 from ceph/wip-exact-commits
Josh Durgin [Wed, 27 Jan 2021 16:01:59 +0000 (08:01 -0800)]
Merge pull request #1599 from ceph/wip-exact-commits

Use the same version of teuthology and ceph-qa-suite for a whole run

Reviewed-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agoMerge pull request #1598 from SrinivasaBharath/wip-deb-rm
Vasu Kulkarni [Thu, 21 Jan 2021 16:01:35 +0000 (08:01 -0800)]
Merge pull request #1598 from SrinivasaBharath/wip-deb-rm

task/install/redhat: Removing packages based on OS in cleanup task

3 years agoMerge branch 'master' into add-minimal-dispatcher 1546/head
Josh Durgin [Wed, 20 Jan 2021 18:16:12 +0000 (10:16 -0800)]
Merge branch 'master' into add-minimal-dispatcher

3 years agoMerge pull request #1600 from rzarzynski/wip-valgrind-controllable-exit
Josh Durgin [Tue, 19 Jan 2021 19:06:45 +0000 (11:06 -0800)]
Merge pull request #1600 from rzarzynski/wip-valgrind-controllable-exit

teuthology/misc: make the Valgrind's early exit configurable.

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
3 years agoteuthology/misc: make the Valgrind's early exit configurable. 1600/head
Radoslaw Zarzynski [Tue, 19 Jan 2021 13:56:32 +0000 (14:56 +0100)]
teuthology/misc: make the Valgrind's early exit configurable.

This commit is a follow-up to a98eb3e1405c8ca8f6933eb0356c03955e4e2e83
where Valgrind has been configured to exit on first-seen error as it
was (wrongly!) assumed that all components are green when it comes
to the Valgrind verification.
This assumption turned out to be broken for RGW, which accumulated a few
issues as a result of having the Valgrind checks knocked out as a side
effect of the python3 transition [1]. As a consequence, multiple problems
piled up, so a mechanism to disable the early exit (e.g. to compile a
list of these issues) looks desirable.

[1]: https://github.com/ceph/teuthology/pull/1503#issuecomment-762837504

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
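
A configurable early exit might be wired up roughly like this. Valgrind's `--exit-on-first-error=yes` option requires `--error-exitcode` to be set; the helper name, the parameter names, and the exit code value 42 are assumptions for illustration.

```python
def valgrind_args(exit_on_first_error=True, error_exitcode=42):
    """Build a valgrind argument list with an optional early exit."""
    args = ["valgrind", f"--error-exitcode={error_exitcode}"]
    if exit_on_first_error:
        # Stop at the first reported error instead of continuing the run.
        args.append("--exit-on-first-error=yes")
    return args


strict = valgrind_args()
collect_all = valgrind_args(exit_on_first_error=False)
```

Disabling the early exit lets a single run report every error, which suits the goal of compiling a list of outstanding issues.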
3 years agoworker, run: use exact commits for teuthology and qa suite 1599/head
Josh Durgin [Sun, 17 Jan 2021 02:31:33 +0000 (21:31 -0500)]
worker, run: use exact commits for teuthology and qa suite

This ensures we use the same version across all jobs in a run.

We already have suite_sha1 set by older versions of teuthology, but
for folks who haven't updated their suite command, and thus don't set
teuthology_sha1 in the job config, look up the sha1 in the worker.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
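
The fallback can be sketched as below. Only `teuthology_sha1` is named in the commit message; the `teuthology_branch` key, the function name, and the `lookup_branch_sha1` callable are assumptions.

```python
def resolve_teuthology_sha1(job_config, lookup_branch_sha1):
    """Prefer the sha1 recorded in the job; otherwise resolve the branch now."""
    sha1 = job_config.get("teuthology_sha1")
    if sha1:
        return sha1
    # Older suite commands did not record a sha1, so look it up here.
    branch = job_config.get("teuthology_branch", "master")
    return lookup_branch_sha1(branch)


# Hypothetical resolver used only for this example.
fake_lookup = {"master": "85d61eae4759f46ce21e9a37cd816a7a1a66c9d5"}.get

pinned = resolve_teuthology_sha1({"teuthology_sha1": "abc123"}, fake_lookup)
resolved = resolve_teuthology_sha1({"teuthology_branch": "master"}, fake_lookup)
```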
3 years agorepo_utils: allow fetching a specific sha1 to per-commit directories
Josh Durgin [Sun, 17 Jan 2021 02:21:02 +0000 (21:21 -0500)]
repo_utils: allow fetching a specific sha1 to per-commit directories

Using a checkout of a single branch used by potentially many
workers/teuthology processes can result in errors when one job
updates the local branch while another job is reading it.

This causes issues particularly easily when using non-master
teuthology branches, and with the teuthology-dispatcher.

This also allows us to guarantee we're using the same
version across an entire run, even if e.g. the master
qa suite is updated between jobs.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agosuite: pass the commit of teuthology with each job's config
Josh Durgin [Sun, 17 Jan 2021 02:16:59 +0000 (21:16 -0500)]
suite: pass the commit of teuthology with each job's config

We already pass the suite sha1, but do not use it yet. This is the
missing piece to be able to use the same version of everything across
a whole run.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agorepo_utils: allow checking out a specific commit
Josh Durgin [Sun, 17 Jan 2021 02:04:08 +0000 (21:04 -0500)]
repo_utils: allow checking out a specific commit

Since we're cloning a particular branch with a shallow git clone,
assume we're still passed a branch that contains the commit. Otherwise
we'd waste time and space cloning all the branches in the repo.

Assume that this is only used for checking out a particular sha1 once,
to avoid repetitive work.

There are two use cases for these utilities:

1) on the user's machine when they're scheduling a suite - there it
will make sense to maintain a single checkout of a branch e.g. teuthology master

2) on the queue consumer side - here it's best if we use the same commit for an
entire run, so checking out by sha1 makes more sense

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agotask/install/redhat: Removing packages based on OS in cleanup task 1598/head
srinivasabharath [Wed, 13 Jan 2021 06:30:01 +0000 (01:30 -0500)]
task/install/redhat: Removing packages based on OS in cleanup task

Signed-off-by: Bharath <skanta@redhat.com>
3 years agoMerge pull request #1597 from jdurgin/wip-workunit-fingerprint
Josh Durgin [Thu, 14 Jan 2021 21:00:42 +0000 (13:00 -0800)]
Merge pull request #1597 from jdurgin/wip-workunit-fingerprint

exceptions: only use one of label or command for fingerprint

Reviewed-by: Neha Ojha <nojha@redhat.com>
3 years agoexceptions: only use one of label or command for fingerprint 1597/head
Josh Durgin [Wed, 13 Jan 2021 03:33:38 +0000 (22:33 -0500)]
exceptions: only use one of label or command for fingerprint

Commands like those running workunits include the ceph sha1 being
tested, so they're not useful for grouping. This also lets us group
together other tests if we like, for example to map tests with small
differences in configuration to the same fingerprint for sentry.

Also use the plain command: it's already a string at this point,
so there's no reason to add spaces between its characters.
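
A minimal sketch of both points, under stated assumptions: the standalone
function fingerprint and the "exit status" wording are illustrative, not
necessarily what teuthology's exception class does.

```python
def fingerprint(command, exitstatus, label=None):
    """Build a grouping key from the label or the command, never both."""
    # Prefer the label; fall back to the command only when no label is set.
    return [label or command, "exit status {}".format(exitstatus)]


# Joining a string (rather than a list of arguments) with spaces puts a
# space between every character, which is the problem described above.
spaced = " ".join("ls -l")
```

Two workunit failures whose commands differ only in the embedded ceph sha1
map to the same key as long as they carry the same label.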

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge PR #1595 into master
Patrick Donnelly [Tue, 12 Jan 2021 15:31:46 +0000 (07:31 -0800)]
Merge PR #1595 into master

* refs/pull/1595/head:
orchestra: squelch Traceback for expected auth failures

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agoorchestra: squelch Traceback for expected auth failures 1595/head
Patrick Donnelly [Fri, 8 Jan 2021 18:08:04 +0000 (10:08 -0800)]
orchestra: squelch Traceback for expected auth failures

The Traceback clutters the log and messes up greps for Tracebacks.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
3 years agoMerge pull request #1593 from jdurgin/wip-sentry-sdk
Josh Durgin [Wed, 6 Jan 2021 15:46:50 +0000 (07:46 -0800)]
Merge pull request #1593 from jdurgin/wip-sentry-sdk

sentry: use new library and group CommandFailedErrors more finely

Reviewed-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agoexceptions: group CommandFailedErrors in sentry more finely 1593/head
Josh Durgin [Mon, 4 Jan 2021 15:43:54 +0000 (10:43 -0500)]
exceptions: group CommandFailedErrors in sentry more finely

By default sentry uses the stack trace / error type / rough error
message, which ends up with many failures from different workunits
grouped together. Include the actual command run, the exit status, and
the optional label to group these more accurately. This will group
failures of the same workunit together, for example.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agorun_tasks: use new sentry_sdk
Josh Durgin [Mon, 4 Jan 2021 15:39:38 +0000 (10:39 -0500)]
run_tasks: use new sentry_sdk

raven has been deprecated

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1591 from rakeshgm/typo-correct
kyr [Thu, 17 Dec 2020 10:06:36 +0000 (11:06 +0100)]
Merge pull request #1591 from rakeshgm/typo-correct

run_tasks: correct typo tuethology -> teuthology

3 years agorun_tasks: correct typo tuethology -> teuthology 1591/head
rakeshgm [Thu, 17 Dec 2020 08:31:26 +0000 (14:01 +0530)]
run_tasks: correct typo tuethology -> teuthology

Signed-off-by: rakeshgm <rakeshgm@redhat.com>
3 years agoMerge pull request #1590 from lxbsz/task_ship_utilities
Jason Dillaman [Wed, 16 Dec 2020 21:10:24 +0000 (16:10 -0500)]
Merge pull request #1590 from lxbsz/task_ship_utilities

teuthology: run the ship_utilities task only once

Reviewed-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
3 years agoteuthology: run the ship_utilities task only once 1590/head
Xiubo Li [Tue, 15 Dec 2020 02:49:28 +0000 (10:49 +0800)]
teuthology: run the ship_utilities task only once

This will make sure that the utilities won't be removed until the last
user is unwound.
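
The idea can be sketched with a simple user count. The class SharedUtilities
and its install/remove callables are hypothetical and stand in for whatever
the ship_utilities task really does; the sketch is not thread-safe.

```python
import contextlib


class SharedUtilities:
    """Install on first use, remove when the last user exits (sketch)."""

    def __init__(self, install, remove):
        self._install = install
        self._remove = remove
        self._users = 0

    @contextlib.contextmanager
    def use(self):
        # Only the first user actually ships the utilities.
        if self._users == 0:
            self._install()
        self._users += 1
        try:
            yield
        finally:
            # Only the last user to unwind removes them.
            self._users -= 1
            if self._users == 0:
                self._remove()
```

Nested or repeated users then share one installation, and cleanup happens
exactly once, after the outermost user is done.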

Signed-off-by: Xiubo Li <xiubli@redhat.com>
3 years agoMerge pull request #1589 from kshtsk/wip-wait
kyr [Mon, 14 Dec 2020 20:14:48 +0000 (21:14 +0100)]
Merge pull request #1589 from kshtsk/wip-wait

Add teuthology-wait command

3 years agoscripts: add wait script for watching run 1589/head
Kyr Shatskyy [Fri, 11 Dec 2020 17:48:16 +0000 (18:48 +0100)]
scripts: add wait script for watching run

When using teuthology-suite with the --wait option, it is
sometimes useful to split the suite scheduling from
the run waiting. For example, when using tools like
Jenkins we might want to schedule a suite, report
that it was scheduled successfully, and start waiting
only in a later step.

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agosuite: improve info message about waiting the run
Kyr Shatskyy [Fri, 11 Dec 2020 18:56:01 +0000 (19:56 +0100)]
suite: improve info message about waiting the run

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
3 years agoMerge pull request #1584 from kshtsk/wip-quiet-run
Josh Durgin [Fri, 11 Dec 2020 03:31:50 +0000 (19:31 -0800)]
Merge pull request #1584 from kshtsk/wip-quiet-run

orchestra: introduce quiet mode for remote.run

Reviewed-by: Brad Hubbard <bhubbard@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
3 years agoMerge pull request #1588 from lxbsz/install1
kyr [Thu, 10 Dec 2020 00:28:35 +0000 (01:28 +0100)]
Merge pull request #1588 from lxbsz/install1

install: pass the 'shaman' to Shaman class

3 years agoinstall: pass the 'shaman' to Shaman class 1588/head
Xiubo Li [Wed, 9 Dec 2020 15:06:59 +0000 (23:06 +0800)]
install: pass the 'shaman' to Shaman class

Signed-off-by: Xiubo Li <xiubli@redhat.com>
3 years agoMerge pull request #1585 from kshtsk/wip-rocket
kyr [Tue, 8 Dec 2020 16:59:20 +0000 (17:59 +0100)]
Merge pull request #1585 from kshtsk/wip-rocket

teuthology-suite: add Rocket.Chat notification

3 years agoteuthology-suite: add Rocket.Chat notification 1585/head
Kyr Shatskyy [Mon, 30 Nov 2020 20:37:05 +0000 (21:37 +0100)]
teuthology-suite: add Rocket.Chat notification

Add Rocket.Chat notification for sleep before teardown.
For details see https://rocket.chat/

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>