.. _tests-integration-testing-teuthology-debugging-tips:

Analyzing and Debugging A Teuthology Job
========================================

To learn more about how to schedule an integration test, refer to `Scheduling
Test Run`_.

When a teuthology run has been completed successfully, use the `pulpito`_
dashboard to view the results::

    http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/

.. _pulpito: https://pulpito.ceph.com

or ssh into the teuthology server to view the results of the integration
test::

    ssh <username>@teuthology.front.sepia.ceph.com

and access `teuthology archives`_, as in this example::

    nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/

.. note:: This requires you to have access to the Sepia lab. To learn how to
          request access to the Sepia lab, see:
          https://ceph.github.io/sepia/adding_users/

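The pulpito URL and the archive path above follow the same
``<job-name>/<job-id>`` pattern, so both can be derived from the values shown
on pulpito. The following is a minimal sketch of that mapping. The helper names
``pulpito_url`` and ``archive_path`` are hypothetical and are not part of
teuthology; the base URL and the ``/a`` root are taken from the examples above.

```python
from pathlib import PurePosixPath

# Both constants are copied from the examples in this section.
PULPITO_BASE = "http://pulpito.front.sepia.ceph.com"
ARCHIVE_ROOT = PurePosixPath("/a")


def pulpito_url(job_name: str, job_id: str = "") -> str:
    """Return the pulpito URL of a run, or of one job if job_id is given."""
    url = f"{PULPITO_BASE}/{job_name}/"
    if job_id:
        url += f"{job_id}/"
    return url


def archive_path(job_name: str, job_id: str = "") -> PurePosixPath:
    """Return the archive directory of a run, or of one job if job_id is given."""
    path = ARCHIVE_ROOT / job_name
    if job_id:
        path = path / job_id
    return path


run = "teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi"
print(pulpito_url(run, "5759282"))
print(archive_path(run, "5759282"))
```

The archive path is only meaningful on the teuthology server, so the sketch
builds it as a plain POSIX path and does not touch the filesystem.
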
Identifying Failed Jobs
-----------------------

On pulpito, a job in red means either a failed job or a dead job. A job is a
combination of daemons and configurations defined in the yaml fragments in
`qa/suites`_. Teuthology uses these configurations and runs the tasks listed
in `qa/tasks`_, which are commands that set up the test environment and test
Ceph's components. These tasks cover a large subset of use cases and help to
expose bugs not exposed by `make check`_ testing.

.. _make check: ../tests-integration-testing-teuthology-intro/#make-check

A job failure might be caused by one or more of the following reasons:

* environment setup (`testing on varied
  systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
  testing compatibility with stable releases for supported versions.

* permutation of config values: for instance, `qa/suites/rados/thrash
  <https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ ensures
  that we run thrashing tests against Ceph under stressful workloads so that we
  can catch corner-case bugs. The final setup config yaml file used for testing
  can be found at::

      /a/<job-name>/<job-id>/orig.config.yaml

More details about config.yaml can be found at `detailed test config`_.

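To illustrate what "permutation of config values" means, the sketch below
combines one yaml fragment from each of several groups into candidate jobs.
This is a simplified illustration and not teuthology's actual implementation:
the group names and fragment names are made up for the example, and any other
rules teuthology applies when it builds jobs from a suite directory are
ignored.

```python
from itertools import product

# Hypothetical groups of yaml fragments. The names are invented for this
# example and do not mirror the real contents of qa/suites/rados/thrash.
facets = {
    "distro": ["centos.yaml", "ubuntu.yaml"],
    "objectstore": ["bluestore.yaml", "filestore.yaml"],
    "thrasher": ["default.yaml", "pggrow.yaml"],
}


def job_combinations(facets):
    """Return one dict per job, picking exactly one fragment from each group."""
    names = list(facets)
    choices = (facets[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*choices)]


jobs = job_combinations(facets)
print(len(jobs))  # 2 * 2 * 2 = 8 combinations
for job in jobs[:2]:
    print(job)
```

Because every combination becomes its own job, a failure that appears in only
one combination points at the fragment (or pair of fragments) that differs
from the passing jobs.
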
Triaging the cause of failure
-----------------------------

When a job fails, you will need to read its teuthology log in order to triage
the cause of its failure. Use the job's name and id from pulpito to locate your
failed job's teuthology log::

    http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log

The log is also available in the teuthology archives::

    /a/<job-name>/<job-id>/teuthology.log

For example::

    nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log

Every job failure is recorded in the teuthology log as a Traceback and is
added to the job summary.

Find the ``Traceback`` keyword and search the call stack and the logs for
issues that caused the failure. Usually the traceback will include the command
that failed.

.. note:: The teuthology logs are deleted from time to time. If you are unable
          to access the link in this example, just use any other case from
          http://pulpito.front.sepia.ceph.com/

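Searching for ``Traceback`` can also be scripted when a log is long. The
following is a minimal sketch that reports each line containing the keyword
together with a few lines that follow it. The function name
``find_tracebacks`` and the sample log lines are hypothetical; real
teuthology log lines will look different.

```python
from typing import Iterable, List, Tuple


def find_tracebacks(lines: Iterable[str],
                    context: int = 5) -> List[Tuple[int, List[str]]]:
    """Return (line_number, lines) for every line containing 'Traceback'.

    line_number is 1-based; lines holds the matching line followed by up to
    `context` subsequent lines.
    """
    buffered = [line.rstrip("\n") for line in lines]
    hits = []
    for index, line in enumerate(buffered):
        if "Traceback" in line:
            hits.append((index + 1, buffered[index:index + context + 1]))
    return hits


# Made-up sample standing in for the contents of teuthology.log.
sample_log = [
    "INFO: running task\n",
    "ERROR: Traceback (most recent call last):\n",
    "  File \"example.py\", line 1, in run\n",
    "CommandFailedError: command failed\n",
]

for line_number, block in find_tracebacks(sample_log, context=2):
    print(f"Traceback at line {line_number}:")
    for text in block:
        print("   ", text)
```

To scan a real log, pass an open file object instead of the sample list, for
example ``find_tracebacks(open(path))`` with ``path`` pointing at the job's
``teuthology.log``.
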
In short: first check to see if your job failure was caused by a known issue,
and if it wasn't, raise a tracker ticket.

After you have triaged the cause of the failure and have determined that it
wasn't caused by the changes that you made to the code, you might have
encountered a known failure in the upstream branch (in the example we're
considering in this section, the upstream branch is "octopus"). In that case,
go to https://tracker.ceph.com and look for tracker issues related to the
failure by using keywords spotted in the failure under investigation.

If you find a similar issue on https://tracker.ceph.com, leave a comment on
that issue explaining the failure as you understand it and make sure to
include a link to your recent test run. If you don't find a similar issue,
create a new tracker ticket for this issue and explain the cause of your job's
failure as thoroughly as you can. If you're not sure what caused the job's
failure, ask one of the team members for help.

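The keyword search can be turned into a link that you can paste into a
browser or into a comment. The sketch below rests on an assumption not stated
in this document: that https://tracker.ceph.com exposes the usual Redmine
``/search`` page with a ``q`` query parameter. The helper name
``tracker_search_url`` and the example keywords are hypothetical.

```python
from urllib.parse import urlencode

TRACKER_BASE = "https://tracker.ceph.com"


def tracker_search_url(keywords):
    """Build a tracker search URL from keywords spotted in the failure.

    Assumes the standard Redmine search page at /search?q=...; adjust the
    path if the tracker is configured differently.
    """
    query = urlencode({"q": " ".join(keywords)})
    return f"{TRACKER_BASE}/search?{query}"


# Example keywords only; use the ones from your own traceback.
print(tracker_search_url(["ceph_assert", "OSD"]))
```
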
Debugging an issue using interactive-on-error
---------------------------------------------

It is important to be able to reproduce an issue when investigating its cause.
Run a job similar to the failed job, using the `interactive-on-error`_ mode in
teuthology::

    ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block <your-config-yaml> --interactive-on-error

For this job, use either `custom config.yaml`_ or the yaml file from
the failed job. If you intend to use the yaml file from the failed job, copy
``orig.config.yaml`` to your local dir and change the `testing priority`_
accordingly, like so::

    ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
    ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error

In the event of a job failure, teuthology will lock the machines required by
``config.yaml`` and halt at an interactive Python session. By sshing into the
targets, we can investigate their ``ctx`` values. After we have investigated
the system, we can manually terminate the session and let teuthology clean
the session up.

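If you rerun failed jobs often, the command line shown above can be assembled
in a small script. The following sketch only builds and prints the command; it
uses the binary path and flags exactly as they appear in this section, and the
helper name ``interactive_rerun_command`` is hypothetical.

```python
import shlex

# Path and flags are copied from the commands shown in this section.
TEUTHOLOGY_BIN = "./virtualenv/bin/teuthology"


def interactive_rerun_command(config_yaml):
    """Return the argument list for an interactive-on-error rerun."""
    return [
        TEUTHOLOGY_BIN,
        "-v",
        "--lock",
        "--block",
        config_yaml,
        "--interactive-on-error",
    ]


cmd = interactive_rerun_command("test.yaml")
print(shlex.join(cmd))
```

To actually start the run, the list could be passed to
``subprocess.run(cmd, check=True)`` from inside the teuthology checkout on the
teuthology server. Keeping the arguments in a list avoids shell quoting
problems if the config path contains spaces.
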
* `Testing Ceph: Pains & Pleasures <https://www.youtube.com/watch?v=gj1OXrKdSrs>`_

.. _Scheduling Test Run: ../tests-integration-testing-teuthology-workflow/#scheduling-test-run
.. _detailed test config: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html
.. _teuthology archives: ../tests-integration-testing-teuthology-workflow/#teuthology-archives
.. _qa/suites: https://github.com/ceph/ceph/tree/master/qa/suites
.. _qa/tasks: https://github.com/ceph/ceph/tree/master/qa/tasks
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority
.. _thrash: https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash