.. _tests-integration-testing-teuthology-debugging-tips:

Analyzing and Debugging A Teuthology Job
========================================

To learn more about how to schedule an integration test, refer to `Scheduling
Test Run`_.

When a teuthology run has been completed successfully, use the `pulpito`_
dashboard to view the results::

   http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/

.. _pulpito: https://pulpito.ceph.com

Alternatively, ssh into the teuthology server to view the results of the
integration test::

   ssh <username>@teuthology.front.sepia.ceph.com

Then access `teuthology archives`_, as in this example::

   nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/

.. note:: This requires you to have access to the Sepia lab. To learn how to
   request access to the Sepia lab, see:
   https://ceph.github.io/sepia/adding_users/

Identifying Failed Jobs
-----------------------

On pulpito, a job in red means either a failed job or a dead job. A job is a
combination of daemons and configurations defined in the yaml fragments in
`qa/suites`_. Teuthology uses these configurations and runs the tasks listed
in `qa/tasks`_, which are commands that set up the test environment and test
Ceph's components. These tasks cover a large subset of use cases and help to
expose bugs not exposed by `make check`_ testing.

.. _make check: ../tests-integration-testing-teuthology-intro/#make-check

A job failure might be caused by one or more of the following reasons:

* environment setup (`testing on varied
  systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
  testing compatibility with stable releases for supported versions.

* permutation of config values: for instance, `qa/suites/rados/thrash
  <https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ ensures
  that we run thrashing tests against Ceph under stressful workloads so that we
  can catch corner-case bugs. The final setup config yaml file used for testing
  can be found at::

     /a/<job-name>/<job-id>/orig.config.yaml

More details about config.yaml can be found at `detailed test config`_.

Triaging the cause of failure
------------------------------

When a job fails, you will need to read its teuthology log in order to triage
the cause of its failure. Use the job's name and id from pulpito to locate your
failed job's teuthology log::

   http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log

On the teuthology server, the same log is in the archive at::

   /a/<job-name>/<job-id>/teuthology.log

For example::

   nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log

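The locations used on this page follow a fixed pattern, so they can be derived
from a job's name and id. The following is a minimal sketch in Python. The
helper name ``job_locations`` is hypothetical and is not part of teuthology;
the URL and path patterns are copied from this page and may change.

.. code-block:: python

   def job_locations(job_name: str, job_id: str) -> dict:
       """Return where one job's results can be found.

       Hypothetical helper: the patterns below are taken from this page.
       """
       return {
           "pulpito": f"http://pulpito.front.sepia.ceph.com/{job_name}/{job_id}/",
           "log_url": f"http://qa-proxy.ceph.com/{job_name}/{job_id}/teuthology.log",
           "log_path": f"/a/{job_name}/{job_id}/teuthology.log",
           "config_path": f"/a/{job_name}/{job_id}/orig.config.yaml",
       }


   locations = job_locations(
       "teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi",
       "5759282",
   )
   for label, location in locations.items():
       print(f"{label}: {location}")

The paths under ``/a/`` are only meaningful on the teuthology server.
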
Every job failure is recorded in the teuthology log as a Traceback and is
added to the job summary.

Find the ``Traceback`` keyword and search the call stack and the logs for
issues that caused the failure. Usually the traceback will include the command
that failed.

.. note:: The teuthology logs are deleted from time to time. If you are unable
   to access the link in this example, just use any other case from
   http://pulpito.front.sepia.ceph.com/

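Searching a long log for ``Traceback`` can also be scripted. The sketch below
is not part of teuthology: the function name ``find_tracebacks`` and the
default of 15 lines of context are arbitrary choices made for illustration. It
reports each line that contains the keyword, together with the lines that
follow it.

.. code-block:: python

   import sys


   def find_tracebacks(lines, context=15):
       """Yield (line_number, excerpt) for every line containing 'Traceback'.

       The excerpt starts at the matching line and holds at most
       ``context`` lines, which usually covers the call stack.
       """
       lines = list(lines)
       for index, line in enumerate(lines):
           if "Traceback" in line:
               yield index + 1, lines[index:index + context]


   if __name__ == "__main__" and len(sys.argv) > 1:
       # Usage: python find_tracebacks.py /a/<job-name>/<job-id>/teuthology.log
       with open(sys.argv[1], errors="replace") as log:
           for number, excerpt in find_tracebacks(log):
               print(f"--- Traceback at line {number} ---")
               print("".join(excerpt), end="")

The whole log is read into memory, which keeps the sketch short. For very
large logs, a streaming approach would be a better fit.
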
In short: first check to see if your job failure was caused by a known issue,
and if it wasn't, raise a tracker ticket.

If you have triaged the cause of the failure and determined that it was not
caused by the changes that you made to the code, you might have encountered a
known failure in the upstream branch (in the example we're considering in this
section, the upstream branch is "octopus"). In that case, go to
https://tracker.ceph.com and look for tracker issues related to the failure by
using keywords spotted in the failure under investigation.

If you find a similar issue on https://tracker.ceph.com, leave a comment on
that issue explaining the failure as you understand it and make sure to
include a link to your recent test run. If you don't find a similar issue,
create a new tracker ticket for this issue and explain the cause of your job's
failure as thoroughly as you can. If you're not sure what caused the job's
failure, ask one of the team members for help.

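The keyword search on the tracker can be prepared as a URL. This is only a
sketch under assumptions: it assumes that https://tracker.ceph.com exposes the
search page that Redmine normally provides, with the ``q`` and ``issues``
query parameters. The function name ``tracker_search_url`` and the example
keywords are hypothetical.

.. code-block:: python

   from urllib.parse import urlencode

   TRACKER = "https://tracker.ceph.com"


   def tracker_search_url(keywords):
       """Build a search URL from keywords spotted in a failure.

       Assumption: the tracker accepts Redmine-style ``/search`` queries.
       """
       query = " ".join(keywords)
       return f"{TRACKER}/search?{urlencode({'q': query, 'issues': 1})}"


   # Hypothetical keywords, chosen only to show the shape of the result.
   print(tracker_search_url(["CommandFailedError", "ceph_test_rados"]))

Opening the printed URL in a browser should list matching issues if the
assumption about the search page holds.
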
Debugging an issue using interactive-on-error
---------------------------------------------

When you encounter a job failure during testing, you should attempt to
reproduce it. This is where ``--interactive-on-error`` comes in. This
section explains how to use ``interactive-on-error`` and what it does.

When you have verified that a job has failed, run the same job again in
teuthology but add the `interactive-on-error`_ flag::

   ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block <your-config-yaml> --interactive-on-error

Use either `custom config.yaml`_ or the yaml file from the failed job. If
you use the yaml file from the failed job, copy ``orig.config.yaml`` to
your local directory::

   ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
   ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error

If a job fails when the ``interactive-on-error`` flag is used, teuthology
will lock the machines required by ``config.yaml``. Teuthology will halt
the testing machines and hold them in the state that they were in at the
time of the job failure. You will be put into an interactive python
session. From there, you can ssh into the system to investigate the cause
of the failure.

After you have investigated the failure, just terminate the session.
Teuthology will then clean up the session and unlock the machines.

* `Testing Ceph: Pains & Pleasures <https://www.youtube.com/watch?v=gj1OXrKdSrs>`_
* `Teuthology Training <https://www.youtube.com/playlist?list=PLrBUGiINAakNsOwHaIM27OBGKezQbUdM->`_
* `Intro to Teuthology <https://www.youtube.com/watch?v=WiEUzoS6Nc4>`_

.. _Scheduling Test Run: ../tests-integration-testing-teuthology-workflow/#scheduling-test-run
.. _detailed test config: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html
.. _teuthology archives: ../tests-integration-testing-teuthology-workflow/#teuthology-archives
.. _qa/suites: https://github.com/ceph/ceph/tree/master/qa/suites
.. _qa/tasks: https://github.com/ceph/ceph/tree/master/qa/tasks
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority
.. _thrash: https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash