.. _tests-integration-testing-teuthology-debugging-tips:

Analyzing and Debugging A Teuthology Job
========================================

To learn more about how to schedule an integration test, refer to `Scheduling
Test Run`_.

When a teuthology run has completed successfully, use the `pulpito`_ dashboard
to view the results::

   http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/

.. _pulpito: https://pulpito.ceph.com

or ssh into the teuthology server::

    ssh <username>@teuthology.front.sepia.ceph.com

and access the `teuthology archives`_, for example:

.. prompt:: bash $

   ls /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/

.. note:: This requires you to have access to the Sepia lab. To learn how to
          request access to the Sepia lab, see:
          https://ceph.github.io/sepia/adding_users/

On pulpito, a job shown in red is either a failed job or a dead job.
A job is a combination of daemons and configurations that is formed from
`qa/suites`_ yaml fragments.
Teuthology uses these configurations and runs the tasks that are present in
`qa/tasks`_, which are commands used for setting up the test environment and
testing Ceph's components.
These tasks cover a large subset of use cases and help to
expose the bugs that aren't caught by `make check`_ testing.

.. _make check: ../tests-integration-testing-teuthology-intro/#make-check

A job failure might be caused by one or more of the following reasons:

* environment setup (`testing on varied
  systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
  testing compatibility with stable releases for supported versions.

* permutation of config values: for instance, `qa/suites/rados/thrash
  <https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ runs
  thrashing tests against Ceph under stressful workloads, so that we
  are able to catch corner-case bugs. The final setup config yaml used for
  testing can be accessed at::

    /a/<job-name>/<job-id>/orig.config.yaml

More details about config.yaml can be found at `detailed test config`_.

Triaging the cause of failure
------------------------------

To triage a job failure, open its teuthology log, using the job name and the
job id from the pulpito page. The log can be viewed in a browser::

   http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log

or opened as a file on the teuthology server::

   /a/<job-name>/<job-id>/teuthology.log

For example, in our case::

  nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log
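
The pulpito page, the log URL, and the archive path all follow one pattern.
As a minimal sketch, the helper below builds the three locations from a job
name and a job id. The function name ``job_locations`` is hypothetical and is
not part of teuthology; the host names and the ``/a`` prefix are the ones
quoted on this page.

.. code-block:: python

   PULPITO = "http://pulpito.front.sepia.ceph.com"
   QA_PROXY = "http://qa-proxy.ceph.com"
   ARCHIVE = "/a"


   def job_locations(job_name, job_id):
       """Return the pulpito page, the log URL, and the archived log path."""
       job = f"{job_name}/{job_id}"
       return {
           "pulpito": f"{PULPITO}/{job}/",
           "log_url": f"{QA_PROXY}/{job}/teuthology.log",
           "log_path": f"{ARCHIVE}/{job}/teuthology.log",
       }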

A job failure is recorded in the teuthology log as a ``Traceback`` and is
added to the job summary.

To analyze a job failure, locate the ``Traceback`` keyword and examine the call
stack and logs for issues that caused the failure. Usually the traceback
will include the command that failed.
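
Teuthology logs can be long, so it may help to pull out only the regions around
each ``Traceback``. The following is a minimal sketch of that search, not a
teuthology tool: the function name ``find_tracebacks`` and the default context
sizes are arbitrary choices made for illustration.

.. code-block:: python

   from collections import deque
   from itertools import islice


   def find_tracebacks(log_path, before=5, after=20):
       """Yield (line_number, context_lines) for each 'Traceback' in a log."""
       recent = deque(maxlen=before)  # the last few lines seen before a match
       with open(log_path, errors="replace") as log:
           numbered = enumerate(log, start=1)
           for number, line in numbered:
               if "Traceback" in line:
                   context = list(recent) + [line]
                   # Consume at most `after` following lines as trailing context.
                   context += [text for _, text in islice(numbered, after)]
                   recent.clear()
                   yield number, context
               else:
                   recent.append(line)

Iterating over ``find_tracebacks("/a/<job-name>/<job-id>/teuthology.log")`` and
printing ``"".join(context)`` for each hit shows the call stack together with
the log lines just before it, which is usually where the failed command
appears.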

.. note:: Teuthology logs are deleted periodically. If you are unable to
          access the example link, feel free to refer to any other case from
          http://pulpito.front.sepia.ceph.com/

Reporting the Issue
-------------------

If, after triaging the failure, you have determined that it was not caused by
the developer's code change, it might be a known failure for the upstream
branch (in our case, the upstream branch is octopus). In that case, go to
https://tracker.ceph.com and look for tracker issues related to the failure,
using keywords spotted in the failure under investigation.
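
A keyword search can also be prepared as a URL. The sketch below assumes that
tracker.ceph.com serves the standard Redmine search page at ``/search`` with a
``q`` query parameter; that path is an assumption and should be checked in a
browser. The helper name ``tracker_search_url`` is hypothetical.

.. code-block:: python

   from urllib.parse import urlencode

   TRACKER = "https://tracker.ceph.com"


   def tracker_search_url(keywords):
       """Build a tracker search URL from keywords taken from the failure."""
       # Assumption: Redmine's stock search page, /search?q=<terms>.
       return f"{TRACKER}/search?" + urlencode({"q": " ".join(keywords)})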

If a similar issue has already been reported in a tracker.ceph.com ticket, add
a link to the new test run and any relevant feedback to that ticket. If you
don't find a ticket referring to an issue similar to the one that you have
discovered, create a new tracker ticket for it. If you are not familiar with
the cause of failure, ask one of the team members for help.

Debugging an issue using interactive-on-error
---------------------------------------------

It is important to be able to reproduce an issue when investigating its cause.
Run a job similar to the failed job, using the `interactive-on-error`_ mode in
teuthology::

    ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block <your-config-yaml> --interactive-on-error

For this job, use either a `custom config.yaml`_ or the yaml file from
the failed job. If you intend to use the yaml file from the failed job, copy
``orig.config.yaml`` to your local directory and change the `testing priority`_
accordingly, like so::

    ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
    ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error
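
The copy and the priority change can be done in one step. The sketch below is
one possible way to do it and makes an assumption that should be verified
against your own file: that ``priority`` appears in ``orig.config.yaml`` as an
unindented, top-level key. The helper name ``copy_config_with_priority`` is
hypothetical, and the value to pass should follow the `testing priority`_
guidelines.

.. code-block:: python

   import re
   import shutil


   def copy_config_with_priority(src, dst, priority):
       """Copy a job config to ``dst`` and set its top-level ``priority`` key."""
       shutil.copyfile(src, dst)
       with open(dst) as config:
           text = config.read()
       line = f"priority: {priority}"
       # Assumption: ``priority`` is an unindented, top-level key.
       text, count = re.subn(r"^priority:.*$", line, text, flags=re.MULTILINE)
       if count == 0:
           # No existing key: append one, keeping the file newline-terminated.
           if text and not text.endswith("\n"):
               text += "\n"
           text += line + "\n"
       with open(dst, "w") as config:
           config.write(text)

The original file under ``/a`` is left untouched; only the local copy is
rewritten.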

In the event of a job failure, teuthology keeps the machines required by
``config.yaml`` locked and halts at an interactive Python session.
By sshing into the targets and inspecting the ``ctx`` values in that session,
we can investigate the failure. After we have investigated the system, we can
manually terminate the session and let teuthology clean the session up.

Suggested Resources
--------------------

  * `Testing Ceph: Pains & Pleasures <https://www.youtube.com/watch?v=gj1OXrKdSrs>`_

.. _Scheduling Test Run: ../tests-integration-testing-teuthology-workflow/#scheduling-test-run
.. _detailed test config: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html
.. _teuthology archives: ../tests-integration-testing-teuthology-workflow/#teuthology-archives
.. _qa/suites: https://github.com/ceph/ceph/tree/master/qa/suites
.. _qa/tasks: https://github.com/ceph/ceph/tree/master/qa/tasks
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority
.. _thrash: https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash