Analyzing and Debugging a Teuthology Job
========================================
To learn more about how to schedule an integration test, refer to
`Scheduling Test Run`_.
When a teuthology run has been completed successfully, use the `pulpito`_
dashboard to view the results::

    http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/

.. _pulpito: https://pulpito.ceph.com

or ssh into the teuthology server::

ssh <username>@teuthology.front.sepia.ceph.com

and access the `teuthology archives`_, for example:

.. prompt:: bash $

   nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/

.. note:: This requires you to have access to the Sepia lab. To learn how to
   request access to the Sepia lab, see:
https://ceph.github.io/sepia/adding_users/

On pulpito, jobs in red indicate either a failed job or a dead job. A job is
a combination of daemons and configurations that are formed from
`qa/suites`_ yaml fragments.
Teuthology uses these configurations and runs the tasks that are present in
`qa/tasks`_, which are commands used for setting up the test environment and
testing Ceph's components.
These tasks cover a large subset of use cases and help to expose bugs that
are not caught by `make check`_ testing.

.. _make check: ../tests-integration-testing-teuthology-intro/#make-check

A job failure might be caused by one or more of the following:

* environment setup (`testing on varied
  systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
  testing compatibility with stable releases for supported versions.

* permutation of config values: for instance, `qa/suites/rados/thrash
  <https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ runs
  thrashing tests against Ceph under stressful workloads, so that we can
  catch corner-case bugs. The final config yaml used for the test run can be
  accessed at::

/a/<job-name>/<job-id>/orig.config.yaml

More details about config.yaml can be found at `detailed test config`_.
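
To give a sense of its shape, here is a heavily trimmed, hypothetical sketch
of such a yaml (the roles and tasks shown are illustrative, not taken from a
real run)::

    roles:
    - [mon.a, mgr.x, osd.0, osd.1, client.0]
    tasks:
    - install:
    - ceph:
    - rados:
        clients: [client.0]
        ops: 4000
        objects: 500
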
Triaging the cause of failure
------------------------------
To triage a job failure, open its teuthology log using either of the
following two methods.

From the pulpito page, open::

    http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log

Or, on the teuthology server, open the log file at::

    /a/<job-name>/<job-id>/teuthology.log

for example::

    nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log

A job failure is recorded in the teuthology log as a Traceback and is added
to the job summary.

To analyze a job failure, locate the ``Traceback`` keyword in the log and
examine the call stack and the surrounding log messages for the cause of the
failure. Usually the traceback includes the command that failed.
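
If you have shell access to the archives, a quick way to locate the
traceback is with ``grep``; the path below reuses the example job from
above:

.. prompt:: bash $

   grep -n -m1 -A 15 Traceback /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log
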

.. note:: Teuthology logs are deleted periodically. If you are unable to
   access the example links, refer to any other job from
   http://pulpito.front.sepia.ceph.com/

Reporting the Issue
-------------------
After you have triaged the cause of the failure and determined that it was
not caused by the developer's code change, the failure might be a known
failure for the upstream branch (in our case, the upstream branch is
octopus). In that case, search https://tracker.ceph.com for tracker issues
related to the failure, using keywords from the failure under investigation.

If a similar issue has been reported via a tracker.ceph.com ticket, add a
link to the new test run and any relevant feedback to that ticket. If you
don't find a ticket describing an issue similar to the one that you have
discovered, create a new tracker ticket for it. If you are not familiar with
the cause of the failure, ask one of the team members for help.

Debugging an issue using interactive-on-error
---------------------------------------------
It is important to be able to reproduce an issue when investigating its
cause. Run a job similar to the failed job, using the
`interactive-on-error`_ mode in teuthology::

ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block $<your-config-yaml> --interactive-on-error

For this job, use either a `custom config.yaml`_ or the yaml file from the
failed job. If you intend to use the yaml file from the failed job, copy
``orig.config.yaml`` to your local directory and change the
`testing priority`_ accordingly, like so::

ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error

In the event of a job failure, teuthology will lock the machines required by
``config.yaml`` and will halt at an interactive python session. From there
we can investigate the ``ctx`` values, and we can ssh into the targets to
inspect them. After we have investigated the system, we can manually
terminate the session and let teuthology clean up.

Suggested Resources
--------------------
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority

Testing - Integration Tests
===========================

Ceph has two types of tests: :ref:`make check <make-check>` tests and
integration tests. When a test requires multiple machines, root access, or
lasts for a long time (for example, to simulate a realistic Ceph workload),
it is
deemed to be an integration test. Integration tests are organized into "suites",
which are defined in the `ceph/qa sub-directory`_ and run with the
``teuthology-suite`` command.
nightlies" because the Ceph core developers used to live and work in
the same time zone and from their perspective the tests were run overnight.
The results of nightly test runs are published at http://pulpito.ceph.com/
under the user ``teuthology``. The developer nick appears in the URL of the
test results and in the first column of the Pulpito dashboard. The results
are also reported on the `ceph-qa mailing list <https://ceph.com/irc/>`_.

Testing Priority
----------------
* **200 <= Priority < 1000:** Use this priority for large test runs that can
be done over the course of a week.

To learn how many jobs the ``teuthology-suite`` command will trigger, use
the ``--dry-run`` flag. If you are happy with the number of jobs, issue the
``teuthology-suite`` command again, this time without ``--dry-run`` and with
``-p`` and an appropriate number as an argument.
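
For example, a hypothetical invocation (the machine type, suite, and branch
here are illustrative) might look like this:

.. prompt:: bash $

   teuthology-suite -m smithi -s rados -c master --dry-run
   teuthology-suite -m smithi -s rados -c master -p 500
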

To skip the priority check, use ``--force-priority``. In order to be
sensitive to the runs of other developers who also need to do testing,
please use it in moderation.

Suites Inventory
----------------

The ``suites`` directory of the `ceph/qa sub-directory`_ contains all the
integration tests for all the Ceph components.

`ceph-deploy <https://github.com/ceph/ceph/tree/master/qa/suites/ceph-deploy>`_
install a Ceph cluster with ``ceph-deploy`` (`ceph-deploy man page`_)
`dummy <https://github.com/ceph/ceph/tree/master/qa/suites/dummy>`_
get a machine, do nothing and return success (commonly used to
   verify that the integration testing infrastructure works as expected)
`fs <https://github.com/ceph/ceph/tree/master/qa/suites/fs>`_
test CephFS mounted using FUSE
``teuthology-describe`` was added to the `teuthology framework`_ to facilitate
documentation and better understanding of integration tests.

Tests can be documented by embedding ``meta:`` annotations in the yaml files
used to define the tests. The results can be seen in the
`teuthology-describe usecases`_.
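
For reference, here is a minimal sketch of such an annotation, modeled on
existing fragments (the description text is illustrative)::

    meta:
    - desc: |
       run a basic workload while thrashing OSDs to exercise recovery paths
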
Since this is a new feature, many yaml files have yet to be annotated.

Developers are encouraged to improve the coverage and the quality of the
documentation.

How integration tests are run
-----------------------------
As a new Ceph developer, you will probably not have access to the `Sepia
lab`_. You might, however, be able to run some integration tests in your own
environment. Ask members of the relevant team how to do this.

One option is to set up a teuthology cluster on bare metal. Though this is a
non-trivial task, it `is` possible. Here are `some notes
ssh <username>@teuthology.front.sepia.ceph.com

   This requires Sepia lab access. To request access to the Sepia lab, see:
   https://ceph.github.io/sepia/adding_users/

#. Run the ``teuthology-suite`` command:
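
   The exact invocation depends on what you are testing. A minimal sketch
   (the machine type, branch, suite, and priority here are illustrative)
   might look like this:

   .. prompt:: bash $

      teuthology-suite -m smithi -c wip-username-branch-x -s rados -p 100
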
Other frequently used/useful options are ``-d`` (or ``--distro``),
``--distro-version``, ``--filter-out``, ``--timeout``, ``--flavor``,
``--rerun``, ``-l`` (for limiting the number of jobs), and ``-n`` (for
specifying how many times each job will run). Run ``teuthology-suite
--help`` to read descriptions of these and other options.

.. _teuthology_testing_qa_changes:
You just have to make sure to tell the ``teuthology-suite`` command to use a
separate branch for running the tests.

If you made changes only in ``qa/``
(https://github.com/ceph/ceph/tree/master/qa), you do not need to rebuild
the binaries: you can run your test changes against the existing binaries
that are built periodically for master and other stable branches. To test
your branch with the qa changes, pass two extra arguments to the
``teuthology-suite`` command: (1) ``--suite-repo``, specifying your ceph
repository, and (2) ``--suite-branch``, specifying your branch name.

For example, if you want to make changes in ``qa/`` after testing ``branch-x``
(for which the ceph-ci branch is ``wip-username-branch-x``), run the
following command:

.. prompt:: bash $
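
   # a hypothetical sketch; the suite, machine type, repo URL, and branch are placeholders
   teuthology-suite -s rados -m smithi -c wip-username-branch-x --suite-repo https://github.com/<username>/ceph --suite-branch branch-x
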
Pulpito Dashboard
*****************
After the teuthology job is scheduled, the status and results of the test
run can be checked at https://pulpito.ceph.com/.

Teuthology Archives
*******************
After the tests have finished running, the log for the job can be obtained
by clicking on the job ID at the Pulpito page associated with your tests.
It's more convenient to download the log and then view it rather than
viewing it in an internet browser, since these logs can easily be up to 1 GB
in size. It is easier to ssh into the teuthology machine
(``teuthology.front.sepia.ceph.com``) and access the following path::

/ceph/teuthology-archive/<test-id>/<job-id>/teuthology.log

For example, for the above test ID, the path is::

/ceph/teuthology-archive/teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi/4588482/teuthology.log

This method can be used to view the log more quickly than would be possible
through a browser.

.. note:: To access archives more conveniently, ``/a/`` has been symbolically
   linked to ``/ceph/teuthology-archive/``. For instance, to access the
   previous example, use::

       /a/teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi/4588482/teuthology.log

Killing Tests
-------------
``teuthology-kill`` can be used to kill jobs that have been running
unexpectedly for several hours, or to terminate tests before they complete.

Here is a command that terminates jobs:

.. prompt:: bash $

   teuthology-kill -r teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi

Let's call the argument passed to ``-r`` the "test ID". It can be found
easily in the link to the Pulpito page for the tests you triggered. For