Testing

At Paystone, we have adopted and maintained a culture where no production code is complete without thorough tests. This includes the machine learning team, where thorough test coverage is a requirement of the entire machine learning codebase. Our experiment lifecycle reflects that, as it includes stages for testing every component of a machine learning service from multiple angles.

In this article, we'll focus not on the areas in which testing is important or how we view testing, as the answers to these questions should be reasonably obvious. Instead, we focus on our style of test writing and our principles, some of which are particular to the nuances of Python.

This article can serve as a reference for any machine learning engineer wondering how best to go about writing tests for their code. We make heavy use of examples to illustrate our rules and principles.

You will notice that at the moment, we exclusively cover unit tests. Integration tests are not yet a part of our workflow in the machine learning codebase.

Test Organization

With the information from this section, one should be able to scaffold the test file for any module precisely.

File Structure Mirrors Implementation

The file structure of our tests should exactly match the file structure of the rest of the codebase which is under test. For example, given the following file structure for a top-level directory:

top_level/
    module1/
        a.py
        b.py
    module2/
        c.py
        d.py
        submodule3/
            e.py

The tests folder would contain a top-level directory looking like this:

test_top_level/
    module1/
        test_a.py
        test_b.py
    module2/
        test_c.py
        test_d.py
        submodule3/
            test_e.py

Note the following:

  • Every file has a corresponding test file which has the same parent, modulo some "test_" prefixes.
  • Top level directories and all test files have a "test_" prefix, but other directories do not.
  • There are no __init__.py files.

The top-level directory requiring a "test_" prefix is a quirk of Python's testing tools and their import rules. If the tests for top_level were in a directory named tests/top_level/, they would become undiscoverable, as top_level would be shadowed by the module itself. This is not an issue for child modules or files.

Test files, of course, require the test_ prefix to be discoverable.

Group Tests in Classes

When testing a class, place all tests under a single class whose name is identical to that of the class under test, but with a Test_ prefix. For example, given a file module_with_class.py:

class MyClass:
    def __init__(self) -> None:
        self.a = 0
        self.b = 1

    def method1(self, arg1: str) -> str:
        ...

    def method2(self, arg1: str) -> str:
        ...

A corresponding test file test_module_with_class.py would be structured like so:

class Test_MyClass:
    def test_method1_under_certain_circumstances(self) -> None:
        ...

    def test_method2_under_certain_circumstances(self) -> None:
        ...

When testing a function, again place all tests under a single class whose name is identical to that of the function, but with a Test_ prefix. For example, given a file module_with_function.py:

def my_function(*, a: str, b: int) -> int:
    ...

A corresponding test file test_module_with_function.py would be structured like so:

class Test_my_function:
    def test_under_certain_circumstances(self) -> None:
        ...

Note that in the case of a class, the tests always begin (following the "test_" prefix) with the name of the method.

Mock All Dependencies

A test of a function which calls another function requires a mock of the called function, whether it is from a third-party library, a first-party library, or even the same module. The test should then assert that the called function is being used correctly, and if necessary, mock its output value in order to make assertions on a scenario.

The naming conventions for mock fixtures are as follows:

  • For third-party import dependencies, the name of the mock is the name of the module plus the name of the function, with underscores replacing dots, and full lowercase.
    • For example, requests.get is mocked as requests_get.
  • For first-party import dependencies, the name of the mock is the name of the function.
    • For example, given a function paystone.some_package.some_module.function which is a dependency of function_under_test, the tests for function_under_test would receive a mock fixture called function.
  • For sibling function dependencies, the name of the mock is the name of the function plus the suffix _mock.
    • For example, given two functions f1 and f2 defined in the same file where f2 depends on f1, the tests for f2 would receive a mock fixture called f1_mock.

Let's combine these rules into a single example: a function which calls a third-party dependency, a first-party dependency, and a sibling function. We'll call the module containing this function package.module.

import requests
from paystone.some_package import some_utility


def helper_function(*, a: str) -> str:
    ...


def main_function(*, url: str, a: str) -> str:
    response = requests.get(url)
    utility_output = some_utility(a=a)
    helper_output = helper_function(a=a)
    ...

The tests for this module would require three mock fixtures:

import pytest
from unittest.mock import Mock
from pytest_mock import MockerFixture
from package.module import helper_function, main_function

@pytest.fixture
def requests_get(mocker: MockerFixture) -> Mock:
    return mocker.patch("requests.get")


@pytest.fixture
def some_utility(mocker: MockerFixture) -> Mock:
    return mocker.patch("package.module.some_utility")


@pytest.fixture
def helper_function_mock(mocker: MockerFixture) -> Mock:
    return mocker.patch("package.module.helper_function")
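
With those fixtures in place, a test receives them as arguments, drives main_function, and asserts on each mock. The mechanism behind mocker.patch -- replacing a name where it is looked up, for the duration of the test -- can be sketched with the standard library alone (the stand-in module below is hypothetical, not the real package.module):

```python
import types
from unittest.mock import patch

# A hypothetical stand-in for package.module, for illustration only.
module = types.SimpleNamespace()
module.helper_function = lambda *, a: a.upper()
module.main_function = lambda *, a: "result:" + module.helper_function(a=a)

# mocker.patch("package.module.helper_function") does the equivalent of
# this: swap the name where it is looked up, for the test's duration.
with patch.object(module, "helper_function", return_value="MOCKED") as helper_function_mock:
    output = module.main_function(a="x")

# Assert both that the dependency was used correctly...
helper_function_mock.assert_called_once_with(a="x")
# ...and that its mocked output flowed through the function under test.
assert output == "result:MOCKED"
```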

Contents of a Test

Static Typing Standards Remain

Put simply, there are no differences in the expectations regarding static typing in the context of a test file. Test files must have all signatures fully typed, including for fixtures.

Follow the Branches

When enumerating the test cases for a function or method, all logical branches must be accounted for. Following the branches and enumerating all combinatorial outcomes is generally a good method for ensuring that all possible scenarios are accounted for.

This may also reveal opportunities for refactoring, as a function with too many branches -- which is not generally a desirable quality -- will lead to an explosion of test cases.
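
For instance, a hypothetical function with two independent two-way branches already requires 2 × 2 = 4 scenarios to cover fully:

```python
# A hypothetical function with two independent branches; covering it
# requires 2 x 2 = 4 test scenarios.
def describe(*, flag: bool, count: int) -> str:
    prefix = "on" if flag else "off"
    suffix = "many" if count > 1 else "one"
    return f"{prefix}-{suffix}"

# The four branch combinations, one per test case:
assert describe(flag=True, count=2) == "on-many"
assert describe(flag=True, count=1) == "on-one"
assert describe(flag=False, count=2) == "off-many"
assert describe(flag=False, count=1) == "off-one"
```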

Name Tests After the Scenario

Each test should apply a unique scenario to the function or method. Generally, this is controlled via the arguments, the outputs of dependencies, and for classes, the state of the object under test.

In a clear and concise way, the name of a test should describe the scenario that these variables combine to create. Often, this leads to a set of tests whose names are fairly consistent and differ in predictable, well-formatted ways. When this happens, tests should be sorted logically according to the combinations. As an example:

class Test_my_function:
    def test_a_over_100_b_over_10(self) -> None:
        output = my_function(a=101, b=11)
        ...

    # from this point on in the example we use .pyi stub file style for brevity
    def test_a_over_100_b_under_10(self) -> None: ...
    def test_a_over_100_b_is_10(self) -> None: ...
    def test_a_under_100_b_over_10(self) -> None: ...
    def test_a_under_100_b_under_10(self) -> None: ...
    def test_a_under_100_b_is_10(self) -> None: ...
    def test_a_is_100_b_over_10(self) -> None: ...
    def test_a_is_100_b_under_10(self) -> None: ...
    def test_a_is_100_b_is_10(self) -> None: ...

One could imagine a function whose behaviour depends on the output of a dependency as well as an argument, in which case a test name might be something like test_dependency_returns_greater_than_10_a_over_100.
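
As a concrete (hypothetical) realization, a my_function whose two three-way comparisons produce exactly the nine scenarios enumerated above might look like:

```python
# Hypothetical my_function whose behaviour depends on how a compares to
# 100 and how b compares to 10 -- nine combinations in total.
def my_function(*, a: int, b: int) -> str:
    a_case = "over" if a > 100 else ("under" if a < 100 else "is")
    b_case = "over" if b > 10 else ("under" if b < 10 else "is")
    return f"a_{a_case}_100_b_{b_case}_10"

# One scenario per test name in the class above:
assert my_function(a=101, b=11) == "a_over_100_b_over_10"
assert my_function(a=100, b=10) == "a_is_100_b_is_10"
assert my_function(a=99, b=9) == "a_under_100_b_under_10"
```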

Multiple Assertions Are OK

Because tests prescribe a scenario and not an outcome, they are not explicitly tied to a single assertion. If the function has an output value, there should of course be an assertion on that value. If the function also has dependencies, the arguments with which that dependency was called likely also need to be validated, and this is acceptable to do within the same test.

Expected return values, dependency invocations, and side effects on state are all valid assertions within the context of a single scenario, or test.
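
A minimal sketch of one scenario carrying several assertions, using a hypothetical function with its dependency injected as a Mock:

```python
from unittest.mock import Mock

# Hypothetical function under test: formats a value fetched through a
# dependency (injected here for simplicity; normally patched as above).
def fetch_and_format(*, key: str, fetch: Mock) -> str:
    return f"value={fetch(key)}"

fetch = Mock(return_value=42)
output = fetch_and_format(key="answer", fetch=fetch)

# All of these assertions validate the same scenario,
# so they belong in one test.
assert output == "value=42"
fetch.assert_called_once_with("answer")
```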

Use Parameterization Where Appropriate

Parameterization is a technique whereby a single test body is reused across multiple scenarios: a grid of values is declared alongside the test, and the test is executed once per combination, receiving the values as arguments. Often, one of these values is the expected return value, and others may convey expectations on state or dependencies.

Parameterization is a powerful technique and is encouraged where possible.
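
A minimal sketch with pytest's parametrize, using a hypothetical clamp function (names are illustrative, not from our codebase):

```python
import pytest

# Hypothetical function under test: clamps a value into [0, 10].
def clamp(value: int) -> int:
    return max(0, min(10, value))

# Each row is one scenario; pytest runs the test once per row.
@pytest.mark.parametrize(
    ("value", "expected"),
    [
        (-5, 0),   # below the range
        (3, 3),    # inside the range
        (99, 10),  # above the range
    ],
)
def test_clamp(value: int, expected: int) -> None:
    assert clamp(value) == expected
```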

Stateless

Finally, and perhaps most simply stated: each test should execute independently from other tests. No test's execution should result in a state change which influences the execution of any other test.

This also implies that tests set up and tear down their own data and conditions, and that the order of execution for a set of tests does not matter.
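
As a minimal sketch of a self-contained test (the test name here is hypothetical), the standard library's tempfile handles both setup and teardown, so nothing leaks between tests:

```python
import tempfile
from pathlib import Path

# Hypothetical test: it creates and destroys its own scratch data,
# so the order of execution relative to other tests never matters.
def test_writes_file() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.txt"
        path.write_text("hello")
        assert path.read_text() == "hello"
    # Teardown is automatic: the directory is gone once the test exits.
    assert not Path(tmp).exists()

test_writes_file()
```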