The Experiment Package Structure¶
Our collection of machine learning services is entirely represented in the `experiments` package at the top level of the datascience repository.
Each of these machine learning experiments follows a consistent structure. This structure is represented in two template projects: one predicts housing prices given a handful of tabular features, and the other predicts the sentiment of a piece of text. As natural language and tabular data are our two most common modalities, these are the areas we've chosen as representative samples. However, the structure they embody is intended to be universal across machine learning experiments.
Here, we describe our machine learning experiment package structure, explaining why it is the way that it is. We take a top-down approach, starting at the parent directories and working our way down to individual files.
Before reading this article, you may want to take a moment to find and review the structure of our template experiment package. It may help to leave this open as you continue reading.
After reading this article, you should have a clear understanding of why each file is placed where it is, what it should contain, and how its contents should roughly be laid out.
Example File Tree¶
To start, it helps to visualize the file tree that we are about to discuss. Here we maintain an up-to-date reflection of the file tree for the housing template experiment.
We omit `__init__.py` files for simplicity. These are present at every level in the `experiments` package.
```
experiments/
    housing/
        schema/
            request.py
            response.py
            service.py
        train/
            main.py
            models.py
        serve/
            app.py
        tests/
            acceptance.py
            data.py
            regression.py
            serve/
                test_app.py
        infrastructure/
            hardware.py
            requirements.txt
            packages.txt
```
Schema¶
The schema module contains Pydantic models and enums that represent the input, output, and route structure of the service.
Requests¶
The requests submodule contains two primary data models:

- `Instance`, which is the schema of each observation being passed into the service for prediction.
- `Parameters`, which is the schema of additional parameterization used to configure the serving logic or model artifacts during their prediction routine.

Note that `paystone.serving.schema.max_instances` is a decorator that can be used on the `Instance` data model to restrict the maximum number of instances that can be passed in a request body. This is exemplified in the sentiment template.
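As a rough illustration for the housing example, the request schema might look like the following. This is a minimal sketch: the field names, the default values, and the exact signature of `max_instances` are assumptions, not the actual contents of the template.

```python
# experiments/housing/schema/request.py (illustrative sketch)
from pydantic import BaseModel

from paystone.serving.schema import max_instances


@max_instances(100)  # the decorator's exact signature is an assumption
class Instance(BaseModel):
    # One observation to be priced; field names are purely illustrative.
    square_feet: int
    bedrooms: int


class Parameters(BaseModel):
    # Optional knobs applied to the prediction routine; illustrative field.
    return_confidence_interval: bool = False
```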
Responses¶
The responses submodule contains one primary data model: `Prediction`. This is the schema of the items in the `predictions` field of the service's response. Each `Prediction` data model instance must correspond to an `Instance` from the `instances` field of the request body, in the same order they were presented.
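Continuing the illustrative housing sketch, the response schema might be as simple as the following; the field name is hypothetical.

```python
# experiments/housing/schema/response.py (illustrative sketch)
from pydantic import BaseModel


class Prediction(BaseModel):
    # One prediction per Instance, returned in the same order as the request.
    price: float  # field name is purely illustrative
```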
Service¶
The service submodule contains two schema objects:

- An enum of the route names in the service. The values of this enum must match exactly the keyword argument names passed to `create_serving_app`. This enum is used to ensure that all routes in the schema are provided handlers when creating the serving application.
- A data model containing A/B testing configuration.
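To make the first point concrete, here is a sketch of what such an enum might look like for a service with the two routes used in the Serving example later in this article; the enum name and members are assumptions.

```python
# experiments/housing/schema/service.py (illustrative sketch)
from enum import Enum


class Route(str, Enum):
    # Values must match the keyword argument names passed to create_serving_app,
    # e.g. create_serving_app(large=..., small=...) in the Serving section below.
    LARGE = "large"
    SMALL = "small"
```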
Training¶
The training module is built around the `Experiment` class from `paystone.training.runner`. This class is described in detail in the article on Paystone packages, so here we'll focus on the structure surrounding it.
In general, training modules are built by composing individual functions into a computation graph. That graph's execution is then supplied with a hyperparameter data model, which has default values for each hyperparameter that can be overridden on execution. Each function ("task") in the graph has its output serialized as a `*.artifact` file in the artifacts folder of that experiment run.
When executing a training run locally, artifacts are saved locally. When a training run is deployed, artifacts are saved to Cloud Storage.
Training is executed via the CLI with `psml experiments train`, followed by arguments specifying the experiment and, potentially, overriding hyperparameter values. There is also an option to provide "task overrides": functions that can be inserted in place of functions (or "tasks") in the computation graph on a per-execution basis. For example, you may use this feature to replace a data-fetching task with one that returns a small sample dataset, to run a quicker training process and verify things are working. During a deployed training run, this CLI command is invoked without any hyperparameter overrides or task overrides.

As with all other CLI commands, you can check `psml-cli experiments train --help` for specific details on invoking this command.
There are some requirements of the training module's structure to keep in mind:
- One of the tasks in the experiment must execute (and return the output of) `paystone.training.evaluation.evaluate_models`.
    - This runs the data and regression tests of the experiment against the serving application, as described in an earlier article.
    - The test results artifact that comes from this step is the artifact analyzed by the promotion command that is a central part of the lifecycle. `Experiment().run()` will refuse to execute if a task whose return type is the training test results class has not yet been added.
- As the template projects demonstrate, the training module must have a `main.py` submodule which, when executed as a module (i.e. when `__name__ == "__main__"`), does the following (see the sketch after this list):
    - Creates an `Experiment`.
    - Runs the experiment with `Experiment().run()`.
    - Saves the experiment task results as artifacts with `Experiment().save()`.
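A minimal sketch of such a `main.py`, assuming hypothetical task functions and a hypothetical hyperparameter model defined elsewhere in the training module; the exact `Experiment` constructor arguments are illustrative and may differ from the real `paystone.training.runner` API.

```python
# experiments/housing/train/main.py (illustrative sketch)
from paystone.training.runner import Experiment

# Hypothetical tasks and hyperparameter model, e.g. defined in models.py.
from experiments.housing.train.models import (
    Hyperparameters,  # hypothetical hyperparameter data model with defaults
    fetch_data,       # hypothetical data-fetching task
    build_model,      # hypothetical model-building task
    evaluate,         # task returning the output of evaluate_models
)

if __name__ == "__main__":
    # Compose the tasks into a computation graph; constructor arguments are
    # illustrative only.
    experiment = Experiment(
        tasks=[fetch_data, build_model, evaluate],
        hyperparameters=Hyperparameters(),
    )

    # Run the experiment, then save each task's output as a *.artifact file
    # (locally, or to Cloud Storage for deployed runs).
    experiment.run()
    experiment.save()
```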
Serving¶
The main entrypoint for the Docker container running your serving application is going to be `serve.app`. This is where the handlers for the various endpoints should be combined into a FastAPI application, using the `paystone.serving` package. This means passing each handler to the `paystone.serving.app.create_serving_app` function with the keyword argument name equal to the path for the handler. For example:

```python
app = create_serving_app(large=large_model_handler, small=small_model_handler)
```

would direct requests with `/large` as the path to `large_model_handler`, and `/small` to `small_model_handler`.
As the engineer developing a serving application, you are encouraged to organize your code as you would any other proper software. Avoid over-stuffing a single `app.py` file with too much logic, and expand the `serve` module as needed with submodules organizing your required functionality. This is not reflected in the housing template due to its simplicity, but more complex serving applications may require this level of organization.
The artifacts you require in order to create your handler(s) are added to the signature of each handler as dependencies, via `fastapi.Depends` and `paystone.serving.dependencies.Artifact`. The argument to `Artifact` is one of two things:

- The name of the training step from the same experiment which produced the artifact you wish to load, e.g. `Depends(Artifact("build_model"))`.
- The names of a service, major version, minor version, and artifact separated by dots, composing a full path to an artifact from another experiment, e.g. `Depends(Artifact("housing.v1.m0.build_model"))`.
The remainder of the signature is the request body, which is captured by an initial argument whose type is a `paystone.serving.types.request.Request` with two type parameters: the `Instance` and `Parameters` models from the schema submodule.

For an example of the resulting function signature, see the `app.py` of the housing template.
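As a rough sketch of the shape such a handler takes, under the illustrative schema from earlier: the handler name, the return type, the `instances` attribute access, and the prediction logic are all assumptions; only the `Request`, `Depends`, and `Artifact` usage follows the conventions described above.

```python
from fastapi import Depends

from experiments.housing.schema.request import Instance, Parameters
from experiments.housing.schema.response import Prediction
from paystone.serving.dependencies import Artifact
from paystone.serving.types.request import Request


def housing_handler(
    request: Request[Instance, Parameters],   # request body: instances + parameters
    model=Depends(Artifact("build_model")),   # artifact produced by a training step
) -> list[Prediction]:                        # return type is a guess
    # Hypothetical serving logic: score each instance with the loaded artifact.
    return [
        Prediction(price=model.predict(instance))  # "price" is illustrative
        for instance in request.instances
    ]
```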
Tests¶
Regression Tests¶
Regression tests invoke the serving application with instances and parameters for which the desired outcome is fairly obvious. The result of a regression test is always boolean: it either passes or fails.
Regression tests could be invocations with single instances whose classification should clearly be a particular class. For example, the text "I am so happy right now" having a "positive" label when evaluated by a sentiment service.
Regression tests could also invoke the service with multiple instances whose classifications or point estimates have some clear relationship, such as one instance having a higher probability of belonging to the positive class than the other, or having a higher point estimate. For example, two houses being passed into a pricing service where one is 5000 square feet and has 10 bedrooms while the other is a studio apartment should result in the first house having a higher price.
Regression tests are so named because a service should never regress to the point of failing any of these tests. If it does, as discussed in the experiment lifecycle, deployment will fail.
The signature of a regression test is always the same: it takes a single argument, the application client, and defines its own data. The test should not assert the boolean result, but rather return it.
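For instance, a regression test for the sentiment example above might look roughly like the following; the route, payload shape, and response shape are assumptions about the serving application, not the actual template.

```python
def test_obviously_positive_text(client) -> bool:
    # The test defines its own data; route and payload shape are illustrative.
    response = client.post(
        "/predict",
        json={"instances": [{"text": "I am so happy right now"}], "parameters": {}},
    )

    # Return (rather than assert) the boolean result.
    return response.json()["predictions"][0]["label"] == "positive"
```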
Data Tests¶
Data tests invoke the serving application with instances and parameters that are drawn from the hold-out test set that should be generated during training. While training and validation sets are used during experimentation, the hold-out test set is reserved for executing these tests.
Data tests must produce a scalar metric over the predictions. This metric should have a "direction" indicating whether higher or lower scores are preferred for the metric. This is notated by one of two decorators on the function: `@goal.maximize` or `@goal.minimize`.
Data tests are evaluated against the incumbent version with some slight leniency. Newer versions do not have to score better than the previous version, but they cannot score worse beyond a certain threshold. That threshold exists to account for statistically insignificant variance; if one version scores 98.2% on 10,000 instances, and the next version scores 98.1% on 10,000 instances, the test is still considered passed.
The signature of a data test is always the same, aside from the data types. It takes 3 arguments, which are the application client, the input test data ("X_test"), and the target test data ("y_test"). The types of these data arguments may differ depending on the problem domain.
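A data test for the housing example might look roughly like this; the import path of the `goal` decorators, the client API, the payload shape, and the use of pandas for the test data are all assumptions.

```python
from sklearn.metrics import mean_absolute_error  # assumes scikit-learn is a dependency

from paystone.testing import goal  # import path is an assumption


@goal.minimize
def test_mean_absolute_error(client, X_test, y_test) -> float:
    # Invoke the serving application with the hold-out test set; assumes X_test
    # is a pandas DataFrame, and the route/payload shape shown is illustrative.
    response = client.post(
        "/predict",
        json={"instances": X_test.to_dict(orient="records"), "parameters": {}},
    )
    predictions = [p["price"] for p in response.json()["predictions"]]  # "price" is illustrative

    # Return a scalar metric; lower is better, hence @goal.minimize.
    return mean_absolute_error(y_test, predictions)
```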
Acceptance Tests¶
Acceptance tests invoke the deployed revision of the service with custom-generated requests, at a velocity and cadence intended to stretch the limits of the infrastructure. The goal is to measure the performance of the infrastructure running the service against its established SLAs.
The configuration of acceptance tests is contained entirely in one data model, with one of its fields being a callback. In general, the specification requires two things:
- A callback for generating requests, which takes no arguments and returns a list of `Request[Instance, Parameters]` objects following the experiment schema.
- Parameterization for the rate at which the service is invoked.
The parameterization works as follows. The acceptance testing framework executes consecutive "stages", where a stage makes requests on behalf of a number of "virtual users" for a given amount of time, with each user taking a given amount of time between requests. An arbitrary number of stages can be created. The number of virtual users in the stage, the duration of the stage, and the time between requests for each virtual user are all customizable within the acceptance testing data model.
The choices made in parameterizing the acceptance test should reflect the SLAs for the service.
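The sketch below shows roughly how such a configuration might read. The acceptance-test data model lives in the Paystone packages, so the `Stage` and `AcceptanceTestConfig` classes here are stand-ins with hypothetical names and fields, defined locally only to keep the sketch self-contained; the `Request` constructor usage is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable

from experiments.housing.schema.request import Instance, Parameters
from paystone.serving.types.request import Request


# Stand-in classes so this sketch is self-contained; the real data model may
# differ in both names and shape.
@dataclass
class Stage:
    virtual_users: int               # number of simulated users in this stage
    duration_seconds: int            # how long the stage runs
    seconds_between_requests: float  # pause each virtual user takes between requests


@dataclass
class AcceptanceTestConfig:
    generate_requests: Callable[[], list]  # callback producing the request bodies
    stages: list[Stage] = field(default_factory=list)


def generate_requests() -> list[Request[Instance, Parameters]]:
    # Takes no arguments and returns requests following the experiment schema;
    # the Instance fields and Request constructor are illustrative.
    return [
        Request(
            instances=[Instance(square_feet=1200, bedrooms=2)],
            parameters=Parameters(),
        )
    ]


# Ramp from a light load to a heavier one; the numbers should reflect the SLAs.
config = AcceptanceTestConfig(
    generate_requests=generate_requests,
    stages=[
        Stage(virtual_users=5, duration_seconds=60, seconds_between_requests=1.0),
        Stage(virtual_users=50, duration_seconds=120, seconds_between_requests=1.0),
    ],
)
```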
Serving Tests¶
Serving tests are unit tests of the serving application's endpoint handlers. They should mock out all artifacts. The goal of a serving test is not to evaluate the performance of the service mathematically; it is to ensure the robustness of the service by testing its serving logic.
Standard testing practices apply here. Tests should cover as much code as possible by following all branches, and account for as many edge cases as possible. This may often mean creating multiple mocks of the relevant artifacts that generate different predictions, to evaluate how the serving logic handles these cases. Unexpected predictions are often the primary source of trouble for serving applications. As a guiding question, you may start by asking yourself, "what happens if my model prediction is NaN, or zero, or negative?".
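A serving test along these lines might look roughly like the following, reusing the hypothetical handler sketched earlier; the handler name, schema fields, and the expected behaviour on a NaN prediction are all assumptions.

```python
import math
from unittest.mock import MagicMock

import pytest

from experiments.housing.schema.request import Instance, Parameters
from experiments.housing.serve.app import housing_handler  # hypothetical handler name
from paystone.serving.types.request import Request


def test_handler_rejects_nan_predictions():
    # Mock out the artifact so no real model is loaded; the prediction value
    # exercises an edge case in the serving logic.
    mock_model = MagicMock()
    mock_model.predict.return_value = math.nan

    request = Request(
        instances=[Instance(square_feet=1200, bedrooms=2)],  # illustrative fields
        parameters=Parameters(),
    )

    # The expected behaviour (raising on a NaN prediction) is illustrative;
    # assert whatever your serving logic is supposed to do in this case.
    with pytest.raises(ValueError):
        housing_handler(request, model=mock_model)
```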
Infrastructure¶
`hardware.py` contains, in a single data model, the specification for both the VM used for training and the group of VMs used for serving.
In the training section you will find customization for vCPUs, memory, GPUs, and disk. In the serving section you will find all the same, as well as configuration for autoscaling, request handling, and Gunicorn deployment.
One parameter that is easy to overlook, but could be critical to acceptance testing success, is `gunicorn_workers`. This is the number of workers deployed on each instance within the instance group. Each worker has its own copy of the artifacts in memory, meaning you will load all models once per worker. When running a large language model, you may be significantly limited, likely even to only 1 worker, because you can only fit one copy of the model in memory (including GPU memory). However, with smaller-footprint models, especially ones that don't require GPU memory such as random forests, tuning `gunicorn_workers` to a higher number can significantly increase throughput on the service.
For more details on what each parameter does, check the field descriptions on the data models in `paystone.compute.types.infrastructure`. If we've been following our own Style Guide, these models should be well documented.
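For orientation, a `hardware.py` might read something like the sketch below. Every class and field name shown here, other than `gunicorn_workers`, is an assumption about `paystone.compute.types.infrastructure`; consult the field descriptions mentioned above for the real names.

```python
# experiments/housing/infrastructure/hardware.py (illustrative sketch; class
# and field names other than gunicorn_workers are assumptions)
from paystone.compute.types.infrastructure import Hardware, Serving, Training

hardware = Hardware(
    training=Training(
        vcpus=8,
        memory_gb=32,
        gpus=1,
        disk_gb=100,
    ),
    serving=Serving(
        vcpus=4,
        memory_gb=16,
        gpus=0,
        disk_gb=50,
        min_instances=1,      # autoscaling bounds
        max_instances=5,
        gunicorn_workers=4,   # a CPU-only model can usually afford several workers
    ),
)
```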
`requirements.txt` is a standard Python dependencies file where any dependencies specific to the experiment should be included. Dependencies should be anchored to as specific a version as possible, rather than providing a range; for example, preferring `==1.1.2` to `>=1,<2`.
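For example, an experiment's `requirements.txt` might look like this; the packages and versions are purely illustrative.

```
pandas==2.1.4
scikit-learn==1.3.2
xgboost==2.0.3
```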
`packages.txt` is similar to `requirements.txt`, but it contains system packages, or Linux dependencies. Often, Python libraries for machine learning have dependencies on C code that is contained in a Linux package available through APT. Any such dependencies should be listed in `packages.txt`. As with `requirements.txt`, version anchors should be as specific as possible.
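For example, a `packages.txt` might contain entries like the following; the package names and versions are illustrative, and the exact pinning syntax accepted by `packages.txt` is an assumption (APT's `name=version` form is shown).

```
libgomp1=12.2.0-14
libpq-dev=15.3-0+deb12u1
```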