ML Experiment Lifecycle¶
At a certain level of abstraction, the process of developing machine learning solutions is well defined and repeatable. This high-level sequence of steps makes launching new services and improving existing ones as frictionless as possible.
In this article, we walk through the steps of the machine learning experiment lifecycle. We identify the high-level tasks involved, describe their significance, and highlight the underlying principles that motivate their inclusion in the process. The goal for anyone reading this is to come away with a principled understanding of how machine learning development happens. For product and engineering leaders in particular, this will provide a better understanding of the timeline of service development.
Being a conceptual article, we avoid here the technical details of implementing the lifecycle. The Processes section of this documentation covers those specifics, demonstrating practically how to carry out an experiment.
The Role of the Benchmark Revision¶
When creating a brand new machine learning service, we have no basis on which to evaluate the results of experiments. A benchmark is required in order to bootstrap the iterative improvement framework for service revisions.
This benchmark need not be complicated; in fact, it should deliberately not be heavily invested in. A benchmark revision should take hours to architect -- in terms of training and serving logic -- not days or weeks. Its function is to give the engineer a basis for understanding what would happen if we put no effort into training a more intelligent service. The benchmark revision is never promoted to receive production traffic.
When creating a new service, we always begin by constructing a benchmark revision as a "V0". Along with this comes the development of the many components which exist at the major version level; these are the components which will be shared by all future revisions, until a new major version is released.
We start by describing the steps of the lifecycle that are unique to this initial phase. We then move to the phases that are unique to the revisions of an existing service. At a certain point, the lifecycles unify, and this is where we describe the shared steps of all machine learning experiment lifecycles.
For New Services¶
The Service Specification¶
A service specification is a meeting point between specializations. It transfers knowledge of the problem, its role in the product, its potential pitfalls, and its expected behaviour and performance, from the people who understand these things best. All of this knowledge is critical in developing an optimal machine learning service. It guides our algorithm selection, our schema construction, and our tests. There is a template for this specification.
It is the joint responsibility of product leaders and machine learning engineers to ensure a thorough and meaningful service specification is created as the outcome of this step.
Once obtained, this document should be a constant reference for the engineer throughout the lifecycle. Some practical questions to prompt this usage:
- Will this candidate algorithm meet the latency SLA?
- Will this candidate algorithm meet the throughput SLA?
- Do my data and regression tests accurately reflect the properties of expected requests to the service?
- Do my serving tests reflect the expected edge cases?
- Do my load tests reflect the SLAs given?
- Does my prediction schema provide all the information needed by the service's consumers?
- Is my prediction schema easy to use without extra transformations?
The Service Schema¶
The schema of a service influences all downstream decisions; it is critical to get right. Because our model versioning scheme is centered on schema changes, and both the speed of our experiment lifecycle and the usability of our API hinge on major version stability, schema changes are among our most expensive operations.
How do we design an effective schema for a new service?
The service specification is very important here. By examining the intended use case of the service by its expected consumers, we should be able to identify all of the information that the service is required to transmit.
As a rule of thumb, we should err on the side of providing more information. While it would be wasteful to expose deep internals of the models at play as part of a prediction schema, any information that seems potentially useful to the consumer should be seriously considered. For example, if the model is a multi-class classifier, exposing the entire class distribution is almost always preferable to a point estimate (simply the name of the majority class).
Because features are retrieved via internal mechanisms, the "Instance" portion of the input schema should be fairly simple to construct. In most cases it is simply identifiers of the relevant entities for a prediction, or perhaps a text input for natural language services.
The "Parameters" portion of the input schema is a place to include any configuration parameters that are intended to be exposed to the consumer, to tune model behaviour. This may include hyperparameters of the inference process; for example, if the length of generated text was to be exposed to the consumer, this would be the place. Parameters are shared by the isntances in a request.
Business Metrics¶
During experimentation, models may be measured with evaluation metrics that compare their predictions against hold-out data with known ground truth. These are not the metrics we define at this stage of the process. Business metrics are the metrics we will continually monitor as the service makes predictions in production, in order to measure the success of its effects on the product.
Choosing business metrics is about balancing the needs of the product with the availability of data and the client's experience. It is also about establishing a source of ground truth. Business metrics can give a voice to both the clients and their customers when it comes to machine learning development, by building metrics around the client's experience, or the impact on the customer.
The main tradeoff in choosing metrics is proximity to target outcomes versus data fidelity. The target outcome of a machine learning service is often difficult to measure, either because it is somewhat intangible, or because its measurement is beyond the reach of the business. For example, increased customer happiness is a target of many machine learning services, but 1) measuring happiness accurately is nearly impossible, and 2) even if we had defined such a measurement, we lack the interfaces with the customer to make it. We must take a step back, decreasing our proximity to that target outcome, in order to gain data fidelity -- to be able to capture a metric at all, and to trust its output.
Sometimes increasing our proximity to a target outcome is possible through product functionality or other dependencies on the broader organization. During this phase of the machine learning lifecycle, the machine learning engineer should be in close contact with product leaders in order to probe for the possibility of obtaining this data. If a product feature or enhancement can be completed in order to gain closer proximity to a target outcome, or to increase the fidelity of data measuring a particular outcome, this needs to be communicated to the appropriate product leader so that this work can be prioritized accordingly.
Critically, the machine learning lifecycle should not continue until business metrics have been established and peer reviewed. These will guide the entire remainder of the life of the service -- up to a major version iteration -- so making any decisions while these metrics are still under development is potentially wasted effort. If a metric is viewed as high priority among the suite of metrics, but it requires data which is not currently being ingested, the lifecycle should be put on hold in order to resolve that dependency. If the metric is viewed as lower priority among the suite of metrics, it may be added in a later iteration.
Data and Regression Tests¶
Data tests compute evaluation metrics against a hold-out set of historical data, resulting in a scalar output. Regression tests are pass/fail tests constructed using hand-crafted examples of expected requests to the service.
The data test results are used when evaluating a new revision, as a component of the process to determine whether it should be promoted to serve production traffic. For a revision to pass this evaluation, its data test results must be at least as good as the results of the previously promoted revision, the one currently serving production traffic. For the benchmark revision, this is an automatic pass, but from that point on, the comparison is made each time.
The regression test results are evaluated at the same time as the data test results. Rather than comparing to the previous revision, we simply ensure that the tests have all passed. If any regression test fails, the revision is rejected and cannot be promoted.
Critically, the predictions on which the metrics are operating are obtained not from the individual model artifacts in the service, but by invoking the service itself. This means that the implementation of a service is free to change over the course of its minor versions, without requiring the tests to be re-written. This also emphasizes the evaluation of the service as a whole, rather than evaluating individual model artifacts in isolation. This creates an environment in which the emphasis for the engineer is always on optimizing towards the customer benefit.
In general, the more tests, the better.
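To illustrate the distinction, the sketch below shows what a data test and a regression test might look like for the hypothetical classifier above. It assumes illustrative `invoke_service` and `load_holdout_set` helpers that call the deployed service rather than any individual model artifact.

```python
import pytest
from sklearn.metrics import f1_score

# Hypothetical helpers; the real interfaces are covered in the Processes docs.
from my_service.testing import invoke_service, load_holdout_set


def data_test_macro_f1() -> float:
    """Data test: a scalar metric computed by invoking the service itself."""
    holdout = load_holdout_set()
    predictions = [invoke_service(example.instance).majority_class for example in holdout]
    labels = [example.label for example in holdout]
    return f1_score(labels, predictions, average="macro")


@pytest.mark.parametrize(
    "instance, expected_class",
    [
        # Hand-crafted examples of expected requests to the service.
        ({"customer_id": "c-123", "location_id": "l-456"}, "retain"),
    ],
)
def test_regression_expected_class(instance, expected_class):
    """Regression test: pass/fail on a hand-crafted request."""
    assert invoke_service(instance).majority_class == expected_class
```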
The Load Testing Specification¶
Given an established schema and serving infrastructure framework, load testing is a very well-defined and repeatable process with a small interface for customization. Customization focuses on the difficulty of the test and the performance required in order to consider the test passed.
It should be possible to infer the values of these parameters from the service specification, if they cannot be read from it directly.
The load testing specification lives at the major version level because it comes primarily from the service specification, which should ideally remain unchanged for the life of a major version. If, however, those that collaborated on the service specification determine that these parameters need to be adjusted, these can be safely adjusted as part of a new minor version.
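For illustration only, a load testing specification might reduce to a handful of parameters like the following; the names and values are hypothetical and would be derived from the SLAs in the service specification.

```python
# Hypothetical load testing parameters; names and values are illustrative.
LOAD_TEST_SPEC = {
    # Difficulty of the test.
    "requests_per_second": 50,
    "duration_seconds": 300,
    "concurrent_clients": 20,
    # Performance required for the test to be considered passed.
    "max_p95_latency_ms": 200,
    "max_error_rate": 0.001,
}
```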
For Revisions¶
Experimentation in Notebooks¶
Jupyter Notebooks are the ideal environment in which to try out ideas. When used correctly, they are self-documenting, easily shared, and produce reproducible research. To use Jupyter correctly is to cleanly separate cells, make liberal use of markdown cells, ensure linearity of execution, and maintain clean code standards despite the relative lack of tooling. However, given that the goal of experimentation is to move as quickly as possible, we do not codify or enforce these principles in tooling or process, to avoid unnecessarily slowing down prototyping.
For All Services¶
Experiment Documentation¶
At this point, the engineer's intention for the remainder of the experimental process should be thoroughly documented. This documentation is then submitted for review before proceeding with the remainder of the lifecycle.
The structure and overall tone of this documentation is covered in our article on documentation.
At this point, the documentation should be submitted as a pull request, so that it can be peer reviewed in the absence of any other experiment-specific code. Once the approach has undergone a successful peer review, the engineer can proceed with the next steps.
The Serving Endpoint(s)¶
As you have no doubt noticed by now while making your way through these docs, we deploy machine learning services, not machine learning models. Machine learning models are the dependencies of a machine learning service, which is an API that serves a collection of endpoints that combine to provide a particular capability.
When a machine learning service is invoked, an HTTP request is made, and that request must be resolved by a handler. Writing this set of handlers -- whether one or many -- is the focus of this step of the lifecycle. The role of a handler is to convert input conforming to the service's input schema into output conforming to its prediction schema, using the model artifacts generated by the training process.
Why does this step appear before training in the lifecycle? Two reasons. First, the implementation of this API is independent of the performance of the model artifacts. We don't need a completed training process in order to implement the logic that invokes the model artifacts. Second, and more importantly, writing the serving logic first allows the engineer to crystalize in their mind exactly what artifacts are going to be needed by the service's various endpoints. This helps clarify the exact outputs of the training process: what artifacts should be generated, and what their interface should be once generated.
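As a sketch only -- assuming a FastAPI-style framework, the illustrative schema above, and a hypothetical `load_classifier` artifact helper -- a handler might look like this:

```python
from fastapi import FastAPI
from pydantic import BaseModel

# Hypothetical artifact loader; the real mechanism is covered in Processes.
from my_service.artifacts import load_classifier

app = FastAPI()
classifier = load_classifier()  # model artifact produced by the training process


class PredictRequest(BaseModel):
    instances: list[dict]  # identifiers of the relevant entities
    parameters: dict = {}  # request-level configuration shared by all instances


class PredictResponse(BaseModel):
    predictions: list[dict]  # full class distribution per instance


@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    """Convert input conforming to the service schema into predictions."""
    predictions = [
        {"class_probabilities": classifier.predict_proba(instance, **request.parameters)}
        for instance in request.instances
    ]
    return PredictResponse(predictions=predictions)
```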
Serving Tests¶
No production code at Paystone is complete without tests. Here, we write tests covering the expected and edge cases for requests made to the service, to ensure that the expected behaviour is achieved in all cases. These tests do not require trained artifacts, because they mock the artifacts. These tests are not about evaluating performance; they are about verifying that the serving logic is correct.
When writing these tests, the main assertion, aside from ensuring that any potential error states are handled correctly, is generally that the model artifacts are invoked in the expected way given the serving logic.
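Continuing the sketch above, and assuming the handler lives in a hypothetical `my_service.serving` module, such a test might mock the artifact like so:

```python
from unittest import mock

from fastapi.testclient import TestClient

from my_service.serving import app  # hypothetical module containing the handler


def test_predict_invokes_classifier_per_instance():
    """Serving test: mock the model artifact and verify the serving logic."""
    with mock.patch("my_service.serving.classifier") as fake_model:
        fake_model.predict_proba.return_value = {"churn": 0.2, "retain": 0.8}
        client = TestClient(app)

        response = client.post(
            "/predict",
            json={"instances": [{"customer_id": "c-123"}], "parameters": {}},
        )

        # The serving logic should succeed and invoke the artifact once per instance.
        assert response.status_code == 200
        assert fake_model.predict_proba.call_count == 1
        assert response.json()["predictions"][0]["class_probabilities"]["retain"] == 0.8
```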
The Training Module¶
Armed with a clear picture of what artifacts are required to create the service, and a clear idea in mind from experimentation, the goal of this module is to implement that idea in order to generate and optimize the required artifacts.
The standards for code written in the training module are considerably higher than those of the notebook environment. Training module code is held to the same standard as all other code checked into version control; the development environment also comes with tools to help enforce these standards. Maintaining these standards is important: at some point after a training module is created, someone -- whether the same engineer or a different one -- will need to read that code in order to understand the techniques employed to create the service, so that it can be iterated on.
Although this code is generally run only once, that execution can be very long-running, so some effort should be put into making it efficient. Optimizing training execution time is of course not as important as optimizing serving execution time, but it is important nonetheless.
You may notice that we do not unit test the code in a training module. This is because training module code is not as high a priority to test for edge cases and expected behaviour, since it is for the most part single-use. The utility of these tests does not outweigh the effort required to write them. Mathematical bugs, logical errors, and the like will often be reflected in sub-par end results according to the evaluation metrics. Because of this, it is recommended to execute training on smaller data subsets during development of a training module in order to validate the code. Tools for this process are provided, as discussed in the more detailed Processes article.
Training modules are responsible for generating their own datasets. This should be done as code within the training module, so that the process of gathering the dataset is repeatable. Any auxiliary data used in training should similarly be collected in code as part of the training module. For example, an auxiliary JSON file from the internet should be downloaded using an HTTP request library in the code of the training module, rather than downloaded and placed in the repository manually.
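As a rough sketch, with hypothetical helper names and an illustrative URL, a training module might take the following shape; the point is that dataset generation, auxiliary downloads, and artifact export all happen in code:

```python
import joblib
import requests
from sklearn.linear_model import LogisticRegression

# Hypothetical internal helper for assembling the training dataset.
from my_service.data import build_training_frame

AUXILIARY_DATA_URL = "https://example.com/category-mapping.json"  # illustrative


def run_training(output_path: str = "model.joblib") -> None:
    """Gather data, train, and export the artifact needed by serving."""
    # Dataset generation lives in code so the process is repeatable.
    frame = build_training_frame()

    # Auxiliary data is fetched programmatically, never committed by hand.
    category_mapping = requests.get(AUXILIARY_DATA_URL, timeout=30).json()
    frame["category"] = frame["category"].map(category_mapping)

    model = LogisticRegression(max_iter=1000)
    model.fit(frame.drop(columns=["label"]), frame["label"])

    # Export exactly the artifact(s) the serving endpoints expect.
    joblib.dump(model, output_path)
```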
The Infrastructure Specification¶
Infrastructure is uniquely defined for each minor version of a service, even if it remains unchanged from the previous version. The infrastructure specification defines the hardware on which training is executed, and the hardware available to the service in production.
Training happens on a single machine. Serving happens on an instance group of stateless machines with identical provisioned resources. The parameters of the single training machine, of each serving machine, and of the autoscaling for the number of serving machines are all contained in the infrastructure specification.
Other parameters include the behaviour of the service during a scaling event and the request timeout, as examples.
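Purely as an illustration, an infrastructure specification might boil down to values like these; the keys and values are hypothetical.

```python
# Hypothetical infrastructure specification; keys and values are illustrative.
INFRASTRUCTURE_SPEC = {
    "training": {
        "machine_type": "n1-highmem-8",   # single training machine
        "accelerator": None,
    },
    "serving": {
        "machine_type": "n1-standard-2",  # identical, stateless machines
        "min_replicas": 2,
        "max_replicas": 10,
        "target_cpu_utilization": 0.6,    # autoscaling trigger
        "request_timeout_seconds": 10,
    },
}
```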
Additional Data and Regression Tests¶
Unlike the schema, which lives alongside them at the major version level, data and regression tests are mutable. Existing tests should not generally be removed, but new tests can be added as needs are identified. When this happens, the results of the first revision that makes use of a new test are not compared to the previous revision, because there is nothing to compare to. From that point forward, these tests behave the same as the other tests.
If gaps are identified in the data or regression tests for a service, this is the time to add them.
Deployment¶
At this point, a pull request can be made with all of the code from the previous steps. Peer review takes place as normal, with the serving tests being run as part of continuous integration, and upon a successful merge into the master branch, the engineer is free to execute deployment.
Deployment is not an automated process. Deploying is in fact separated into two manually triggered stages: deploying training, and deploying serving. Executing both of these steps is greatly simplified for the benefit of the engineer.
Deploying training will create a training machine according to the infrastructure specification, execute only the training process, and then shut down the machine. At this point, among the generated artifacts will be an exit code and logs for the process. The engineer should inspect these artifacts to ensure that training completed as expected before proceeding. If it did not, another pull request may be required to resolve whatever issues prevented it from completing.
On successful completion of the training process, the engineer may deploy the revision. Deploying the revision will create a new set of paths in the MLServing API and deploy its resource stack. Deploying revisions should take a consistent and relatively short amount of time.
At this point, accessible paths in the MLServing API point to newly created resources for the given revision. The engineer is free to experiment with the deployed endpoints to perform any sanity checks or manual tests, though nothing requiring the engineer's intervention is codified or required at this stage. Generally, this is just to confirm that the service is available and running smoothly, though availability checks will be run automatically as part of deployment.
Promotion or Teardown¶
Once the engineer is confident in the revision, they can, without another step of peer review, attempt to promote the revision. What is promotion?
When a revision is first deployed, it serves zero production traffic, despite existing in the production API. The path forward for a revision to receive production traffic is to be promoted to the endpoint of its associated major version within the service. Promotion is an action that an engineer can attempt.
We say "attempt" because promotion involves a rigorous series of checks invoking all of the testing that has been put in place so far, with the exception of the serving tests, which have already been consumed. Both the data and regression test results are evaluated as described above. Load testing is conducted against the deployed endpoint, and the results are evaluated against the requirements from the specification.
If any of this testing fails, rather than being promoted, the revision has its resource stack torn down, and its endpoints removed from the API, with logging explaining the reason. This experiment should be considered failed, and the next attempt should take place under the next minor version.
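Conceptually, and with illustrative names only, the promotion decision reduces to a gate like the sketch below:

```python
from typing import Dict, Optional


def can_promote(
    data_test_results: Dict[str, float],
    previous_results: Optional[Dict[str, float]],
    regression_tests_passed: bool,
    load_test_passed: bool,
) -> bool:
    """Illustrative promotion gate combining the checks described above."""
    if not (regression_tests_passed and load_test_passed):
        return False
    if previous_results is None:
        # No previously promoted revision to compare against: automatic pass.
        return True
    # Data test results must be at least as good as the promoted revision's.
    return all(
        data_test_results[metric] >= score
        for metric, score in previous_results.items()
    )
```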
If the testing succeeds, we move into the A/B testing phase.
A/B Testing¶
A/B testing is a statistical framework for empirically comparing multiple solutions to a problem. Data is collected on the performance of each solution, and the results are analyzed in order to determine whether one solution is performing statistically significantly better than the other.
In order for this to work, metrics are needed. This is the data that the A/B testing process measures. Fortunately, we have defined clear business metrics in previous steps of the lifecycle. These business metrics are the basis of the A/B test.
If the new revision for a service is determined to be a statistically significant improvement, then the old revision is torn down, and the newer revision is left with 100% of the production traffic.
If the new revision for a service is determined not to be a statistically significant improvement, then the new revision is torn down, and the existing revision remains in its place, again receiving 100% of the production traffic.
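As a sketch, for a business metric that happens to be a simple proportion (say, a hypothetical conversion rate), the comparison might reduce to a two-proportion z-test; the counts below are invented for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts for a proportion-style business metric
# (e.g. conversions out of predictions served by each revision).
conversions = [412, 380]     # [new revision, current revision]
observations = [5000, 5000]

# Two-proportion z-test: is the new revision performing significantly better?
z_stat, p_value = proportions_ztest(conversions, observations, alternative="larger")

if p_value < 0.05:
    print("Promote the new revision; tear down the old one.")
else:
    print("Tear down the new revision; keep the current one.")
```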