Experiment Branches
As discussed in our MLServing API architecture article, our machine learning services follow semantic versioning. For any given service, there may be many historical versions, and even multiple actively deployed versions. There may also be long-running experiment lifecycles building revisions of existing services. The master branch of our codebase, however, contains only one implementation for each service at any given time. This leads to some questions:
- How are the implementations of old service revisions maintained for experimental lineage?
- How are the implementations of multiple deployed revisions of the same service maintained, for example during A/B testing?
- How do developers simultaneously develop new experiments for existing services, and fix issues with active production deployments?
The question of maintaining the code of old revisions can be answered without reference to any human process, so we handle it first. The answer is simple: because the artifacts of every experiment include the git hash of the machine learning repository at the time the experiment was trained and deployed, we retain the ability to examine the entire codebase exactly as it is being used by the deployed service. A CLI command to check out the codebase at the git hash of a particular experiment can provide a convenience layer here.
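As a sketch of what that convenience layer might look like: the `mlctl` command, its flags, and the experiment identifier below are hypothetical placeholders, and only the final `git checkout` is the essential step.

# Hypothetical lookup of the git hash recorded in an experiment's artifacts
~> mlctl experiment describe housing--myusername--001 --field git_hash
9f3c2ab
# Check out the repository exactly as it was when the revision was trained and deployed
~> git checkout 9f3c2ab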
Answering the remaining two questions requires explaining the concept of "experiment branches", which is our approach to formalizing the relationship between the experiment lifecycle and Git. The remainder of the article explains what experiment branches are and how to use them, then uses that foundation to answer the two remaining questions.
What Experiment Branches Are
Our process centers around the idea of an experiment branch. Experiment branches are the machine learning equivalent of release branches. A release branch is a copy of the main codebase (a branch off of master) where an individual or team of developers can work on a new feature until it is complete. It packages the small changes of many commits and individual pull requests into a single update to the main version of the codebase that delivers an entire customer benefit.
Release branches are not used in the growth platform at Paystone. Its product engineering organization embraces a culture of continuous delivery, and release branches do not align with continuous delivery. However, the ideal of continuous delivery itself does not mesh with the machine learning lifecycle, for two reasons.
- There is no objective truth when it comes to the "correct" implementation of a machine learning service. This subjects all non-trivial changes to the scientific process.
    - Application code is the implementation of a feature specification created by the product engineering organization. This feature specification is a codification of a customer benefit. It becomes the objective target functionality of the code which implements it: the code either implements the specification completely and accurately, or it does not, and that is the binary success criterion for the code. The feature specification itself may not optimally generate its associated customer benefit, but the optimization of feature specifications exists separately from the creation of code which implements them.
    - In contrast, the ability of the implementation of a machine learning service to deliver the service's associated customer benefit has no objective, binary outcome. It must be measured through data, and compared to other implementations based on that continuous measurement.
- The many stages of the scientific process -- codified by our experiment lifecycle -- combine to create a single deliverable, meaning none of the individual stages have public deliverables of their own.
    - Application code can, for the most part, be delivered incrementally. Customer benefits can be decomposed into features, and features can be decomposed into discrete deliverable improvements.
    - The same is not true of machine learning code, due to the heavy interdependency of stages in the experiment lifecycle. Writing load tests depends on having a deployed serving application; serving applications depend on loading training artifacts; training artifacts depend on executing a training job; training jobs depend on having data tests; data tests depend on establishing a schema, which depends on establishing a service specification. None of these stages produces a deliverable on its own; they combine to form a single deliverable, which is a machine learning service revision.
    - While deployment checkpoints could technically be inserted into the experiment lifecycle, they would not have any value to the application, the client, or the customer. For example, executing training and producing model artifacts before the associated serving application is complete would technically be more "continuous", but the outcome of this incremental deployment would simply be files in a Cloud Storage bucket.
In a machine learning world where continuous delivery is not a goal, the concept of a release branch, or an experiment branch, is actually very useful.
Why Not Feature Flags?
While the growth platform does not use release branches, it does use feature flags extensively. Feature flags are part of a control flow service that lets application services inspect incoming requests and determine which of a set of alternative implementations should serve them. One of their main uses is controlling releases: one can configure feature flags to route some percentage of production traffic to a previous implementation of a feature and the rest to a new implementation, while easily shifting traffic from old to new over time.
At a glance, this seems like a good approach to releasing new versions of machine learning services. The problem is that it does not mesh with the infrastructure of machine learning in the same way that it does for application code.
As outlined in our MLServing API article, each service revision is deployed as its own backend service, with its own dedicated instance group. This is because machine learning artifacts often require large amounts of dedicated memory, processing power, and at times GPU memory in order to hold their parameters. Deploying multiple machine learning service revisions on the same compute stack could spell disaster, as multiple large artifacts would constantly compete for significant resources on their shared machine.
Feature flags are designed to operate within a single process, on a single instance at a time. They are queried, return a boolean, and that boolean is used in the control flow of a process. These two ideas are at odds with each other: in order to use feature flags to route requests between different machine learning service revisions, those service revisions would need to be deployed on the same compute resources. As we've established, they are not, and cannot be.
In short, feature flags are a great mechanism for developing new releases of application code, because two implementations of the same application feature can make use of the same compute resources. The same cannot be said of two machine learning service revisions, making feature flags a non-starter for machine learning.
Using Experiment Branches
When beginning a new experiment lifecycle, check out a branch off of master. Name it after the service of the revision you are about to create, with the prefix `experiment` and your GitHub handle as a suffix, all separated by `--`. The username suffix differentiates between multiple engineers working on experimental revisions of the same service at the same time. While this should be rare, it can happen, and may become particularly relevant as our team grows.
For example, as GitHub user "@myusername" beginning work on a new version of "housing":
~> git checkout master
~> git pull origin master
~> git checkout -b experiment--housing--myusername
This is your new base branch for all work contained within the experiment lifecycle. Develop documentation and code according to the lifecycle, creating issues as normal and naming branches after the issues as discussed in our GitHub practices. When making pull requests for these issue branches, make the base branch your experiment branch, rather than master.
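As a minimal sketch of that flow, assuming the GitHub CLI is available (the issue branch name is purely illustrative):

# Branch for an issue off the experiment branch rather than master
~> git checkout experiment--housing--myusername
~> git checkout -b 42--add-housing-data-tests
# Open the pull request with the experiment branch as its base
~> gh pr create --base experiment--housing--myusername --fill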
Deployments of training jobs and serving endpoints will always happen from the branch corresponding to the experiment.
Attempts at promotion will end one of three ways:
- Evaluation fails. The new revision is not promoted, and the branch remains active.
- Evaluation passes. There is no existing revision for this service. The new revision is promoted to receive 100% of production traffic, and the branch is deleted.
- Evaluation passes. There is an existing revision serving production traffic. The new revision is promoted to receive a portion of traffic in an A/B testing setting, and the branch remains active.
When evaluation fails at this stage -- which is hopefully rare -- the still-active branch can be inspected as part of a retrospective for the experiment. When ready, the engineer should then delete the branch and move on to repeating the lifecycle for the next experiment.
At this point, the remainder of the lifecycle is out of the engineer's hands, but for completeness we'll describe how it concludes: when A/B testing ends, a determination is made about which revision should serve 100% of production traffic. If the "challenger" revision is accepted, its branch is merged into master and deleted. If the incumbent revision is retained, the experiment branch is deleted without merging.
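In Git terms, the two outcomes look roughly like the following. This is a sketch assuming a plain command-line merge; in practice the merge may well happen through a pull request.

# Challenger accepted: fold the experiment branch into master, then remove it
~> git checkout master
~> git merge --no-ff experiment--housing--myusername
~> git push origin master
~> git branch -d experiment--housing--myusername
~> git push origin --delete experiment--housing--myusername
# Incumbent retained: discard the experiment branch without merging
~> git branch -D experiment--housing--myusername
~> git push origin --delete experiment--housing--myusername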
That's the process. Now, how does this help us when we have multiple deployed revisions, and how does it help us simultaneously participate in experiment lifecycles and fix bugs in production?
Managing Multiple Deployed Revisions
When only one revision for a given service is actively deployed, the relationship between the codebase and the production API is straightforward: the experiment module corresponding to the service contains exactly the code of its production deployment. A problem arises when multiple different revisions of the same service are splitting production traffic. This currently occurs during a period of A/B testing, which is a required component of the experiment lifecycle. In other words, it's something that happens every time we deploy new machine learning service revisions, so it is a frequent, desired, and unavoidable situation.
Once we concede that not every production resource will be reflected in a single master branch, experiment branches solve this problem quite elegantly, because they essentially mirror in version control what is happening in the API. Consider this parallel between production resources and Git branches:
- When deploying new services:
    - The sole revision is promoted immediately to receive 100% of the service's production traffic.
    - The experiment branch is immediately merged into master and deleted.
- When deploying revisions to existing services:
    - The revision is promoted to receive a portion of the service's production traffic.
    - The experiment branch remains active, because its resulting deployed resources are active.
- At the conclusion of A/B testing:
    - If successful in supplanting the incumbent:
        - The new revision takes over 100% of production traffic.
        - The experiment branch is merged into master and deleted.
    - If unsuccessful in supplanting the incumbent:
        - The new revision is removed from the API.
        - The experiment branch is deleted without merging.
This parity between code and resources is the ideal state of any process governing the use of version control.
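For example, while a challenger revision of "housing" is in A/B testing, a glance at the repository mirrors the state of the API (output illustrative):

# master holds the incumbent; each active challenger has its own experiment branch
~> git branch --list 'experiment--*'
  experiment--housing--myusername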
Experimenting and Maintaining Simultaneously
Though many forms of tests go into a machine learning service revision, it is of course possible for bugs to surface post-deployment. The most volatile time will always be immediately after deployment -- a time when we are also least likely to be working on a new experiment lifecycle for the service -- but bugs can also appear later in the life of a deployed revision. We cannot discount the possibility of a production bug arising while an experiment lifecycle for the same service is already underway.
With one branch maintaining one source of truth for an entire service, the simultaneous development of these two projects would be impossible. With experiment branches, this is trivial.
Bug fixes for production deployments of services branch off of the branch corresponding to the affected revision. If the bug is in the sole active revision, or in the incumbent revision of an A/B testing scenario, the base branch is master. If it is in the challenger revision of an A/B testing scenario, the base branch is the corresponding experiment branch. All of this can happen independently of an engineer working on a new experiment lifecycle, who works exclusively from their own distinct experiment branch.
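A sketch of the two cases, with illustrative issue branch names:

# Bug in the sole active revision, or in the incumbent during A/B testing: branch off master
~> git checkout master
~> git pull origin master
~> git checkout -b 57--fix-housing-request-validation
# Bug in the challenger revision during A/B testing: branch off its experiment branch
~> git checkout experiment--housing--myusername
~> git checkout -b 58--fix-housing-challenger-timeout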