
CI/CD

Continuous integration and delivery (CI/CD) is a key component of delivering machine learning services. It ensures quality, increases developer productivity, and prevents simple mistakes.

Our CI/CD framework of choice is Cloud Build. It is tightly integrated with the GCP ecosystem and easy enough to use (though it is not without its pain points). Cloud Build triggers are the artifacts of our CI/CD configuration: they are the jobs that run in response to particular changes to the codebase, or to manual invocation.

Our Cloud Build triggers are divided into three categories:

  1. Master branch triggers.
  2. Pull request triggers.
  3. Manual triggers.

In this article, we go through each of our triggers and explain what it does and when it runs.

Master Triggers

In general, our master branch triggers re-deploy Docker images and auxiliary services. These are the changes that can confidently be made automatically when new code arrives on the master branch.

Images

We have master triggers for the following Docker images:

  • Clean Code
  • Deployment
  • Notebooks
  • PSML

These triggers perform three steps in sequence (sketched below):

  • Pull the latest version of the image.
  • Rebuild the image, using the latest version as a base, from the master branch of the codebase.
  • Push the resulting image as the new latest version.
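
As a rough sketch, the build config for one of these image triggers could look like the following. The psml image name and the cache-from flag are illustrative assumptions, not our exact configuration:

    steps:
      # Step 1: pull the current latest image so it can be reused as a build cache.
      - name: 'gcr.io/cloud-builders/docker'
        args: ['pull', 'gcr.io/$PROJECT_ID/psml:latest']
      # Step 2: rebuild the image from the master branch checkout, using the
      # pulled image as the cache/base.
      - name: 'gcr.io/cloud-builders/docker'
        args:
          - 'build'
          - '--cache-from=gcr.io/$PROJECT_ID/psml:latest'
          - '--tag=gcr.io/$PROJECT_ID/psml:latest'
          - '.'
      # Step 3: push the rebuilt image as the new latest version.
      - name: 'gcr.io/cloud-builders/docker'
        args: ['push', 'gcr.io/$PROJECT_ID/psml:latest']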

Auxiliary Cloud Run Services

We have master triggers for all of our auxiliary services that run on Cloud Run. These perform the same three steps as the Docker image triggers, followed by a fourth step: re-deploying the Cloud Run service with the new latest image (sketched below).
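
Roughly, the extra step amounts to a single gcloud call once the image has been pushed. The service name, region, and image path below are placeholders:

    steps:
      # ...the three image steps above, then:
      # Step 4: re-deploy the Cloud Run service so it serves the new latest image.
      - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
        entrypoint: 'gcloud'
        args:
          - 'run'
          - 'deploy'
          - 'auxiliary-service'
          - '--image=gcr.io/$PROJECT_ID/auxiliary-service:latest'
          - '--region=us-central1'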

Auxiliary Compute Engine Services

We have master triggers for the remaining auxiliary services, those that run on Compute Engine instances. These share the three steps of the Docker image triggers; their fourth step is to restart the Compute Engine instance, which causes it to re-pull the freshly rebuilt Docker image (see the sketch below).
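
The fourth step here is likewise a single gcloud call, something along these lines, with the instance name and zone as placeholders:

    steps:
      # ...the three image steps above, then:
      # Step 4: restart the instance so it re-pulls the rebuilt image on boot.
      - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
        entrypoint: 'gcloud'
        args:
          - 'compute'
          - 'instances'
          - 'reset'
          - 'auxiliary-worker'
          - '--zone=us-central1-a'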

Pull Request Triggers

End to End Tests

Our template experiments serve a secondary purpose beyond guiding machine learning engineers on experiment organization: each of them is deployed from end to end as a test of our deployment pipeline.

The steps involved in this test are:

  1. Create the new experiment using the CLI.
  2. Rebuild and push the relevant experiment's Docker image.
  3. Deploy a training job for the experiment.
  4. At the completion of the training job, deploy a service revision to the MLServing API.
  5. Instantiate a client for the new service revision, and use it to poll the new revision's health endpoint until it is ready for requests.
  6. When it becomes ready for requests, delete the revision, undeploying its entire stack.

Steps 3 through 6 are contained in a CLI command called psml test end-to-end.
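
For reference, the final step of this trigger's build config can be as simple as the following; the builder image is an assumption, and only the command itself is documented here:

    steps:
      # Steps 1-2 (experiment creation and image rebuild) precede this step.
      # Steps 3-6: train, deploy a revision, poll its health endpoint, tear it down.
      - name: 'gcr.io/$PROJECT_ID/psml:latest'
        entrypoint: 'psml'
        args: ['test', 'end-to-end']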

The purpose of this test is to ensure that changes to our infrastructure never break the deployment pipeline for new machine learning services (or new revisions).

Unit Tests

When a pull request includes changes to any module that has corresponding tests in tests/, those tests are run using psml test. For example, changes in paystone/storage would trigger a job that runs psml test paystone storage.

Specifically, the tests which are run in this fashion are:

  • psml test psml
  • psml test psml_engine
  • psml test service [service] for each of our existing services.
  • psml test paystone [package] for each of our existing Paystone packages.
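
As an illustration, the trigger for the paystone/storage example above could be defined roughly like this; the repository details, file paths, and config filename are assumptions:

    # Hypothetical pull request trigger for the paystone/storage unit tests.
    name: unit-tests-paystone-storage
    github:
      owner: paystone
      name: psml
      pullRequest:
        branch: '^master$'
    # Only run when the module or its tests change.
    includedFiles:
      - 'paystone/storage/**'
      - 'tests/paystone/storage/**'
    filename: ci/unit_tests.cloudbuild.yaml

The referenced build config would then run psml test paystone storage.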

Serving Tests

When a pull request includes changes to only one experiment in the experiments folder, the serving tests for that experiment are run using psml test serving.

Aside from the initial logic to determine the experiment being tested, this check behaves the same as the unit tests check.
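
The main difference is the file filter, which scopes the trigger to the experiments folder; a hedged sketch of the relevant trigger fields (paths are assumptions):

    # Hypothetical filter for the serving tests pull request trigger.
    includedFiles:
      - 'experiments/**'
    filename: ci/serving_tests.cloudbuild.yaml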

Manual Triggers

Manual triggers are invoked by a developer through Cloud Build directly. By framing the primary functions of the experiment lifecycle as manual CI triggers, we:

  • Improve telemetry by invoking a discrete process that is tracked by Cloud Build.
  • Improve security by applying the necessary permissions only to the relevant service account.
  • Make the steps of the lifecycle more convenient for the developer by wrapping each of them in a button-click process.

These are manual triggers rather than master branch triggers primarily because of the difficulty of assigning a particular trigger to a particular code change. Our tool for deciding which trigger to run on a commit to master is the set of files changed in that commit, and the mapping between file changes and experiment lifecycle stages is non-trivial and, at times, ambiguous.

Below are the processes we have created manual triggers for:

  • Executing a training run for a given experiment.
  • Deploying a service revision for a given experiment and version.
  • Attempting to promote a given revision for a given service.
    • The word "attempt" comes from the evaluation steps prior to the actual promotion command; if any of these fail, promotion does not happen.
    • The evaluation steps are:
      • Evaluating the data and regression tests of the promotion candidate.
      • Running acceptance testing for the promotion candidate.
  • Tearing down a given service revision's resource stack.
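
As an example, a manual trigger for deploying a service revision might be defined roughly as follows. The substitution names, repository URI, and config path are assumptions; manual triggers simply expose a "Run" button in Cloud Build and let the developer fill in the substitutions:

    # Hypothetical manual trigger for deploying a service revision.
    name: deploy-service-revision
    # No push or pull request event: this trigger only runs when invoked manually.
    sourceToBuild:
      uri: https://github.com/paystone/psml
      ref: refs/heads/master
      repoType: GITHUB
    filename: ci/deploy_revision.cloudbuild.yaml
    substitutions:
      _EXPERIMENT: ''   # filled in by the developer at run time
      _VERSION: ''      # filled in by the developer at run time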