Docker

Docker is at the heart of machine learning development at Paystone. Virtually all code that is deployed, and even most code that runs exclusively locally, runs in a Docker container. Docker simplifies deployment, unifies local and production environments, and prevents any discussion of "but it worked on my machine!".

In this article, we'll go through each of our Docker images and explain their purpose and usage. We'll also discuss how Dockerfiles are written, since we use a custom component system for them.

Our Images

All images, unless otherwise noted, have some basic things in common:

  • They all have the same version of Python installed.
  • They all have every Paystone package installed.
  • They all have the CLI installed.

Clean Code

This image is used by the clean code check as well as the environment commands of the CLI. Its primary differentiators from the PSML image are that it requires our custom stubs, and that it manages the pyproject.toml file that configures our clean code tools.

Deployment

This image is the environment in which all CI build steps execute.

Before this image existed, many CI steps required preparatory commands: pip installations of third-party and custom packages, environment variable setting, and the like. The Deployment image was born out of the desire to unify the environment of all CI steps, eliminating these preparatory commands and cleaning up our builds.

Because it effectively acts as the "host" in a CI context, it runs Docker containers itself, through the Docker-out-of-Docker paradigm. To support this, we install Docker in the Dockerfile.
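
As a rough sketch (the installation method and image name here are simplified assumptions, not our exact steps), the relevant part of the Dockerfile looks something like this, with the host's Docker socket mounted in at runtime:

# Sketch: install Docker inside the image (assumes a Debian-based image
# with curl available; the convenience script is one of several methods).
RUN curl -fsSL https://get.docker.com | sh

# At runtime, the host's Docker socket is mounted in, so docker commands
# inside the container drive the host daemon:
#   docker run -v /var/run/docker.sock:/var/run/docker.sock deployment-image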

Some CI build steps need to invoke the gcloud CLI, so the Deployment image includes it, unfortunately at the cost of nearly an extra gigabyte of image size.
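
A minimal sketch of that installation (using Google's install script with its default install location; our actual steps may differ):

# Sketch: install the gcloud CLI via Google's install script (one of
# several supported methods) and put it on the PATH.
RUN curl -sSL https://sdk.cloud.google.com | bash -s -- --disable-prompts
ENV PATH="/root/google-cloud-sdk/bin:${PATH}"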

Experiments

This image is the environment in which all machine learning services are deployed, and all training jobs are executed.

There is one tag of the Experiments image for each experiment, or service revision. The primary ways in which tags of the image differ from experiment to experiment are the Linux and Python package dependencies, and for these we use Docker build arguments.

The only other way the images differ from experiment to experiment is in the base image used. When an experiment requires a GPU, the base image must be nvidia/cuda. That image does not have Python installed, so we then install our currently used version of Python on top of it. Otherwise, we simply use the corresponding version of the official python Docker image. Which base image to use is determined automatically from an experiment's infrastructure: if either training or serving needs a GPU, the experiment image uses the CUDA base.
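
A simplified sketch of the pattern (the argument names, defaults, and versions here are illustrative, not our actual ones):

# Sketch: the base image and per-experiment dependencies are injected as
# build arguments when each experiment's tag is built.
ARG BASE_IMAGE=python:3.11-slim
FROM ${BASE_IMAGE}

ARG APT_PACKAGES=""
ARG PIP_PACKAGES=""
RUN if [ -n "${APT_PACKAGES}" ]; then \
        apt-get update && apt-get install -y ${APT_PACKAGES}; \
    fi
RUN if [ -n "${PIP_PACKAGES}" ]; then pip install ${PIP_PACKAGES}; fi

A GPU experiment would then be built with something like --build-arg BASE_IMAGE=nvidia/cuda:<tag>, followed by the steps that install Python, since that base lacks it.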

GCloud

This image is the exception to the basic configuration: it does not have Python, the CLI, or Paystone packages installed. It exists purely to wrap the google/cloud-sdk image with the same Linux user name (paystone) as other images, and to add our Docker config that allows us to pull from Artifact Registry.
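
A sketch of what this wrapper amounts to (the copied file name is illustrative):

# Sketch: thin wrapper over the official image.
FROM google/cloud-sdk:slim

# Same Linux user name as our other images.
RUN useradd --create-home paystone
USER paystone

# Docker config that authorizes pulls from Artifact Registry.
COPY --chown=paystone:paystone config.json /home/paystone/.docker/config.json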

Notebooks

This image is used both on local machines and on Compute Engine instances to run JupyterLab.

It includes some convenience requirements for data science: NumPy, pandas, scikit-learn, and TensorFlow. It also includes custom configuration for JupyterLab, maintained personally by each engineer and synced with Google Cloud Storage, so that it is identical between local and remote machines.

Because notebooks have their own version control system, the image includes GCSFuse in order to sync notebooks from GCS on startup. This makes starting JupyterLab with this image a bit slower than normal (and somewhat dependent on internet speed), but it ensures that common notebooks are always synced across the team.
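
As an illustrative sketch (the bucket and path names are made up), the startup sequence amounts to something like:

# Sketch: mount the notebooks bucket with GCSFuse before launching JupyterLab.
CMD gcsfuse notebooks-bucket /home/paystone/notebooks && \
    jupyter lab --ip=0.0.0.0 --no-browser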

The gcloud CLI is also included, out of necessity: it is needed to copy files to and from GCS.

Playground

This image is used both locally and in production to deploy the Playground auxiliary service.

This Dockerfile contains very little customization beyond the standard components (discussed below). The one piece of unique configuration is the start script, and its associated start command, which launches Streamlit.
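
For illustration (the app path and port are assumptions, not our actual values), the start command boils down to:

# Sketch: launch the Streamlit app as the container's start command.
CMD ["streamlit", "run", "playground/app.py", "--server.port=8080", "--server.address=0.0.0.0"]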

PSML

This image is used by all CLI commands which do not have their own specific image to run in. For example, psml experiments train must use its associated Experiments image, so it does not use PSML, but psml test paystone storage has no specific needs beyond the standard configuration, so it uses PSML.

The only configuration beyond standard components is the pytest.ini configuration file for tests, since all unit tests run through this image.

Services

This image is used by all auxiliary services, with tags for each individual service. As with Experiments, the tags correspond to build arguments that provide the Python package installations required for the particular service.

The image installs the package corresponding to the service and its dependencies. It also includes a start script that launches Gunicorn (in production) or Uvicorn (locally) to run the service, as well as Gunicorn configuration.
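
A sketch of the idea (the module path, port, and the environment switch are illustrative assumptions):

# Sketch: pick the server based on where we are running.
CMD if [ "${ENVIRONMENT}" = "production" ]; then \
        gunicorn --config gunicorn.conf.py service.main:app; \
    else \
        uvicorn service.main:app --host 0.0.0.0 --port 8080 --reload; \
    fi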

Dockerfiles

Because so many of the Docker images mentioned above have similar features, but are not similar enough to warrant a separate base image from which they all inherit, we have taken a novel approach to building Dockerfiles.

Our Dockerfiles use a component system.

A "component" is a partial Dockerfile; it contains a set of instructions which can be re-used in other Dockerfiles. It has three configuration features: component references, guards, and variables.

Component References

Component references can be used in Dockerfiles and components alike. A component reference is the name of a component surrounded by square brackets [], which expands into the content of that component.

For example, with no additional configuration, a component whose file is Dockerfile.component1 can be inserted into a Dockerfile (or other component) with the line [component1].
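
Concretely, if Dockerfile.component1 contained a single instruction (the contents and base image here are illustrative):

# Dockerfile.component1
RUN apt-get update

# A Dockerfile referencing it:
FROM python:3.11-slim
[component1]

# ...renders to:
FROM python:3.11-slim
RUN apt-get update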

Guards

Guards are sections of a component which can be toggled on or off. They begin with a comment formatted as # guard:name_of_guard, and end with the comment # endguard. They cannot be used in Dockerfiles, only components.

If a guard is referenced by name in a component reference, it is considered "on", and the contents within it are included. If it is not referenced, the guard's contents are entirely skipped.

For example, given the following guard in component component1:

# guard:my_guard
RUN echo 'hello world!'
# endguard

With a component reference in a Dockerfile [component1 my_guard], the line RUN echo 'hello world!' will appear in the rendered Dockerfile at the place where the guard exists.

With a component reference in a Dockerfile [component1], the line RUN echo 'hello world!' will not appear.

Variables

Variables are placeholders in components where values can be injected. Variables cannot be used in Dockerfiles, only components.

Variables are given values in component references via [component name:value] syntax. They are placed in components by simply writing the variable's name in all caps. Because every all-caps occurrence of the name is replaced, reserved Docker keywords such as RUN and COPY must not be used as variable names.

For example, a component component1 may have a line RUN echo 'ECHO_THIS'. The component reference [component1 echo_this:words] would result in the rendered Dockerfile having the line RUN echo 'words'.
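
Laid out step by step:

# In Dockerfile.component1:
RUN echo 'ECHO_THIS'

# Component reference in a Dockerfile:
[component1 echo_this:words]

# Rendered result:
RUN echo 'words'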

Contributing

When making changes to Dockerfiles, first consider whether the change you are making belongs to an existing component, and if it does, make the change there.

Our goal is to avoid having the same lines appear in multiple Dockerfiles. If you are adding functionality to one Dockerfile, it may be fine to write it inline in the Dockerfile itself. If you are adding functionality to multiple Dockerfiles, it is likely better off as a component, which you can then reference in the necessary Dockerfiles.