The Paystone ML CLI¶

A command-line interface is maintained for common functions within the machine learning codebase. CLIs make common, repetitive tasks easier by composing potentially many steps into a single command.

Our CLI is called psml. In this article we will outline the purpose of the CLI, its design, and a high-level categorization of its functionality.

Purpose¶

The CLI began as a tool to make machine learning engineers' lives easier.

As mentioned in the article on Docker images, virtually all actions one can take within the machine learning ecosystem center on executing commands in Docker containers. Docker commands, with their many volume mounts, environment variables, port mappings, and flags, are difficult to remember and write accurately. The CLI's main goal is to make these frequently used Docker commands easily accessible.

As continuous integration became a more prominent feature of our operations, a problem arose: the functions of steps in our CI builds were starting to heavily overlap local functionality covered by the CLI, without actually using the CLI itself.

And so a second purpose for the CLI was born: a unifying interface for all steps of CI builds.

The philosophy of the CLI is that a given command should do the same thing regardless of environment, modulo volume mounts. A given command can run locally, or in CI, or on a production server, and in all cases it would do the same thing, but the locations on disk that it persists results to might have mounts that come from different places on the host machine depending on the environment.

With the fusion of these two purposes, the CLI has this summarizing mission statement: provide a clean interface to the most common tasks in machine learning development, which can be used in any environment.

Design¶

The CLI is actually made up of two Python packages, and they correspond to the two purposes of the CLI as a whole.

- PSML (psml) is a package containing a Typer application that implements the various commands that we have chosen to include in the interface (discussed more in the next section).
  - This package accomplishes "providing a clean interface to the most common tasks in machine learning development".
- The PSML Engine (psml_engine) is a package that converts a user's CLI command into a Docker command, which runs the equivalent CLI command in the appropriate Docker container, with the appropriate Docker configuration.
  - This package enables psml to "be used in any environment".
  - This includes determining, among other things:
    - The name of the image to invoke.
    - The volumes to mount.
    - The ports to open.
    - The environment variables to set.
    - The command to run.
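
As a rough illustration of that conversion, the sketch below assembles a docker run argument list from the five pieces listed above. The DockerInvocation class, its field names, and the example mount are hypothetical and are not taken from psml_engine; only the standard docker run flags -v, -p, and -e are real.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DockerInvocation:
    """Hypothetical holder for the settings the engine has to determine."""

    image: str
    command: List[str]
    volumes: Dict[str, str] = field(default_factory=dict)  # host path -> container path
    ports: Dict[int, int] = field(default_factory=dict)  # host port -> container port
    environment: Dict[str, str] = field(default_factory=dict)

    def to_argv(self) -> List[str]:
        """Render the settings as an argument list for `docker run`."""
        argv = ["docker", "run"]
        for host_path, container_path in self.volumes.items():
            argv += ["-v", f"{host_path}:{container_path}"]
        for host_port, container_port in self.ports.items():
            argv += ["-p", f"{host_port}:{container_port}"]
        for name, value in self.environment.items():
            argv += ["-e", f"{name}={value}"]
        # The image comes after all options, followed by the in-container command.
        return argv + [self.image, *self.command]


invocation = DockerInvocation(
    image="us-central1-docker.pkg.dev/nicejob-production/psml:latest",
    command=["psml-cli", "test", "paystone", "storage"],
    volumes={"/tmp/example": "/home/paystone/example"},  # illustrative mount only
)
print(" ".join(invocation.to_argv()))
```

Keeping the configuration in a plain data structure and rendering it in one place is one way to let the same command description be reused across environments, with only the mount sources changing.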

These packages both have entrypoint scripts. Somewhat confusingly, but necessarily, the entrypoint for psml is psml-cli, and the entrypoint for psml_engine is psml.

Usage¶

A user invokes psml, the entrypoint of the engine, with the command they want to run. The engine converts the user's command into a Docker command, which runs the (usually equivalent) psml-cli command in an appropriately configured Docker container.

Though it is neither recommended nor an intended use case, the user could technically run the CLI command directly on their host by (in most cases) replacing psml with psml-cli in the original command. We will see an example of this later on, in the way that training and serving VMs use the CLI.

By a Developer¶

Let's take an example of a user's invocation. psml test is a group of commands for running PyTest over a set of unit tests. psml test paystone storage runs the tests for the Paystone package paystone.storage. Walking through what happens when this is run:

1. The user runs psml test paystone storage in the terminal.
2. The PSML Engine, whose entrypoint is psml, converts its arguments test paystone storage into a Docker command.
   - Omitting the details, this command is roughly docker run ... us-central1-docker.pkg.dev/nicejob-production/psml:latest psml-cli test paystone storage
3. Inside the container running the "psml" Docker image, psml-cli test paystone storage runs the function defined at cli/psml/psml/test/paystone.py.
4. The user sees terminal output from the Docker container, which in this case is PyTest test results.

By Continuous Integration¶

As discussed in the article on Docker images, this same process can also work in a CI setting. Here, a build step is run via the "deployment" Docker image, which follows this same process. The difference here is that we have a Docker container using the host Docker daemon (on the CI server) to run a separate Docker command, creating a second Docker container.

To facilitate this "independent of environment" usage, most host paths that the PSML Engine mounts as volumes have two forms: the CI form and the local form. For example, in CI, a volume mount of the experiments directory to /home/paystone/experiments would be /workspace/experiments:/home/paystone/experiments; locally, the equivalent mount would be $PWD/experiments:/home/paystone/experiments.
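
A minimal sketch of the two forms, using the experiments mount from the example above. The function name and the in_ci flag are hypothetical; how the real engine detects that it is running in CI is not covered here.

```python
import os
from typing import Optional

# Container-side path of the experiments directory, as given in the example above.
CONTAINER_EXPERIMENTS = "/home/paystone/experiments"


def experiments_mount(in_ci: bool, cwd: Optional[str] = None) -> str:
    """Return the HOST:CONTAINER mount string for the experiments directory."""
    # In CI the host side is /workspace/experiments; locally it is the
    # experiments directory under the current working directory ($PWD).
    host_root = "/workspace" if in_ci else (cwd or os.getcwd())
    return f"{host_root}/experiments:{CONTAINER_EXPERIMENTS}"


print(experiments_mount(in_ci=True))
print(experiments_mount(in_ci=False, cwd="/home/dev/repo"))
```

Only the host side of the mount changes; the container side stays fixed, which is what lets the command inside the container behave identically in both environments.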

In Production¶

In most cases, when the CLI is used directly on a deployed instance -- such as a serving or training VM -- it is through psml-cli rather than psml. This is because the command is already being run inside the particular Docker image it needs. This point requires some explanation.

Consider that on a developer's machine, the PSML Engine (and therefore the psml entrypoint) already exists by virtue of them having initialized their environment. On a CI server, all steps are running inside the Deployment image, which has the PSML Engine installed. In contrast, on a production server, the PSML Engine does not come pre-installed, nor do any Docker images come pre-pulled. We would need to pull the Deployment image to replicate the scenario that we find ourselves in during CI builds. Pulling the Deployment image, only for it to begin executing its given command by pulling the relevant image, is inefficient, particularly when we already know exactly which image and what configuration to use. Further, no volume mounts are necessary in a production setting: we are not mounting the code from a developer's branch, nor are we mounting the code from the branch of a CI build. In production, we are using master code, which is always baked into the latest version of the Docker image that is available to be pulled from the registry.

That was a long paragraph. To summarize:

- On a local machine, we use psml because we don't want to bother the developer with figuring out the Docker configuration.
- On a CI server, we use psml because we want the uniformity of having every build step use the Deployment image.
- On a production server, we use psml-cli because not all of the work that psml does in building the commands used by these servers is relevant to production (e.g. the volume mounts), and the portions that are relevant are easy to replicate (e.g. the choice of image).
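
The summary can also be written as a small lookup. This is only an illustrative sketch: the Environment enum and the entrypoint_for function are hypothetical names, not part of either package.

```python
from enum import Enum


class Environment(Enum):
    LOCAL = "local"
    CI = "ci"
    PRODUCTION = "production"


def entrypoint_for(environment: Environment) -> str:
    """Return the entrypoint script conventionally used in each environment."""
    if environment is Environment.PRODUCTION:
        # The command already runs inside the image it needs, so the engine is skipped.
        return "psml-cli"
    # Local and CI runs go through the engine, which builds the Docker command.
    return "psml"


for env in Environment:
    print(env.value, "->", entrypoint_for(env))
```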

The command that is part of the cloud-init script to launch a serving VM looks roughly like this:

docker run ... us-central1.../experiments:... psml-cli experiments serve ...

Top Level Functionality¶

Here we summarize the broad categories of functions that the CLI performs, without enumerating each of the specific commands. Information on how to get more specific details on the commands of the CLI is given in the next section.

Docker¶

Example: psml docker build [image]

These commands wrap the basic interface of Docker: building, pulling, and pushing images. They allow the user to reference images by their simple name (e.g. "notebooks") rather than their full identifier (e.g. "us-central1-docker.pkg.dev/nicejob-production/notebooks:latest").
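
The name expansion can be sketched as follows. The registry prefix and the "latest" tag come from the example identifier above; the function name and the assumption that every image follows this one pattern are illustrative.

```python
# Registry prefix taken from the example identifier in the text.
REGISTRY = "us-central1-docker.pkg.dev/nicejob-production"


def full_image_name(simple_name: str, tag: str = "latest") -> str:
    """Expand a simple image name such as "notebooks" into a full identifier."""
    return f"{REGISTRY}/{simple_name}:{tag}"


print(full_image_name("notebooks"))
# us-central1-docker.pkg.dev/nicejob-production/notebooks:latest
```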

This section is mostly here for convenience. It is used most often when another command needs a Docker image that is not available on the host and must be built. When a command fails for this reason, it should prompt the user with the relevant psml docker ... command to run.

Docs¶

Example: psml docs run

This section is for users to build and view this documentation hub as a Material-MkDocs static site.

Environment¶

Example: psml environment activate

This section relates to the clean code tool set. It has one command for each of the tools, as well as a command to "activate" the environment, optionally for some given experiments.

What does activating an environment mean? Type checking in particular requires that the third-party libraries of the code being checked be installed. With so many different experiments, each with unique dependencies, it would be difficult to manage this for all experiments at once. When an experiment is "activated", the dependencies for that experiment are installed in the virtual environment on the user's host machine. These dependencies can then be mounted when the type checking command is run.

Regardless of the experiments included, environment activation also creates a pyproject.toml that holds the configuration for all clean code tools.

Experiments¶

Example: psml experiments train [service] [major] [minor]

This section performs the various actions of the experiment lifecycle. From experiment creation based on a template through to the teardown or promotion of a service revision in the production API, each step is contained in a CLI command within this section.

Notebooks¶

Example: psml notebooks run

This section interfaces with Compute Engine instances for Jupyter Notebook environments, as well as local Jupyter deployment.

Playground¶

Example: psml playground run

This section manages the Playground auxiliary service, including its local and production deployment.

Local deployment might be used for some quick prototyping or exploration, or during development work on this service.

Remote¶

Example: psml remote train [service] [major] [minor]

This section concerns the management and usage of remote Compute Engine instances for training.

When a developer is preparing an experiment and wishes to perform a training run without committing to deploying the full lifecycle, they may do so with psml experiments train. However, for some experiments their host machine may simply not have the resources to complete this command in a reasonable time frame.

For "local" training runs for which the compute resources required exceed the capacity of a local machine, there are remote training machines. This group of commands allows users to create a remote machine, run an experiment on it with all of the code that is currently on their host machine, and then delete the instance.

Instances that are needed for extended periods can be stopped and started between uses to limit costs.

Services¶

Example: psml services run [service]

This section manages the non-Playground auxiliary services, including their local and production deployments.

Test¶

Example: psml test paystone [package]

This section contains PyTest commands that execute various categories of unit tests from the tests/ directory, or possibly the unit tests of a given experiment.

Help Documentation¶

The above sections are intentionally high-level to avoid getting into details that may change frequently regarding the interface or implementation of the particular commands within each group.

For any engineer curious about the specifics of a particular command, there is help documentation that is automatically generated by Typer based on the documentation in the code. Because the PSML Engine does not translate help commands reliably, it is recommended to use psml-cli directly for this.

For example, to view the documentation for psml test paystone storage, the command would be psml-cli test paystone storage --help. For the testing section in general, it would be psml-cli test --help, and so on for the other sections.
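
The rewriting rule in these examples is mechanical enough to sketch. The help_command function below is a hypothetical helper, not part of the CLI; it only builds the command string and does not run anything.

```python
import shlex


def help_command(psml_command: str) -> str:
    """Rewrite a psml command as the psml-cli call that shows its help."""
    tokens = shlex.split(psml_command)
    if not tokens or tokens[0] != "psml":
        raise ValueError("expected a command starting with 'psml'")
    # Swap the engine entrypoint for psml-cli and request help.
    return shlex.join(["psml-cli", *tokens[1:], "--help"])


print(help_command("psml test paystone storage"))
# psml-cli test paystone storage --help
print(help_command("psml test"))
# psml-cli test --help
```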