Operational Requirements for ML Processes¶

The goal of this article is to summarize the processes of machine learning from the perspective of operations. We list, with brief explanations:

Actions that the ML team must be able to perform.
Resources that the ML team must be able to deploy on a regular basis.
Resources that are deployed once as part of the global infrastructure.

Some actions are taken via a user account, some via a service account, and some via both. Each of the action sub-sections below corresponds to a process in machine learning, and each process should have its own service account.

Actions¶

Within each process, we group actions by whether they are needed by a user account, a service account, or both. For each action we briefly summarize why we need it.
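One way to picture this grouping is as a small table keyed by process, with each action tagged by the kind of account that needs it. The sketch below is illustrative only: the `Principal` enum, the `ACTIONS` table, and `actions_for` are hypothetical names, and the table holds just a few of the actions listed in this article rather than the full set.

```python
from enum import Enum


class Principal(Enum):
    USER = "user"
    SERVICE = "service"
    BOTH = "both"


# Hypothetical excerpt of the action lists in this article (not exhaustive).
ACTIONS = {
    "experimentation": [
        ("Create Compute Engine instances", Principal.USER),
        ("Read BigQuery data", Principal.BOTH),
        ("Read secrets", Principal.BOTH),
    ],
    "training": [
        ("Invoke Cloud Build triggers", Principal.USER),
        ("Write files to Cloud Storage", Principal.SERVICE),
        ("Read BigQuery data", Principal.BOTH),
    ],
}


def actions_for(process: str, account: Principal) -> list[str]:
    """Return the actions an account of the given kind needs in one process.

    Actions tagged BOTH apply to user accounts and service accounts alike.
    """
    if account is Principal.BOTH:
        raise ValueError("account must be USER or SERVICE")
    return [
        action
        for action, principal in ACTIONS[process]
        if principal in (account, Principal.BOTH)
    ]


print(actions_for("training", Principal.SERVICE))
```

A table like this could serve as the starting point for deriving the role bindings of each process's service account, though the mapping from actions to concrete IAM roles is not covered here.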

Experimentation¶

User account:
- Create Compute Engine instances.
  - Experimentation in notebooks can happen on Compute Engine VMs.
- List Compute Engine instances with details.
  - After creating a Compute Engine VM for experimentation, the developer will need to find its IP address.
- Delete Compute Engine instances.
  - When experimentation concludes, the developer will terminate their instance.
Both:
- Read BigQuery data.
  - Building datasets for training requires pulling data from OLAP.
- Read BigQuery metadata (tables).
  - Experimentation often requires examining what data is available.
- Read graph data.
  - Building datasets for training requires pulling data from the graph database.
- Read files from Cloud Storage.
  - The developer may want to load model artifacts from previous experiments.
- Read secrets.
  - The developer may require secrets that grant access to third-party APIs they want to experiment with, e.g. OpenAI.
- List secrets.
  - The developer may want to know what third-party APIs are available, which is discoverable by listing the ML project's secrets.
- Write secrets.
  - The developer may enable access to a new third-party API during experimentation and need to write the corresponding secret.
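As an illustration of why the "list with details" action matters, the sketch below pulls IP addresses out of an instance resource as returned by the Compute Engine API. The field names (`networkInterfaces`, `networkIP`, `accessConfigs`, `natIP`) come from the instance resource; the helper name `instance_ips` and the example values are made up for this sketch.

```python
def instance_ips(instance: dict) -> dict:
    """Collect internal and external IPs from a Compute Engine instance resource."""
    internal, external = [], []
    for nic in instance.get("networkInterfaces", []):
        if "networkIP" in nic:
            internal.append(nic["networkIP"])
        # External addresses, if any, live in the access configs of each interface.
        for access in nic.get("accessConfigs", []):
            if "natIP" in access:
                external.append(access["natIP"])
    return {"internal": internal, "external": external}


# Made-up example resource, trimmed to the fields used above.
example = {
    "name": "notebook-vm",
    "networkInterfaces": [
        {"networkIP": "10.128.0.5", "accessConfigs": [{"natIP": "203.0.113.7"}]}
    ],
}
print(instance_ips(example))
```

An account that can list instances but not read their details would not see these fields, which is why the action is stated as "list with details".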

Training¶

User account:
- Invoke Cloud Build triggers.
  - Training runs that result in artifacts persisted to Cloud Storage can only happen via a service account through a manual CI trigger; developers cannot run these ad-hoc from their local environment.
- Create Compute Engine instances.
  - The developer can execute training remotely on a Compute Engine instance when they require more compute resources.
- Delete Compute Engine instances.
  - The developer is responsible for shutting down their own remote training instances.
Service account:
- Delete Compute Engine instances.
  - The Compute Engine instance will make an API call to shut itself down on completion of the training process.
- Write logs to Cloud Logging.
  - The logs from the Docker container for training are routed to Cloud Logging.
- Read files from Cloud Storage.
  - Training may require reading artifacts from other experiments.
- Write files to Cloud Storage.
  - The end goal of training is to write artifacts to Cloud Storage.
Both:
- Read BigQuery data.
  - Users can execute local training runs, and these runs may require reading OLAP data.
- Read graph data.
  - Users can execute local training runs, and these runs may require reading graph data.
- Read Artifact Registry artifacts.
  - Training happens in a Docker container, whose image is stored on Artifact Registry.
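The self-shutdown step of the training service account can be sketched as a call to the Compute Engine `instances.delete` REST method. The sketch below only builds the request and does not send it. It assumes the instance has already read its own name, zone, and project ID from the metadata server (`instance/name`, `instance/zone`, `project/project-id`) and obtained an access token for its service account; the function names are hypothetical.

```python
import urllib.request

COMPUTE_API = "https://compute.googleapis.com/compute/v1"


def build_delete_url(project_id: str, zone_metadata: str, instance_name: str) -> str:
    """Build the instances.delete URL for the instance itself."""
    # The metadata server reports the zone as "projects/<number>/zones/<zone>";
    # only the last path segment is needed here.
    zone = zone_metadata.rsplit("/", 1)[-1]
    return f"{COMPUTE_API}/projects/{project_id}/zones/{zone}/instances/{instance_name}"


def build_delete_request(url: str, access_token: str) -> urllib.request.Request:
    """Prepare (but do not send) the authenticated DELETE request."""
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {access_token}"},
    )


url = build_delete_url("my-ml-project", "projects/123456/zones/us-central1-a", "trainer-1")
print(url)
```

Sending the prepared request with `urllib.request.urlopen` is what requires the "Delete Compute Engine instances" action on the service account.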

Serving¶

User account:
- Invoke Cloud Build triggers.
  - Service revision deployments can only happen via a service account through a manual CI trigger; developers cannot run these ad-hoc from their local environment.
Service account:
- Read Artifact Registry artifacts.
  - Updates to the experimentation Docker image require pulling the existing image.
- Write Artifact Registry artifacts.
  - Updates to the experimentation Docker image require pushing the rebuilt image.
- Deploy Cloud Endpoints services.
  - This is a passively deployed component of the resource stack discussed below.
- Enable Cloud Endpoints services.
  - This is a passively deployed component of the resource stack discussed below.
- Update OpenAPI spec for Cloud Endpoints services.
  - This is an actively deployed update to a passively deployed component of the resource stack discussed below.
- Create instance templates.
  - This is an actively deployed component of the resource stack discussed below.
- Delete instance templates.
  - This is an actively deployed component of the resource stack discussed below.
- Create instance group managers.
  - This is an actively deployed component of the resource stack discussed below.
- Delete instance group managers.
  - This is an actively deployed component of the resource stack discussed below.
- Create autoscalers.
  - This is an actively deployed component of the resource stack discussed below.
- Delete autoscalers.
  - This is an actively deployed component of the resource stack discussed below.
- Create health checks.
  - This is an actively deployed component of the resource stack discussed below.
- Delete health checks.
  - This is an actively deployed component of the resource stack discussed below.
- Create backend services.
  - This is an actively deployed component of the resource stack discussed below.
- Delete backend services.
  - This is an actively deployed component of the resource stack discussed below.
- Create URL maps.
  - This is a passively deployed component of the resource stack discussed below.
- Update URL maps.
  - This is an actively deployed update to a passively deployed component of the resource stack discussed below.
- Delete URL maps.
  - This is a passively deployed component of the resource stack discussed below.
- Create target HTTP proxies.
  - This is a passively deployed component of the resource stack discussed below.
- Delete target HTTP proxies.
  - This is a passively deployed component of the resource stack discussed below.
- Create forwarding rules.
  - This is a passively deployed component of the resource stack discussed below.
- Delete forwarding rules.
  - This is a passively deployed component of the resource stack discussed below.
Both:
- Read BigQuery data.
  - Developers have the option of serving service revisions locally for ad-hoc testing, and whether local or deployed, those serving applications may require access to OLAP data.
- Read graph data.
  - Developers have the option of serving service revisions locally for ad-hoc testing, and whether local or deployed, those serving applications may require access to graph data.
- Read Spanner data.
  - Developers have the option of serving service revisions locally for ad-hoc testing, and whether local or deployed, those serving applications may require access to OLTP data.
- Read files from Cloud Storage.
  - Developers have the option of serving service revisions locally for ad-hoc testing, and whether local or deployed, those serving applications may require access to model artifacts from other experiments.
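The "Update URL maps" action is the one place where a service revision deployment modifies a long-lived resource. The sketch below shows what such an update might look like on the URL map body, under the assumption (not stated in this article) that each revision is routed by its own path prefix. The `pathMatchers`, `pathRules`, `paths`, and `service` fields belong to the URL map resource; the function name and the routing scheme are hypothetical.

```python
import copy


def add_revision_route(url_map: dict, revision: str, backend_service_url: str) -> dict:
    """Return a copy of the URL map with a path rule for one service revision."""
    updated = copy.deepcopy(url_map)
    matcher = updated["pathMatchers"][0]
    rules = matcher.setdefault("pathRules", [])
    path = f"/{revision}/*"
    # Replace an existing rule for the same path rather than duplicating it.
    rules[:] = [rule for rule in rules if path not in rule.get("paths", [])]
    rules.append({"paths": [path], "service": backend_service_url})
    return updated


# Made-up URL map body, trimmed to the fields used above.
url_map = {
    "name": "ml-url-map",
    "defaultService": "default-backend",
    "pathMatchers": [{"name": "revisions", "defaultService": "default-backend"}],
}
updated = add_revision_route(url_map, "rev-a", "backend-rev-a")
print(updated["pathMatchers"][0]["pathRules"])
```

Working on a copy keeps the original body available for comparison or rollback before the updated map is submitted.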

CI/CD¶

Service account:
- Read Artifact Registry artifacts.
  - Experimentation, training, and serving happen within Docker containers, whose images are updated (pulled, rebuilt and pushed) on a master branch CI trigger.
- Write Artifact Registry artifacts.
  - Experimentation, training, and serving happen within Docker containers, whose images are updated (pulled, rebuilt and pushed) on a master branch CI trigger.
- Deploy Cloud Run service revisions.
  - Most auxiliary services are deployed on Cloud Run.
- Restart Compute Engine instances.
  - Some auxiliary services are deployed on Compute Engine; these have their Docker images reloaded by restarting the instance.
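The pull, rebuild, and push cycle for an image on Artifact Registry can be sketched as three Docker commands. The image path follows the Artifact Registry format `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`; the function names, the example values, and the use of `--cache-from` to reuse layers from the pulled image are assumptions of this sketch rather than a description of the actual CI configuration.

```python
def image_ref(location: str, project: str, repository: str, image: str, tag: str = "latest") -> str:
    """Build an Artifact Registry Docker image reference."""
    return f"{location}-docker.pkg.dev/{project}/{repository}/{image}:{tag}"


def update_commands(ref: str, context: str = ".") -> list[list[str]]:
    """Return the commands for one pull, rebuild, and push cycle."""
    return [
        ["docker", "pull", ref],  # needs read access to Artifact Registry
        ["docker", "build", "--cache-from", ref, "--tag", ref, context],
        ["docker", "push", ref],  # needs write access to Artifact Registry
    ]


ref = image_ref("us-central1", "my-ml-project", "ml-images", "training")
for command in update_commands(ref):
    print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)` by a CI step; the pull and push steps are where the read and write actions above are exercised.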

Actively Deployed Resources¶

We define "actively deployed resources" as those which are continuously being created and torn down as the result of the experiment lifecycle. These deployments and undeployments are the result of manually triggered actions by developers via our CI framework.

Compute Engine instances.
- Instances running Jupyter for the Experimentation process.
- Instances executing training modules for the Training process.
Instance templates.
- A component of the resource stack for service revisions.
Instance group managers.
- A component of the resource stack for service revisions.
Autoscalers.
- A component of the resource stack for service revisions.
HTTP health checks.
- A component of the resource stack for service revisions.
Backend services.
- A component of the resource stack for service revisions.
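Because the components of a service revision stack reference one another, creation and teardown need a consistent order. The sketch below derives one with Python's `graphlib`, assuming the usual references between these resources (an instance group manager uses an instance template, an autoscaler targets an instance group manager, and a backend service uses an instance group and a health check). The dependency table is an assumption of this sketch and should be checked against the actual deployment scripts.

```python
from graphlib import TopologicalSorter

# Each key depends on the resources in its set (assumed dependencies).
DEPENDS_ON = {
    "instance template": set(),
    "health check": set(),
    "instance group manager": {"instance template"},
    "autoscaler": {"instance group manager"},
    "backend service": {"instance group manager", "health check"},
}

# Dependencies come first when creating; teardown runs in the reverse order.
create_order = list(TopologicalSorter(DEPENDS_ON).static_order())
delete_order = list(reversed(create_order))

print("create:", create_order)
print("delete:", delete_order)
```

Several valid orders exist, so the exact sequence printed may vary; what matters is that every resource is created after, and deleted before, the resources it depends on.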

Passively Deployed Resources¶

We define "passively deployed resources" as those which are created once when a GCP project is set up, and are never torn down. Some of these may regularly be updated as part of experiment lifecycles. Below we note whether each resource is "static" (not updated after creation) or "dynamic" (regularly updated).

URL map.
- Dynamic, regularly updated.
Target HTTP proxy.
- Static, not updated.
Forwarding rule.
- Static, not updated.
Cloud Endpoints service.
- Static, not updated.
Cloud Endpoints OpenAPI schema.
- Dynamic, regularly updated.
Cloud Run services.
- Not applicable to experiment lifecycles.
- Dynamic, regularly updated.
Compute Engine instances.
- Instances running some auxiliary services.
- Not applicable to experiment lifecycles.
- Static, infrequently updated.
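The classification above can be captured in a small table, which makes it easy to ask which long-lived resources the experiment lifecycle actually modifies. The `PassiveResource` class and its field names are hypothetical; the values restate the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassiveResource:
    name: str
    dynamic: bool               # regularly updated after creation
    experiment_lifecycle: bool  # touched by experiment lifecycles


PASSIVE_RESOURCES = [
    PassiveResource("URL map", dynamic=True, experiment_lifecycle=True),
    PassiveResource("Target HTTP proxy", dynamic=False, experiment_lifecycle=True),
    PassiveResource("Forwarding rule", dynamic=False, experiment_lifecycle=True),
    PassiveResource("Cloud Endpoints service", dynamic=False, experiment_lifecycle=True),
    PassiveResource("Cloud Endpoints OpenAPI schema", dynamic=True, experiment_lifecycle=True),
    PassiveResource("Cloud Run services", dynamic=True, experiment_lifecycle=False),
    PassiveResource("Compute Engine instances (auxiliary services)", dynamic=False, experiment_lifecycle=False),
]


def updated_by_experiments(resources: list[PassiveResource]) -> list[str]:
    """Names of passively deployed resources that experiment lifecycles update."""
    return [r.name for r in resources if r.dynamic and r.experiment_lifecycle]


print(updated_by_experiments(PASSIVE_RESOURCES))
```

Under this reading, only the URL map and the Cloud Endpoints OpenAPI schema need update permissions on the serving service account; the remaining resources are either static or updated through CI/CD instead.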