
The MLServing API

As described in our definition of machine learning as a second party, machine learning services are made available to the broader organization via a single API. We refer to this as the "MLServing API".

What is the structure of this interface? What compute components make up its architecture? How are consumers expected to interact with the API? These are the questions we will answer in this article, both in written form and using visuals.

API Versioning

Hierarchy

To quote semver.org:

Given a version number major.minor.patch, increment the:

  1. Major version when you make incompatible API changes.
  2. Minor version when you add functionality in a backwards compatible manner.
  3. Patch version when you make backwards compatible bug fixes.

In line with this approach, the endpoints of the MLServing API have 5 levels of hierarchy: service, major version, minor version, patch version, and path. Each of these has a precise meaning.

  • A Service is a machine learning capability.
    • It could be a general-purpose capability with specific applications, or a specific capability for a particular application, depending on how we've decided to solve a problem.
  • A major version of a Service is an implementation of the Service's capability via a particular schema.
    • This includes the input and output schema for the Service.
    • The major version of a service is incremented if and only if the schema for the Service changes.
  • A minor version of a Service is a particular solution that provides the Service's capability as the result of executing the experiment lifecycle.
    • Whenever the experiment lifecycle is iterated and a new solution is produced, the result is a new minor version, provided the schema does not change.
  • A patch version of a Service is a re-deployment of a particular solution that updates the environment surrounding the solution.
    • The artifacts used by a solution cannot change from patch version to patch version. Changing the artifacts is viewed as a new solution to the problem, which requires iterating the experiment lifecycle and therefore creates a new minor version.
    • The serving logic used by a solution may change from patch version to patch version. Changing the serving logic without changing the artifacts used is viewed as a bug fix.
    • The training and serving infrastructure for the experiment may change from patch version to patch version.
    • The underlying machine learning platform may change from patch version to patch version.
    • When a new patch version is deployed, the previous patch version is undeployed.
  • A path within a Service provides an alternative approach to the Service's central problem.
    • Paths may be used to expose multiple solutions that vary in how they manage the trade-off between accuracy and latency. For example, within an NLP capability like content generation, a single service may expose a large language model as well as a simple recurrent network. One will be more accurate, and the other will be faster.
    • Paths must share Instance and Prediction schema, though their Parameters schema may differ.

Let's highlight some of the key takeaways.

First, major versions are explicitly tied to schema. This is because the schema of a service forms the basis for how it is evaluated. The same suite of tests can be used to evaluate any number of solutions as long as they have the same input and output schema. This allows us to compare minor versions in a consistent way. They can also be used the same way by consumers.

Second, patch versions have a specific definition, and the goal is to remove a decision point for engineers: it should be obvious whether a new revision is a minor version update or a patch version update. The bullet points above can be used as a checklist for this decision.

Third, multiple paths are meant only to provide alternative approaches to the same problem, not to group problems together within a Service. This is because of the experimentation process: minor versions are required to iteratively improve on their performance according to the established tests and metrics for the associated major version. When multiple capabilities are contained within the same Service, that linearity of improvement is broken. Does a minor version that does better at sub-problem A but worse at sub-problem B than the existing minor version deserve to be promoted? That question is very difficult to answer, and having to answer it is an undesirable quality of an experimentation framework.

Naming

Major versions are named according to the format "v#", starting with 1. Minor versions are formatted "m#", starting with 0. Patch versions are formatted "p#", starting with 0.

The 3rd patch version within the 5th minor version of the 2nd major version of a service called "housing", with a path called "predict", would have the following endpoint in the MLServing API: /housing/v2/m4/p2/predict.
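
As a quick illustration, here is a small Python sketch of how these components combine into an endpoint path (the helper function is ours, not part of the API):

```python
def revision_path(service: str, major: int, minor: int, patch: int, path: str) -> str:
    """Build the core MLServing endpoint path for a specific revision."""
    return f"/{service}/v{major}/m{minor}/p{patch}/{path}"

# The example from above: 2nd major (v2), 5th minor (m starts at 0), 3rd patch (p starts at 0).
assert revision_path("housing", 2, 4, 2, "predict") == "/housing/v2/m4/p2/predict"
```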

Path Redirects

All revisions are deployed to at least one endpoint in the MLServing API. These core paths follow the structure /service/major/minor/patch/path. Convenience endpoints are also deployed; these redirect their requests to one of the core endpoints according to the following version rules.

  • /service/major/minor/path redirects to the latest patch version.
    • For example, if the "housing" service, with a path "predict", had a major version 1 with a minor version 3 which had 5 patch versions (p0 through p4), then /housing/v1/m3/predict would redirect to /housing/v1/m3/p4/predict.
  • /service/major/path redirects to the currently "promoted" minor version (or versions, if an A/B test is active).
    • For example, if the "housing" service had a successfully promoted minor version 3 under major version 1, with a path "predict", then /housing/v1/predict would redirect to /housing/v1/m3/predict.
    • If the "housing" service had a successfully promoted minor version 3 under major version 1, and a new minor version 4 was being A/B tested to potentially be promoted over version 3, then /housing/v1/predict would route some percentage of its traffic to /housing/v1/m3/predict, and the remainder to /housing/v1/m4/predict.

Combining these rules, we find that at all times, invocations of the MLServing API that hit /service/major/path endpoints will invoke the latest patch version of the currently promoted revision(s).
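
To make these rules concrete, here is an illustrative Python sketch of the resolution logic. The real routing is performed by the load balancer's URL map, the deployment state below is a hypothetical stand-in, and A/B traffic splits are ignored here:

```python
import re

# Hypothetical deployment state for a single service.
latest_patch = {("housing", "v1", "m3"): "p4"}   # latest patch per (service, major, minor)
promoted_minor = {("housing", "v1"): "m3"}       # promoted minor per (service, major)

def resolve(path: str) -> str:
    """Expand a convenience path to the core /service/major/minor/patch/path form."""
    parts = path.strip("/").split("/")
    service, major, rest = parts[0], parts[1], parts[2:]
    if not re.fullmatch(r"m\d+", rest[0]):      # no minor version given -> promoted minor
        rest = [promoted_minor[(service, major)]] + rest
    minor = rest[0]
    if not re.fullmatch(r"p\d+", rest[1]):      # no patch version given -> latest patch
        rest = [minor, latest_patch[(service, major, minor)]] + rest[1:]
    return "/" + "/".join([service, major] + rest)

assert resolve("/housing/v1/predict") == "/housing/v1/m3/p4/predict"
assert resolve("/housing/v1/m3/predict") == "/housing/v1/m3/p4/predict"
```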

It is therefore recommended that service consumers use the /service/major/path endpoint for service invocation.

Simultaneity

Multiple major versions will only be available simultaneously during a period of transition, and these periods will be coordinated between the machine learning team and the product leaders. Running multiple active major versions is never an intended long-term state of the system.

Multiple minor versions may serve the traffic of a single major version endpoint if an A/B test between the versions is currently active. This is a testing mechanism managed within the "walls" of machine learning by the machine learning engineers, and it does not require any thought or intervention on the part of the API consumer, assuming they are using the recommended endpoints above.
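
During such a test, the routing behind the promoted-minor redirect amounts to a weighted choice between the competing minor versions. A minimal sketch, where the paths and weights are placeholders:

```python
import random

# Illustrative only: the split is handled inside the load balancer, not by consumers.
split = [("/housing/v1/m3/predict", 0.9), ("/housing/v1/m4/predict", 0.1)]

def pick_backend() -> str:
    """Choose a backend path according to the configured A/B weights."""
    paths, weights = zip(*split)
    return random.choices(paths, weights=weights, k=1)[0]
```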

Multiple patch versions will never be available simultaneously, save for a brief moment of transition that is handled automatically at the time of the new patch version's deployment.

Visual Summary

[Diagram: API versioning hierarchy and endpoint redirects]

Compute Stack

There are common resources used by all deployed ML services, and there are resources unique to each revision. We describe the stack in the order in which resources are created, beginning with the unique resources and concluding with the common resources.

We follow the Cloud Endpoints for Managed Instance Groups document as a basis for deploying this API.

Instance Template

An instance template is a specification for the creation of a Compute Engine VM.

Our instance templates use a cloud-init script defined as instance metadata to trigger the automatic deployment of two Docker containers. One runs the FastAPI application for the given service revision; the other runs an ESPv2 proxy that directs the instance's incoming traffic to the FastAPI container. The port that the ESPv2 container is exposed on is also defined in the cloud-init script.
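
As a point of reference, the serving application inside the first container might be shaped like the minimal FastAPI sketch below. The route and field names are illustrative assumptions, since the real schema is defined per major version of each service:

```python
from typing import Any, Dict, List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    instances: List[Dict[str, Any]]       # Instance schema (shared across paths)
    parameters: Dict[str, Any] = {}       # Parameters schema (may differ per path)

class PredictResponse(BaseModel):
    predictions: List[Dict[str, Any]]     # Prediction schema (shared across paths)

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # A real revision would load its artifacts at startup and run inference here.
    return PredictResponse(predictions=[{} for _ in request.instances])
```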

The revision infrastructure manages the hardware specification of the instance template, namely the machine type, GPU configuration, and disk configuration.

So far, with only this resource, we can deploy a single Compute Engine VM that, if it were to somehow receive HTTP(S) traffic on a certain port, would invoke a container running our service revision's serving application.

Instance Group

An instance group, specifically a Managed Instance Group, is a collection of stateless, independent Compute Engine VMs which run the same instance template.

An instance group is not configurable by the revision infrastructure.

With these two resources, we can now deploy multiple copies of a Compute Engine VM that runs our service revision's serving application on a certain port.

Autoscaler

An autoscaler attaches to a Managed Instance Group and gives it the ability to add or remove instances based on its demand.

The autoscaler requires a metric over the instances in its group in order to decide when to scale "in" (remove instances) or "out" (add instances). Currently, this metric is CPU utilization, and the revision infrastructure sets its target percentage, along with the minimum and maximum number of instances. The infrastructure parameter "cooldown" gives a number of seconds during which autoscaling metrics are not evaluated, essentially giving instances time to "boot up" before their usage is tracked. We set the cooldown equal to the request timeout, which in turn roughly corresponds to an upper bound on the time needed to load artifacts.
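
The scaling decision itself follows standard target-utilization arithmetic. A rough sketch of the calculation, not the actual autoscaler implementation:

```python
import math

def recommended_size(current_size: int, avg_cpu_utilization: float,
                     target_utilization: float, min_instances: int,
                     max_instances: int) -> int:
    """Approximate the target-utilization scaling rule for a CPU-based autoscaler."""
    desired = math.ceil(current_size * avg_cpu_utilization / target_utilization)
    return max(min_instances, min(desired, max_instances))

# Example: 3 instances averaging 90% CPU against a 60% target -> scale out to 5.
assert recommended_size(3, 0.90, 0.60, min_instances=1, max_instances=10) == 5
```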

With the above three resources deployed, we have a set of independent stateless VMs running copies of our serving application with an autoscaling policy.

Health Check

A health check attaches to a Managed Instance Group and continually ensures the availability of each of its instances by invoking a given health endpoint on a frequent schedule.

The health check happens at a cadence equal to the request timeout set in the infrastructure configuration for the revision.
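
On the application side, this assumes the serving container exposes a route the health check can invoke. A minimal sketch, with a hypothetical /health path:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Return 200 only once artifacts are loaded and the application is ready to serve.
    return {"status": "ok"}
```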

With this, we have an autoscaling set of VMs running our serving application, with the assurance that any unresponsive instances will have traffic directed away from them.

Backend Service

A backend service exposes a set of VMs to a load balancer, defining how traffic to the load balancer is routed to the VMs. In our case, it exposes the Managed Instance Group to the load balancer.

The backend service sets the request timeout parameter as well as the amount of time that requests are given to complete when a VM is scaled in, both of which come from the infrastructure specification.
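
Pulling together the revision-scoped parameters mentioned so far, the infrastructure specification for a revision might look roughly like the following sketch; the field names and default values are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RevisionInfrastructure:
    """Hypothetical shape of a revision's infrastructure configuration."""
    machine_type: str = "n1-standard-4"      # instance template: machine type
    gpu_type: Optional[str] = None           # instance template: GPU configuration
    disk_size_gb: int = 100                  # instance template: disk configuration
    cpu_utilization_target: float = 0.6      # autoscaler: target metric percentage
    min_instances: int = 1                   # autoscaler: lower bound
    max_instances: int = 10                  # autoscaler: upper bound
    cooldown_seconds: int = 300              # autoscaler: set equal to the request timeout
    request_timeout_seconds: int = 300       # backend service: request timeout
    drain_timeout_seconds: int = 60          # backend service: time to finish requests on scale-in
```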

Our autoscaling instance group now has an interface through which its instances can receive traffic.

URL Map

A URL map routes incoming HTTP requests to various backend services based on a set of rules that operate over the request's host, path, and headers.

Its configuration includes the path and request-header rules used to route requests, along with the backend services those rules invoke.

The URL map is the first resource in the stack that is deployed singularly for the entire MLServing API as part of the load balancer. It is also the most frequently updated among all of the shared resources; each deployment of a backend service requires an update to the URL map.

Since it's a shared resource, the URL map is configured independently of the revision infrastructure configuration.

The URL map has unified all of our backend services running independent ML service revisions into a single interface.

Target HTTP Proxy

The target HTTP proxy manages connections to the backends managed by an associated URL map, terminating and creating connections as needed.

Its configuration is simple, containing only a name and a URL map to be associated with.

With this component, our URL map can now receive requests and route them to the backend services.

Forwarding Rule

A forwarding rule, particularly an external forwarding rule, accepts traffic from any client system with internet access, including those outside of Google Cloud, and forwards the traffic to a target HTTP proxy.

Its configuration includes an IP address which accepts traffic, a load balancing scheme (for us, external managed), a port range to expose, and a target HTTP proxy.

With this component, our API has a frontend: an IP address and range of exposed ports that forward their requests to a target proxy, which creates a connection through the URL map to available backend services containing serving applications running in containers.

Cloud Endpoint

Cloud Endpoints is an API management system that helps to secure, monitor, analyze, and set quotas on APIs, using the OpenAPI specification. It provides a domain for the IP address which we provide as its backend.

Its configuration is contained within an openapi.yaml file which defines the endpoints within the API, their schema, the IP address at which requests are received, and the security associated with the API. It is updated in unison with the URL map, as the paths in these two resources are intended to reflect each other at all times.

With this component, our API has telemetry, and it also has security provided out-of-the-box with options for configuration. For example, it allows us to accept traffic only when it is authenticated with a GCP service account or with a managed API key.

Visual Summary

[Diagram: MLServing compute stack]

Expected Usage

Consumers of the MLServing API will benefit from its greatly simplified interface when it comes to exposing the results of continuous experimentation and re-deployment of services. While the many major, minor, and patch versions of each of our services result in many different resource stacks and many different endpoints in the unified API, the relevant portion of the interface for consumers is relatively small.

Here are the rules for identifying the correct way to invoke the MLServing API:

  • Identify the service you want to invoke. Let's call this our $SERVICE variable.
  • Consult these docs in the "ML Services" section to see which is the latest major version of the service. Let's call this $MAJOR.
  • Find also in these docs the input and output schema for this major version of the service.
  • Determine which path you want to invoke the given service with. Let's call this $PATH.
  • Construct your target URL as follows: https://mlserving.nicejob-production.cloud.goog/$SERVICE/$MAJOR/$PATH.
  • Make a request to the target URL with a request body that follows the schema of the major version for the service.
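
Putting these steps together, here is a minimal invocation sketch in Python. The service name, request body fields, and API-key handling are illustrative assumptions; the real body must follow the major version's documented schema:

```python
import requests

SERVICE = "housing"            # the service to invoke
MAJOR = "v1"                   # latest major version, per the "ML Services" docs
PATH = "predict"               # the chosen path within the service

url = f"https://mlserving.nicejob-production.cloud.goog/{SERVICE}/{MAJOR}/{PATH}"

# Placeholder body; the "key" parameter assumes the API is protected by a managed API key.
response = requests.post(
    url,
    params={"key": "YOUR_API_KEY"},
    json={"instances": [{"sqft": 1200, "bedrooms": 3}], "parameters": {}},
    timeout=30,
)
response.raise_for_status()
print(response.json())         # predictions follow the output schema of the major version
```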

As a summary of the above two sections, we quickly recap what happens when this request is made:

  • The Cloud Endpoint service converts the domain to the IP address of our forwarding rule.
  • The forwarding rule forwards the request to the target HTTP proxy.
  • The target HTTP proxy creates a connection, using the URL map to analyze the request's properties and select the appropriate backend service.
    • The URL map maps the "service + major" path to the "service + major + minor" path for the currently promoted minor version.
    • The "service + major + minor" path is mapped to the "service + major + minor + patch" path for the latest patch version.
  • The chosen backend service uses its traffic distribution rules -- simple as they are -- to select the instance group to which the request is sent. It is the one available instance group in the backend configuration.
  • The instance group evenly distributes these requests among its VMs, such as by a round-robin method.
  • The chosen VM receives the request on its exposed port, where the ESPv2 proxy is listening.
  • The ESPv2 proxy forwards the request to the container running our serving application.
  • The serving application resolves the request, sending a response to the proxy, which is sent back to the client, completing the transaction.
  • The target HTTP proxy terminates the connection.