Enrichment

As discussed in the description of the role of machine learning engineers, machine learning engineers create auxiliary services when a capability that serves their own needs also has utility as part of their "public API", consumable by the application. Enrichments, while driven primarily by the needs of machine learning experiments, fall into this category. What are enrichments?

The Enrichment API is a service deployed on Cloud Run that exposes a collection of real-time calculations over the graph component of the CDP. The graph database powers calculations that are virtually impossible to perform efficiently in a relational database or data warehouse. We call these graph-powered calculations "enrichments", because they enrich the raw data that lives in the relational component of our CDP. Typically, an enrichment is a calculation that traverses many relationships in the graph to produce a result for even a single entity.
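
To make this concrete, here is a minimal sketch of the kind of traversal an enrichment wraps. The node labels, relationship types, and the paystone.neo interface shown are illustrative assumptions, not our actual schema or API; the service documentation defines the real ones.

```python
# A purely illustrative sketch of the kind of traversal an enrichment wraps.
# The node labels, relationship types, and the paystone.neo interface here
# are assumptions for illustration, not the actual schema or API.
from paystone import neo  # hypothetical import path

# "Venues that share the most customers with a given venue": a multi-hop
# traversal that is cheap in a graph database but painful in SQL.
SHARED_CUSTOMERS = """
MATCH (v:Venue {id: $venue_id})<-[:VISITED]-(c:Customer)-[:VISITED]->(other:Venue)
WHERE other.id <> $venue_id
RETURN other.id AS venue_id, count(DISTINCT c) AS shared_customers
ORDER BY shared_customers DESC
LIMIT 10
"""

rows = neo.run(SHARED_CUSTOMERS, venue_id="venue-123")
```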

There are two key stakeholders in this service: growth platform engineers, and machine learning engineers. Growth platform engineers are purely consumers of the service, while machine learning engineers are both consumers and producers. It is important, then, for this article to set expectations for both groups around responsibilities, intended usage, and the service's relationship to each.

We'll also discuss in general how the service works, while leaving the implementation details to the service documentation itself.

Development and Consumption by Machine Learning

In keeping with the rest of this documentation, we focus on the general processes and concepts here. For details on how to develop enrichments -- the structure of the paths, the expected request and response data structures -- see the documentation in the service itself.

To understand both how machine learning engineers should develop enrichments and how they should consume them, we can examine the service's role in the experiment lifecycle; this answers both questions.

During the experimentation phase, a machine learning engineer may explore many potential graph computations that could be useful in solving their problem. As we have discussed, our goal is to iterate with some velocity between experimentation and service revisions, to take full advantage of the lifecycle and its tooling. When an experimental idea seems at least somewhat likely to produce promising results, the remainder of the lifecycle is activated, and a full experiment is created.

It is at the transition between experimentation in notebooks and development of a service revision that the engineer formalizes the enrichments which are necessary for their experiment -- and only these -- by creating GitHub issues for them. Concretely, the process might look like this:

  1. The engineer experiments with ideas for a new service or a revision in notebooks; they use existing enrichments by invoking the enrichment service with paystone.api, and they write graph queries with paystone.neo for new enrichment ideas they may find useful (see the sketch after this list).
  2. Out of the 8 new graph queries they create, they find that 3 are useful for the model they intend to build into a full experiment.
  3. Before proceeding to the next step of the lifecycle, they create 3 GitHub issues: one for each of the new graph queries they require. The goal of each issue is to add one of the graph queries to the Enrichment API as an endpoint.
  4. After completing work on these 3 issues, they can resume the remainder of the lifecycle.
  5. In serving and training modules where their graph queries (now enrichments) are required, they invoke the enrichment service using paystone.api.
    • For the transition between steps 4 and 5 to be smooth, fairly fast turnaround on peer reviews is required.
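
A rough sketch of the notebook side of step 1 follows. The paystone.api and paystone.neo function signatures here are assumptions for illustration; consult the package and service documentation for the actual interfaces.

```python
# Hypothetical sketch of the notebook workflow in step 1. The paystone.api
# and paystone.neo function names are assumptions, not the real interfaces.
from paystone import api, neo

# Consume an existing enrichment through the deployed service.
existing = api.enrichment(
    name="shared-customers",   # assumed enrichment name
    entities=["venue-123"],    # entities to enrich
    parameters={"limit": 10},  # enrichment-specific parameters
)

# Prototype a candidate enrichment directly against the graph.
candidate = neo.run(
    "MATCH (v:Venue {id: $venue_id})<-[:VISITED]-(c:Customer) "
    "RETURN count(DISTINCT c) AS unique_customers",
    venue_id="venue-123",
)
```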

Injecting enrichments into the lifecycle in this way ensures that every enrichment we build has an immediate use case, and no effort is wasted. Our goal is simply to stay focused as a team on delivering customer benefits, and enrichments driven by the needs of a machine learning experiment do exactly that.

So enrichments follow roughly this lifecycle:

  • A pool of potential enrichment ideas is prototyped in a notebook as part of an experiment.
  • Existing enrichments are consumed via the deployed service during experimentation.
  • The new enrichment ideas that are ultimately useful for a formal experiment are identified and developed to completion.
  • Enrichments are consumed strictly through the deployed service for the remainder of the lifecycle.

Consumption by the Growth Platform

We expose the Enrichment API as a component of the public API of machine learning because these are data transformations that have use cases beyond machine learning features. Graph traversal functions unlock information hidden in the connections between data points, and with Paystone's schema, many entities are highly connected. Essentially, enrichments are intended to augment our schema with unmaterialized calculated fields that make use of graph structures.

While the growth platform is not responsible for the development of enrichments, they can propose or request new enrichments when opportunities are identified. While it is the machine learning experiment lifecycle that primarily drives innovation in the enrichment API (as discussed above), the ultimate goal is for enrichments to be built in response to immediate use cases. Where the growth platform has an immediate use case that machine learning does not, we should still build new enrichments.

As discussed below, growth platform requests are handled by Platform and Productivity, meaning these requests flow through that product group.

Whether the result of a request by the growth platform or of a machine learning experiment, enrichments are meant to be highly discoverable. For this, we have the documentation of the enrichment service itself. Both growth platform engineers and product leaders should make a point of consulting this documentation regularly to check for new enrichments and potentially spark ideas. It could lead to new product ideas, or highlight ways of improving existing features.

The value of the enrichment API will likely be primarily felt through its contributions to machine learning services, which is why machine learning is in charge of developing it. But while consumption by the growth platform is secondary, it has the potential to make a substantial impact on the platform's feature set.

Prioritizing Two Consumers

We have established that both the growth platform and machine learning experiments can give rise to new enrichment backlog items. Wherever two consumers contribute requests to the same backlog, there is a prioritization question.

We handle this prioritization by dividing the requests of the two consumers into separate camps. When a machine learning engineer requires a new enrichment as part of an experiment, that is their responsibility, and it is encapsulated within the experiment lifecycle. When the growth platform requires a new enrichment, it is considered a machine learning ops task. While it doesn't perfectly fit the mold of "MLOps", because it is not directly facilitating or improving the development of machine learning services, it falls into the bucket by virtue of not being part of the experiment lifecycle. Broadly, when work that is the responsibility of the machine learning team does not fall within the experiment lifecycle, it is considered machine learning ops work.

How It Works

Enrichments are queries of the graph database component of the CDP, with potential post-processing of the results. Each enrichment has a unique, friendly name.

Enrichments can be queried for one entity or for many at once, with the understanding that the graph database is a transactional system: queries over many entities may not scale as well as they would in an analytical system such as a data warehouse. Optional caching, applied when the request calls for it, can mitigate some of the performance cost of large queries.

Because they are computed fields rather than stored data, enrichments lend themselves naturally to parameterization.

Combining these concepts, an enrichment request will provide the following information:

  • The name of an enrichment to fetch.
  • The entities for which the enrichment should be fetched.
  • Any parameterization for the enrichment's computation.

It is up to the developer of the enrichment to decide what parameterization becomes part of the interface of that endpoint.
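
Putting these pieces together, a request might look roughly like the following sketch. The endpoint path, field names, and caching flag are illustrative assumptions; the service documentation defines the actual interface.

```python
# Hypothetical shape of an enrichment request; the endpoint path, field
# names, and caching flag are assumptions, not the documented interface.
import requests

response = requests.post(
    "https://enrichment-api.example.run.app/enrichments/shared-customers",
    json={
        "entities": ["venue-123", "venue-456"],  # entities to enrich
        "parameters": {"limit": 10},             # enrichment-specific parameters
        "use_cache": True,                       # opt in to caching for large queries
    },
    timeout=10,
)
response.raise_for_status()
results = response.json()
```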