
Moving Forward From Experimentation

A key question in the lifecycle of creating machine learning services, and the focus of this article, is: when do I move out of Jupyter, and create a formal experiment for my idea?

There is no direct answer to this question. Instead, we have a set of principles that guide our development process.

Start Fast, Then Slow Down

The value of machine learning is that it can improve when given high quality feedback as data. For a machine learning service to receive feedback, it needs to be running in production and making predictions. Because of this, the goal of a first iteration of any service is to get something working as quickly as possible. This means spending less time making small tweaks and improvements in experimentation.

As an engineer working on the first version of a new service, prioritize speed over accuracy.

Accuracy comes in future versions, when feedback is being collected and we have a larger toolkit with which to improve the service. With that in mind, an engineer working on a new revision of an existing service can take their time experimenting with more ideas without the same urgency of the initial version.

Remember: every day that we spend with zero versions of a service deployed is a day we lose out on valuable feedback data.

Make Incremental Changes

Our machine learning experiment lifecycle is a realization of the scientific process in a software development setting. The scientific process relies heavily on being able to measure the effect of each alteration in isolation, and that is an important point to remember when experimenting with machine learning.

Make liberal use of minor version bumps to isolate changes from experiment to experiment. If an architecture change is mixed with a change in the dataset curation process, it may be difficult to separate the effects, which will hinder future experiments.
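As a rough illustration of what "one change per minor version" means in practice, consider the hypothetical experiment configuration below; the field names and values are illustrative, not our actual schema. The clean bump changes only the architecture, while the mixed bump confounds two changes at once.

```python
# Hypothetical experiment configs; field names and values are illustrative only.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    version: str
    architecture: str
    dataset_curation: str
    learning_rate: float

v2_3 = ExperimentConfig(
    version="2.3",
    architecture="bilstm",
    dataset_curation="drop-short-documents",
    learning_rate=1e-3,
)

# Good: v2.4 changes only the architecture, so any metric shift is attributable to it.
v2_4 = replace(v2_3, version="2.4", architecture="small-transformer")

# Risky: changing the architecture and the curation step together makes the effect
# of each change impossible to separate after the fact.
v2_5_mixed = replace(
    v2_3,
    version="2.5",
    architecture="small-transformer",
    dataset_curation="drop-short-and-duplicate-documents",
)
```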

An engineer working on improving an existing service need not feel limited to a single minor version bump before moving on from the project. It often takes many new minor versions to arrive at one that is promoted and remains as the main revision.

Clean Code is Fast Code

One reason an engineer may be inclined to stay within Jupyter Notebooks for experimentation is that it allows for rapid prototyping. One may feel that the added rigour of writing "production code" may slow down development.

While it is true to some extent that "notebook code" is easier and quicker to write, the tools and rules we have built around our formal development environment make it easier to write code efficiently. Static type checking, an emphasis on organizing experiments into tasks, and peer review are all effective productivity tools that are largely inaccessible in a notebook environment.

On the flip side, even though type checking is not available within Jupyter, making an effort to write code in notebooks that would pass static type checks will help greatly when it comes time to port an idea from a notebook to a formal experiment. Engineers are encouraged to annotate even notebook code.
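As a small, purely illustrative example, a notebook cell written with full annotations ports almost directly into a formal module:

```python
# Illustrative notebook cell: annotated as if it already lived in a module,
# so it can be lifted out with little rework.
import pandas as pd

def prepare_features(raw: pd.DataFrame, text_column: str = "body") -> pd.DataFrame:
    """Drop rows with missing text and lower-case the text column."""
    cleaned = raw.dropna(subset=[text_column]).copy()
    cleaned[text_column] = cleaned[text_column].str.lower()
    return cleaned
```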

Another potential reason for staying in Jupyter Notebooks beyond the point that an idea has been found is the compute power available there compared to local training runs. To alleviate this concern, we have remote training, a section of our CLI dedicated to using Compute Engine VMs to train experiments being developed locally.

Failed Experiments Are Fine

One reason an engineer may be hesitant to move on from experimentation in notebooks is that they lack full confidence that the idea will actually work when a full training run is executed. An idea may have been tested on a small sample training set and gone reasonably well, but without a guarantee that it will translate to a successful experiment, they hesitate and instead continue tinkering in the notebook.

This is a flawed thought pattern. Failed experiments are a key part of the machine learning lifecycle, and the lifecycle itself happily facilitates failed experiment flows. One is never in danger of accidentally pushing a failed experiment into production; the infrastructure won't allow it. Because of that, lifting an idea into the full lifecycle serves only to collect more information on its performance.

For example, one may find through their data and regression tests that a certain idea has a specific failure mode. In the case of a text classification problem, a regression test may have been set up for particularly angry vocabulary on which the service performs surprisingly poorly.
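A regression test of that kind might look roughly like the sketch below; the Classifier protocol, the stand-in model, and the example texts are all hypothetical placeholders rather than our actual service interface.

```python
# Hypothetical regression test for an angry-vocabulary failure mode.
from typing import Protocol

import pytest


class Classifier(Protocol):
    def predict(self, text: str) -> str: ...


class KeywordClassifier:
    """Stand-in model so the sketch is self-contained; a real test would load the experiment's artifact."""

    def predict(self, text: str) -> str:
        angry_words = {"infuriating", "furious", "outraged"}
        return "complaint" if any(word in text for word in angry_words) else "other"


@pytest.fixture
def model() -> Classifier:
    return KeywordClassifier()


ANGRY_EXAMPLES = [
    ("this is absolutely infuriating, fix it now", "complaint"),
    ("i am furious that my order never arrived", "complaint"),
]


@pytest.mark.parametrize("text,expected", ANGRY_EXAMPLES)
def test_angry_vocabulary_regression(model: Classifier, text: str, expected: str) -> None:
    assert model.predict(text) == expected
```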

Now, it is possible to train an experiment to completion locally and view the artifacts on one's own machine, inspecting the test results and deciding whether to commit to a "real" run. While there is nothing in place to stop an engineer from doing that, it is a somewhat discouraged pattern. Ad-hoc runs of experiments whose artifacts are saved locally and ultimately discarded deprive the future engineer, as well as the rest of the team in the present, of the opportunity to learn from that failure. It becomes lost to time.

Structure is Valuable

In organizing the potentially scattered ideas of a notebook into a formal Python module, one may spot opportunities for efficiency gains that were previously hard to identify. This is part of why engineers should convert their own notebook code themselves.

The process of writing a serving application around an idea is valuable as well, as it makes clear what the important artifacts are and what their interface should be. As discussed in the lifecycle article, writing the serving application is deliberately placed before the training module because it informs the structure of the training module. Along those lines, writing a serving application for a mostly-formed idea may help crystallize how that idea should be implemented as a training routine.
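To sketch that effect, the hypothetical serving endpoint below (FastAPI and joblib chosen only for illustration; the artifact names are assumptions, not our actual layout) immediately pins down the training module's contract: it must produce a serialized vectorizer and classifier with these interfaces, and nothing more.

```python
# Hypothetical serving sketch; artifact paths and names are assumptions.
from pathlib import Path

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

ARTIFACT_DIR = Path("artifacts")  # assumed location written by the training module

# Writing this first makes the training module's contract explicit: it must emit
# exactly these two files, with exactly these interfaces.
vectorizer = joblib.load(ARTIFACT_DIR / "vectorizer.joblib")
classifier = joblib.load(ARTIFACT_DIR / "classifier.joblib")

app = FastAPI()


class PredictionRequest(BaseModel):
    text: str


class PredictionResponse(BaseModel):
    label: str


@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    features = vectorizer.transform([request.text])
    label = classifier.predict(features)[0]
    return PredictionResponse(label=str(label))
```

Once a contract like this exists, the training module can be written backwards from it, saving exactly the artifacts the endpoint loads.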