Jupyter Notebooks¶
Jupyter is a web-based interactive computing platform. It combines code, narrative text, visualizations, and rich outputs into a readable document. This format is ideally suited for machine learning experimentation.
That said, our usage of Jupyter is strictly confined to the experimentation phase of the machine learning lifecycle. For all of its benefits when prototyping ideas, Jupyter's fast-paced, interactive development environment lacks the code-cleanliness tooling of an IDE, which can easily lead to chaotic, hard-to-follow, hard-to-maintain code. Because of this, it is not the ideal place to develop code that is meant to persist in production codebases.
In this article, we briefly discuss the treatment of Jupyter notebook code, and the infrastructure in place to develop it.
Developing Notebooks¶
We use Jupyter Lab as our interface to Jupyter, and the only available kernel is Python. Both local and remote (on Compute Engine) Jupyter development environments are packaged as Docker containers, using the same image.
Compute Engine Instances¶
Often, experimentation requires loading and invoking large model artifacts, which demands both a large amount of memory and at least one GPU. We do not expect engineers to have these resources on hand on their local machines; when they are required, one can easily provision a new Compute Engine instance.
The Compute Engine instance provisioned with `psml notebooks create` -- which has options to customize resources -- launches the same Docker container as the local `psml notebooks run` command, and securely exposes the port on which Jupyter runs. By obtaining the instance's IP address with `psml notebooks get`, one can easily access a remote Jupyter instance, which runs with no added latency and feels the same as developing locally.
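The end-to-end workflow looks roughly like the following sketch. Only the three subcommands named above are taken from this document; any resource-customization options should be checked against the tool itself:

```shell
# Provision a remote notebook instance (accepts options to customize
# resources such as memory and GPUs).
psml notebooks create

# Once the instance is up, look up its IP address.
psml notebooks get

# Open Jupyter Lab in a browser at the reported address.

# For comparison, the same Docker image runs locally with:
psml notebooks run
```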
Customization¶
Jupyter Lab can be customized by each engineer individually. Personal configuration files are automatically saved to Google Cloud Storage on every change, and they are loaded whether Jupyter is running locally or on Compute Engine.
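As a rough sketch of how per-engineer configuration could be keyed in the bucket -- the bucket name, home directory, and key scheme below are illustrative assumptions, not the actual implementation:

```python
from pathlib import PurePosixPath

# Hypothetical names: the real bucket and config root are not documented here.
CONFIG_BUCKET = "example-jupyter-configs"
JUPYTER_CONFIG_ROOT = "/home/jovyan/.jupyter"


def config_object_key(engineer: str, local_path: str) -> str:
    """Map a local Jupyter Lab config file to a per-engineer object key.

    The same key is used on save (upload on change) and on load
    (when a local or Compute Engine container starts).
    """
    relative = PurePosixPath(local_path).relative_to(JUPYTER_CONFIG_ROOT)
    return f"{engineer}/{relative}"


key = config_object_key(
    "alice", "/home/jovyan/.jupyter/lab/user-settings/overrides.json"
)
```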
Extensions¶
The following extensions are included with the Docker image:
- jupyterlab_execute_time: displays each cell's most recent execution time.
- jupyterlab_system-monitor: displays Jupyter's resource usage, specifically CPU and RAM.
How Notebooks are Organized¶
We keep two categories of notebooks in Cloud Storage, and these are the top-level directories copied into the Jupyter Docker container when it starts:
- Personal notebooks.
- Common notebooks.
An engineer developing a notebook simply to test out a basic idea, or to do some ad-hoc data analysis, may choose to place it under the `personal/` directory. These notebooks are copied into future Docker containers (local and remote) only for the engineer who owns them.
An engineer developing an experimental idea as part of the machine learning lifecycle should place the notebook under the `common/` directory. These notebooks are copied into all engineers' Jupyter Docker containers.
There is no assigned or recommended directory structure for common notebooks. We will all just do our best to keep it organized.
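The copy rule above can be sketched as a small helper. The function and parameter names are illustrative, not part of the actual tooling:

```python
from pathlib import PurePosixPath


def container_copy_targets(
    notebook_path: str, owner: str, all_engineers: list[str]
) -> list[str]:
    """Decide whose containers receive a notebook on startup.

    Notebooks under common/ are copied for every engineer; notebooks
    under personal/ reach only their owner's containers.
    """
    top = PurePosixPath(notebook_path).parts[0]
    if top == "common":
        return list(all_engineers)
    if top == "personal":
        return [owner]
    raise ValueError(f"unexpected top-level directory: {top}")
```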
Common notebooks save a "version" copy each time the notebook is saved, whether by auto-save or manual save; this is our simple approach to notebook "version control". Versions are written to a `versions/` directory, where the parent directory is named after the notebook and each file is named with the timestamp of the save. The main copy of the notebook, whose path in the bucket matches its path in the container, is then overwritten.
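A sketch of how a version's object key might be built. The timestamp format, the `.ipynb` suffix on the version files, and the exact location of `versions/` within the bucket are assumptions for illustration:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def version_key(notebook_key: str, saved_at: datetime) -> str:
    """Build the object key for one saved version of a notebook.

    Versions live under versions/, in a directory named after the
    notebook, with the save timestamp as the filename.
    """
    name = PurePosixPath(notebook_key).stem  # directory named after the notebook
    stamp = saved_at.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    return f"versions/{name}/{stamp}.ipynb"


key = version_key(
    "common/experiments/foo.ipynb",
    datetime(2023, 5, 1, 12, 30, 0, tzinfo=timezone.utc),
)
```

After writing the version copy, the main notebook object at its usual path would then be overwritten with the new contents.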