Documentation

Perhaps more than in other software engineering subdisciplines, documentation is a critical component of machine learning. As noted in the machine learning lifecycle, no experiment is complete without documentation of the training routine employed. In this article, we'll briefly go over the purpose and primary audience of experiment documentation, and the principles to consider when writing it.

Other forms of documentation include inline code documentation, API documentation, and conceptual documentation, of which this entire hub is an example. We touch on these briefly as well.

Experiment Documentation

Purpose and Audience

Machine learning involves more trial and error than most other kinds of software development. When an idea fails, it is just as much of a learning opportunity as when an idea succeeds. The role of documentation is to persist these failures so that the lessons learned are never forgotten.

First, for the engineer documenting the experimentation, it helps them sort through the process and crystallize the reasons for success or failure. The act of writing retrospectively about an experiment, with the goal of being understandable to others, forces one to distill the knowledge gained from the process and to emphasize what was really important. It is said that the best way to learn something is to teach it; in a similar way, explaining why an experiment went the way it did is perhaps the best way to understand it.

Second, for the audience, it is a guide for future work. While everyone in the organization is of course welcome to read any documentation we produce, the primary benefit of documenting experiments is to inform the future developers of machine learning services, namely the machine learning engineers. The writer of experiment documentation should tailor the material to this audience.

This means that it is acceptable to take for granted foundational knowledge of machine learning algorithms. It is not necessary to explain how algorithms work or explain basic concepts in detail.

It also means that the focus should be primarily on justifying the decisions made, and explaining what led to the end result. Were there algorithms tried that didn't work? What change led to the greatest boost in performance? What changes had the least effect on performance?

Keep in mind that these documents are focused on solutions, not the problem. The service specification documents explain all that is required about the problem being solved, and that does not need to be rehashed in these documents.

In general, the resulting document should be something that the writer is confident they or any other machine learning engineer could return to months or years later and use to inform a subsequent iteration on the service.

Principles

  • Focus on the unique aspects of the experiment: what can we learn from this experiment that we haven't learned from previous experiments?
  • Spend time on the key decisions: roughly, the amount of writing about a particular decision point in the experiment should be proportional to the amount of time spent on it during experimentation.
  • Consider what your future self would appreciate: if an exploration took a significant amount of time and led to a concise conclusion, thoroughly documenting that process and its conclusion relieves your future self from having to relive the experience.
  • Write with empathy: if something can be explained in simple, easy-to-follow language, prefer that to dense mathematical language that may be difficult to parse.
  • Highlight changes from the previous iteration: it is unnecessary to re-tread the decisions already made previously, so focus on what you have done differently.
  • Remember your audience: you can save time on trivial explanations by leveraging the fact that your audience speaks your language.

Lack of a Template

At this point, we don't know enough about the structure of experiment documentation to give it a strict template. In fact, it is possible that such a template never emerges. Machine learning experiments can vary wildly, especially when considering different modalities and training paradigms.

If we notice repeated patterns in experiment documents, we will make an effort to codify them. This is not something we will actively look for, but rather something we will respond to if it presents itself.

Other Documentation

API Documentation for Auxiliary Services

The documentation for auxiliary service APIs comes primarily from the data models involved. Well-documented data models produce a well-documented, user-friendly API. Documentation of data models is covered in more detail in the Style Guide.
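
To illustrate, here is a minimal sketch of how data model documentation can flow into an API reference. It assumes a FastAPI/Pydantic stack and uses a hypothetical prediction endpoint; neither the framework nor the names below come from this hub, so treat it as an example of the pattern rather than a description of our services.

```python
# Hypothetical example: field descriptions on the data models are picked up
# automatically by FastAPI when it generates the OpenAPI reference, so the
# API documentation is only as good as the model documentation.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class PredictionRequest(BaseModel):
    """Input for the hypothetical /predict endpoint."""

    merchant_id: str = Field(..., description="Identifier of the merchant to score.")
    lookback_days: int = Field(30, description="Days of history to include in the features.")


class PredictionResponse(BaseModel):
    """Output of the hypothetical /predict endpoint."""

    score: float = Field(..., description="Model score between 0 and 1.")


@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # Placeholder logic; a real service would run the model here.
    return PredictionResponse(score=0.5)
```

Because the descriptions live on the models themselves, the generated reference stays in sync with the code, which is why the data models are treated as the primary source of API documentation.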

Conceptual documentation for these APIs exists within this documentation hub and is somewhat separate from the more technical details of the API reference.

Code Documentation

There is an intentional lack of emphasis on docstrings in the machine learning codebase at Paystone. Compared to the other forms of documentation, docstrings tend to have less value, and can take more effort to maintain.

Docstrings tend to accomplish a few things. They document the expected types of arguments and return values, they explain the role of arguments in a function, and they summarize what a function does.

These things can be useful, but they are often unnecessary. Type documentation is redundant in an environment with complete static typing; the type annotations themselves double as documentation. When an argument to a function, or the entire function itself, is unclear in its purpose given all the available context, that is possibly a case of poor naming or a sub-optimal location for the function; it could also be a sign that the function's scope is too large. Implementation detail, on the other hand, is often not worth documenting, and it can change frequently.

To be clear, no engineer is discouraged from writing docstrings. In cases where a high-level function close to the public interface of an area of the codebase is somewhat unclear in its purpose or usage, a docstring can help. The general idea, though, is that well-organized code with clear function and argument names and well-abstracted concepts may not benefit much from an investment in docstring maintenance. If you find yourself essentially repeating the name of a function or an argument to describe it -- e.g. the "filename" argument of a "get" method is the name of the file you want to get -- you are likely writing low-value documentation.
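
As a rough illustration of that distinction, consider the two hypothetical functions below: the first docstring only restates the signature, while the second records behaviour the signature alone cannot express.

```python
# Low value: the docstring repeats what the names and type annotations already say.
def get_report(filename: str) -> str:
    """Get the report.

    Args:
        filename: The name of the file to get.
    """
    ...


# Higher value: a high-level, public-facing function whose docstring captures
# behaviour that is not obvious from the signature alone.
def train_and_register(dataset_uri: str, *, dry_run: bool = False) -> str:
    """Train a model on the given dataset and register the resulting artifact.

    When dry_run is True, the full pipeline runs but nothing is registered;
    the returned identifier refers to a temporary artifact instead.
    """
    ...
```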

This is not to be confused with comments. Comments are a very useful tool when writing code, and their usage is covered in our Style Guide.

Conceptual Documentation

Principles of conceptual documentation were covered in detail on the home page of this documentation hub.