# Gathering Data
In experimentation, when writing a training module, and when writing a serving application, the first step is almost always gathering the necessary data.
While the data being gathered can obviously vary wildly from experiment to experiment, the process of gathering it is fairly standardized in our systems. Our key principle is reproducibility, and from it follow three points of emphasis:
- Use of the feature store.
- Structuring data access as a training task.
- Use of auxiliary public data.
## Reproducibility
The key principle underlying each of the remaining points is reproducibility. Reproducibility can be a challenge in machine learning, particularly in academic settings, but within our own company it should be a non-issue. Once written, an experiment should be able to run the exact same data generation process at any point in time.
Practically, this can be exercised in the following ways:
- Use a single point of access.
    - There should be one, and only one, way to access a particular piece of data.
    - This should be independent of the scale of the data or the environment; building a large dataset for training and grabbing information for a single entity for serving should use the same point of access.
- Store processes, not data.
    - For example, when data is queried, whether from internal or external systems, do not manually save the result as an artifact (for example, in cloud storage) and reduce the code to a simple read of that file.
    - Instead, the code should be the process that queried the data from the internal/external system; we rely on our training processes to handle artifact generation.
- Random processes must be seeded. This is not something we can enforce in tooling, so it is part of the style guide (seeding is illustrated in the sketch after this list).
- Allow point-in-time access.
    - This mostly comes through tooling, and is discussed below with the feature store.
    - Generally, though, it means that any query should be able to be run with an additional timestamp parameter; with this parameter, all data ingested after the given timestamp is ignored, effectively making the query "as of" that time.
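To make the seeding and point-in-time items concrete, here is a minimal sketch of a data-gathering function that honors both. The function name, column names, and the in-memory DataFrame are illustrative stand-ins for a real internal source; in practice the `as_of` filter would be pushed into the query itself.

```python
from datetime import datetime, timezone

import numpy as np
import pandas as pd

SEED = 1234  # all random processes are explicitly seeded


def gather_events(source: pd.DataFrame, as_of: datetime) -> pd.DataFrame:
    """Return the rows of `source` as they existed at `as_of`.

    Rows ingested after `as_of` are excluded, so re-running this function
    later reproduces the original dataset exactly; a fixed seed makes the
    subsampling step reproducible as well.
    """
    snapshot = source[source["ingested_at"] <= as_of]

    # Seeded subsampling: same seed and same snapshot give the same sample.
    rng = np.random.default_rng(SEED)
    idx = rng.choice(len(snapshot), size=min(len(snapshot), 100_000), replace=False)
    return snapshot.iloc[idx]


# Toy stand-in for an internal table.
events = pd.DataFrame(
    {
        "entity_id": [1, 2, 3],
        "value": [0.1, 0.7, 0.4],
        "ingested_at": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-03-01"], utc=True
        ),
    }
)

# The timestamp is pinned rather than datetime.now(), so every run queries
# "as of" the same moment.
data = gather_events(events, as_of=datetime(2024, 2, 15, tzinfo=timezone.utc))
```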
## Feature Store
The feature store is our single point of access for row- and column-oriented raw data, as well as for aggregations and functions on raw data. It accomplishes the reproducibility goals above through two of its key properties:
- It is a single point of access that consumes data from OLTP, OLAP, and graph sources.
    - The OLTP data source serves single or low-multiple entity access from the feature store.
    - The OLAP data source serves high-multiple entity access from the feature store.
    - The graph data source serves enrichments, which are functions over the raw data present in OLTP and OLAP that leverage the graph structure to perform more powerful computations.
    - By default, the feature store chooses the data source according to these rules, giving one unified interface to all three sources.
- It stores functions, not data.
    - The feature store is a collection of functions over single instances which can be mapped to many instances.
    - Rather than running these functions on a schedule and storing their results, we compute the values of features for given entities on demand.
    - The functions we store necessarily provide point-in-time access parameters: every function in the feature store takes in a timestamp which can be used to restrict the ingestion timestamps of the data that it considers.
For more details on how the feature store works, see the paystone package article.
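Conceptually, each entry in the feature store behaves like a pure function over a single entity that takes a point-in-time parameter and can be mapped over many entities. The sketch below illustrates that shape only; it is not the paystone API, and names like `average_order_value`, `as_of`, and the in-memory order table are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass
class Order:
    amount: float
    ingested_at: datetime


# Toy raw-data source; in reality this access goes through the feature
# store's OLTP/OLAP backends.
_ORDERS: dict[str, list[Order]] = {
    "merchant-1": [
        Order(10.0, datetime(2024, 1, 5, tzinfo=timezone.utc)),
        Order(30.0, datetime(2024, 3, 5, tzinfo=timezone.utc)),
    ],
}


def average_order_value(entity_id: str, as_of: datetime) -> float:
    """A feature as a function: computed on demand, never pre-materialized.
    Only data ingested at or before `as_of` is considered."""
    orders = [o for o in _ORDERS.get(entity_id, []) if o.ingested_at <= as_of]
    return sum(o.amount for o in orders) / max(len(orders), 1)


def map_feature(
    feature: Callable[[str, datetime], float],
    entity_ids: Iterable[str],
    as_of: datetime,
) -> list[float]:
    """Mapping a single-entity function over many entities gives the
    high-multiple access path with no extra feature logic."""
    return [feature(entity_id, as_of) for entity_id in entity_ids]


# As of February 1st only the January order exists, so the value is 10.0.
print(average_order_value("merchant-1", datetime(2024, 2, 1, tzinfo=timezone.utc)))
```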
## Training Task
Gathering data for a training module should always be one of the tasks of its Experiment computation graph. This follows from the principle of storing processes instead of data.
It is also important because the artifacts generated by a training procedure are the outputs of the tasks in its Experiment. This means that by having a task whose output is the experiment's data, we implicitly save that data as an artifact. Data is a key piece of experiment lineage.
The task should typically produce six data objects: training features, training labels, validation features, validation labels, testing features, and testing labels. `paystone.training.types.data.SplitDataset` is a convenience data model for holding these.
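As a sketch of the shape such a task might take (the stand-in dataclass below is illustrative; its field names are assumptions, not the actual `SplitDataset` definition):

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class SplitDatasetSketch:
    """Illustrative stand-in for paystone.training.types.data.SplitDataset;
    the real model's fields may be named differently."""

    train_features: pd.DataFrame
    train_labels: pd.Series
    validation_features: pd.DataFrame
    validation_labels: pd.Series
    test_features: pd.DataFrame
    test_labels: pd.Series


def gather_data_task() -> SplitDatasetSketch:
    """The Experiment task whose output is the data: because task outputs
    are saved as artifacts, returning the splits here records the data in
    the experiment's lineage automatically."""
    # Toy frame standing in for a point-in-time feature store query.
    df = pd.DataFrame({"x": range(10), "y": [i % 2 for i in range(10)]})
    shuffled = df.sample(frac=1.0, random_state=1234)  # seeded shuffle
    train, val, test = shuffled.iloc[:6], shuffled.iloc[6:8], shuffled.iloc[8:]
    return SplitDatasetSketch(
        train_features=train[["x"]], train_labels=train["y"],
        validation_features=val[["x"]], validation_labels=val["y"],
        test_features=test[["x"]], test_labels=test["y"],
    )
```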
## Auxiliary Public Data
Not all training data will necessarily come from the feature store. Sometimes, auxiliary data is needed, particularly for early iterations of a service where data may be sparse.
When using public data to enrich a dataset, follow the same principles as above: include its retrieval in the experiment's data retrieval task rather than retrieving it via an ad-hoc process and storing it as a file to be loaded.
Since this data will have less documentation than data that comes from the CDP, be sure to give it extra emphasis in the documentation of the experiment itself.
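For example, retrieving a public enrichment table might look like the following sketch; the URL and column names are placeholders for a real public source:

```python
import pandas as pd

# Placeholder URL and columns; substitute the actual public source. Prefer a
# versioned or dated snapshot so the retrieval stays reproducible over time.
PUBLIC_DATA_URL = "https://example.com/public/regional_statistics_2024.csv"


def retrieve_auxiliary_data() -> pd.DataFrame:
    """Runs inside the experiment's data task: re-running the task
    re-executes this retrieval instead of reading a hand-saved file."""
    aux = pd.read_csv(PUBLIC_DATA_URL)
    # Keep only the columns the experiment uses, and describe this source in
    # the experiment's own documentation since it is not covered by the CDP.
    return aux[["region", "median_income"]]
```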