Storage¶
`paystone.storage` has an interface with three primary categories:

`PSStorage` is a wrapper around the `google.cloud.storage` client which provides all the basic functionality of Cloud Storage (a usage sketch follows this list):

- Saving files and directories to buckets.
- Downloading files and directories from buckets.
- Writing content to files in buckets.
- Reading content from files in buckets.
- Reading the metadata of files in buckets.
- Listing files in buckets.
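As a rough illustration of the capabilities above, here is a hypothetical usage sketch. Every method name below (`upload_file`, `download_file`, `write_text`, `read_text`, `get_metadata`, `list_files`) is an assumption chosen to mirror the list, not the actual `PSStorage` API; consult the class reference for real signatures.

```python
# Hypothetical PSStorage usage; all method names are illustrative assumptions.
from paystone.storage import PSStorage  # assumed import path

storage = PSStorage()

# Saving and downloading files.
storage.upload_file("local/model.bin", bucket="my-bucket", path="models/model.bin")
storage.download_file(bucket="my-bucket", path="models/model.bin", dest="local/model.bin")

# Writing and reading file contents directly.
storage.write_text(bucket="my-bucket", path="notes.txt", content="hello")
text = storage.read_text(bucket="my-bucket", path="notes.txt")

# Metadata and listing.
meta = storage.get_metadata(bucket="my-bucket", path="models/model.bin")
names = storage.list_files(bucket="my-bucket", prefix="models/")
```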
`paystone.storage.artifacts` has `read_remote_or_local_artifact` and `write_remote_or_local_artifact`, which read and write "Artifact" objects from either Cloud Storage or local storage, conditional on the presence of a `LOCAL` environment variable (a sketch of the routing follows below).

- For more information on what Artifacts are, see the documentation on `paystone.training`.
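A minimal sketch of that routing, assuming the actual implementation differs in its details: the `LOCAL` environment variable check comes from this page, while the local directory, bucket name, and the `read_bytes` call are assumptions.

```python
import os

from paystone.storage import PSStorage  # assumed import path
from paystone.storage.serialization import deserialize

LOCAL_ARTIFACT_DIR = "artifacts"  # hypothetical local location

def read_remote_or_local_artifact(name: str):
    """Read an Artifact from disk when LOCAL is set, else from Cloud Storage."""
    if os.environ.get("LOCAL"):
        with open(os.path.join(LOCAL_ARTIFACT_DIR, name), "rb") as f:
            data = f.read()
    else:
        data = PSStorage().read_bytes(bucket="artifacts", path=name)  # hypothetical call
    return deserialize(data)
```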
`paystone.storage.serialization` contains `serialize` and `deserialize`, a pair of functions for handling Artifacts.
Serialization¶
`serialize` provides a single, unified interface to the subtly complicated problem of arbitrary serialization. How do we serialize the output of any arbitrary Task from a training procedure, when some things can be pickled (e.g. sklearn models), some can't (e.g. PyTorch models), and some are actually many files (e.g. HuggingFace transformers)?
The answer we have devised here is to provide custom serializers (and deserializers) for each unique scenario. Serializers and deserializers come in pairs: the serializer knows how to turn an object of a certain type into a bytestream, and the deserializer knows how to read that bytestream back into an object.
How do we then deserialize these artifacts? Serialization appends a single byte to the file, which serves as an indicator of which kind of artifact is serialized. Deserialization begins by popping off this final byte, routing to the correct deserializer, and proceeding from there. ("But wait, couldn't you just communicate that information via the file extension, instead of having the same extension for everything and doing this byte nonsense?" to which I say: hey, shut up, I'll get to it.)
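To make the mechanism concrete, here is a minimal sketch of the tag-byte scheme. Only the "append one byte, pop it off, route" idea comes from this page; the registry layout, the tag values, and the `_pick_tag_for` helper are assumptions.

```python
import cloudpickle

CLOUDPICKLE_TAG = b"\x00"  # hypothetical tag value

def _pick_tag_for(obj) -> bytes:
    # Everything falls through to cloudpickle in this sketch; the real
    # dispatch would inspect the object's type (TensorFlow model, etc.).
    return CLOUDPICKLE_TAG

# tag byte -> (serializer, deserializer) pair
REGISTRY = {
    CLOUDPICKLE_TAG: (cloudpickle.dumps, cloudpickle.loads),
    # b"\x01": (serialize_tf_model, deserialize_tf_model), and so on.
}

def serialize(obj) -> bytes:
    tag = _pick_tag_for(obj)
    serializer, _ = REGISTRY[tag]
    return serializer(obj) + tag  # the indicator byte goes at the end

def deserialize(data: bytes):
    payload, tag = data[:-1], data[-1:]  # pop off the final byte
    _, deserializer = REGISTRY[tag]
    return deserializer(payload)
```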
All custom serializers output a single bytestream representing one file. When their procedure would output multiple files, those file contents are instead zipped together, and the result is again a single bytestream; the matching deserializer knows it is dealing with a zipped bytestream, and begins by unzipping.
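A sketch of that zipping step using only the standard library. The function names here are ours, not the library's, but an in-memory `zipfile` buffer is the natural fit for "many files in, one bytestream out":

```python
import io
import os
import zipfile

def zip_directory_to_bytes(directory: str) -> bytes:
    """Collapse a multi-file save directory into a single bytestream."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for root, _, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                archive.write(path, arcname=os.path.relpath(path, directory))
    return buffer.getvalue()

def unzip_bytes_to_directory(data: bytes, directory: str) -> None:
    # The matching deserializer knows the payload is a zip and unpacks it
    # to disk before loading the files from there.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(directory)
```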
Custom Serializers¶
Currently, we have custom serializers and deserializers for:
- TensorFlow models.
- Transformers pipelines.
The rest of our task outputs are converted to Artifacts by `cloudpickle`.
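That fallback path is just cloudpickle's round trip. For instance, with an sklearn model (which, as noted above, pickles cleanly):

```python
import cloudpickle
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
data = cloudpickle.dumps(model)     # serialize: object -> bytestream
restored = cloudpickle.loads(data)  # deserialize: bytestream -> object
```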