Getting Started

This document is intended to help the following people:

  • New Machine Learning Engineers getting themselves set up and acquainted with our systems
  • Existing Machine Learning Engineers who need to reset their environment for any reason
  • Other Software Engineers looking to make their first contribution to the datascience repository

For new machine learning engineers, we assume you have gone through the other onboarding steps, such as being added to our GitHub Organization and Slack. This entire document is relevant to you.

For existing machine learning engineers who just need to fix a broken environment, you probably only need Step 2.

For other software engineers who are just looking to develop and submit a PR, you probably want Step 2 and our GitHub Practices.

Step 1: Join the Guild Machine Learning channel on Slack

guild_machine-learning is the Slack channel for all machine learning engineers. Join us for discussions on our architecture, helpful updates from our GitHub and custom Slackbot integrations, and memes. Mostly memes.

Step 2: Clone and initialize the datascience repository

We’ve attempted to package as much of the initial setup as we can into a simple shell script. There are, however, a couple of things we can’t automate, so we ask you to take care of those yourself.

Here is a step-by-step guide to getting your local environment ready to develop:

  1. Clone the datascience repository:
    • git clone git@github.com:nicejobinc/datascience.git
  2. Install Docker.
    • For macOS, follow Docker’s installation instructions for Mac.
    • For Linux, follow Docker’s installation instructions for your distribution.
    • If you have a Windows machine, you should start by setting up the Windows Subsystem for Linux (see the sketch after this list). Then, follow the Linux installation instructions for your distribution.
  3. Install direnv.
    • Note the two distinct steps here: first install the direnv package, then hook it into your shell (examples after this list).
  4. In your terminal, navigate to the datascience repository.
  5. Execute: ./.init -v.
    • This will create a Miniconda environment in the datascience repository, which direnv will activate every time you enter the repository directory and deactivate when you leave it. This is your development environment. Although most commands run inside Docker containers, in keeping with the central role Docker plays in our architecture, it is still important to have this Miniconda environment correctly configured. (A sketch of the mechanism appears after this list.)
  6. Execute: ./.init -g.
    • This will prompt you to log in to Google twice. The first login obtains user credentials used when you invoke the cloud CLI; the second obtains user credentials used when you invoke any GCP resources from Python, including within Docker containers. Both are important (see the sketch after this list).
    • It also does some additional setup that we won’t detail here, in the spirit of not getting lost in the details. The important thing is that the -g stands for “Google Cloud stuff”.
  7. At this point, your initial setup is complete. However, as you start using CLI functions to interact with the repository, you will likely be prompted to build some Docker images. When a command needs an image that hasn’t been built yet, its error message will tell you the exact command to run.
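
For step 2 on Windows: recent versions of Windows can usually install WSL with a single command from an elevated PowerShell prompt. This is a sketch of the typical flow, not a substitute for Microsoft’s own instructions, and the exact steps depend on your Windows version:

    # From an elevated PowerShell prompt (Windows 10 2004+ or Windows 11):
    wsl --install -d Ubuntu
    # Reboot, open the Ubuntu shell, then follow Docker's Linux instructions there.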
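
For step 3: the shell hook is the part people most often miss. After installing the direnv package, add the hook line for your shell to its rc file (see direnv’s docs for shells other than bash and zsh), then restart your shell:

    # ~/.bashrc
    eval "$(direnv hook bash)"

    # ~/.zshrc
    eval "$(direnv hook zsh)"

The first time direnv sees the repository’s .envrc, it may ask you to approve it; run direnv allow from the repository root if so.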
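
For step 5: the automatic activation is driven by an .envrc file that ./.init -v sets up. Purely to illustrate the mechanism, here is a hypothetical minimal .envrc that activates a repo-local Miniconda environment; the paths and names are illustrative assumptions, not what our script actually writes:

    # Hypothetical sketch only; the real file is generated by ./.init -v.
    # Make `conda activate` available in non-interactive shells:
    source "$HOME/miniconda3/etc/profile.d/conda.sh"
    # Activate the environment that lives inside the repository:
    conda activate "$PWD/.conda-env"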
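
For step 6: we won’t detail everything -g does, but based on the two logins it asks for, they are likely equivalent to the standard gcloud flows below. Treat this as an assumption about what the script wraps, useful mainly if you ever need to refresh credentials by hand:

    # Credentials for the cloud CLI itself:
    gcloud auth login

    # Application Default Credentials, picked up by GCP client libraries in Python:
    gcloud auth application-default login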

This is all you need in order to start contributing to the codebase. Congratulations on making it through one of the most annoying parts of the job!

Step 3: Examine our template experiments

As you will learn later when we talk about our Experiment Package Structure, our machine learning services are all configured as modules of a Python package, and these modules have a consistent structure.

To develop this structure, we maintain two “example projects”. One predicts housing prices from a fake dataset of house features; the other predicts the sentiment of a string of text. Neither is intended to be a meaningful machine learning service; they are code samples that demonstrate the canonical use of our architecture and help us keep user experience front of mind as we develop it.

You can find both of these templates at cli/psml/psml/templates/. Take the time to explore this part of the codebase (a quick way in is shown after the list below). You will find:

  • Imports from paystone packages, giving a sense of how these packages are used to develop services.
  • The basic structure of model training.
  • How we use FastAPI to serve models.
  • How we organize tests.
  • The hierarchy of model versions and what each level is responsible for.
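
For example, from the repository root:

    # Browse the two template projects side by side:
    ls cli/psml/psml/templates/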

Of course, these concepts will all be covered in further documentation. This is simply a good starting point as you get comfortable with the codebase.

Step 4: Continue with these docs

Now that you’re in our little community, your environment is set up, and you’ve gotten a taste of what our code looks like, you can continue reading to find out more about what we do here.

It would be nice to point to just a couple of documents as everything you actually need to get up to speed. In all honesty, it's hard to justify excluding any page from that list. However, you may consider the "ML Services" section optional reading until you actually start developing those services. Still, it's a great read to get an idea of what we've been working on so far!

Finally, if anything in our documentation is unclear, you think something might be missing, or some step in the setup failed and you can’t figure out why, feel free to bring it up in our Slack channel. It could well be an opportunity for your first PR, where you’ll get hands-on experience with our GitHub practices!