Defining ML Engineers at Paystone¶

Among engineering specializations, "machine learning" has seen perhaps the most diverse set of treatments across all organizations.

In this article we establish Paystone's view of the machine learning specialization: the terms we use to codify it, the expected skills of its engineers, the responsibilities within its scope, and the principles with which we can decide where new work belongs.

Absence of Specialized Roles¶

In many organizations that leverage statistics and machine learning to extract valuable information from their data, there are a few different data- and machine learning-related titles. These organizations employ Data Analysts, Data Engineers, Data Scientists, Machine Learning Engineers, and Machine Learning Ops Engineers.

At Paystone, we have but one role: Machine Learning Engineer. To fully understand why we have taken this position, it is beneficial to further dissect the roles listed above.

While each organization views these roles slightly differently, a description that captures the majority of cases goes as follows:

Data Analyst
- The Data Analyst's skillset centers around being able to gather, aggregate, analyze, and present large amounts of data from across an organization's many data sources.
- A successful Data Analyst presents complex compositions of raw data in a way that is easily understood and actionable by the audience.
- Their primary audience is decision makers within the organization.
Data Engineer
- The Data Engineer's skillset centers around being able to expose an organization's many disparate data sources in a highly accessible and highly efficient way.
- A successful Data Engineer synthesizes all of their organization's data into an easy to use, practical interface that meets the latency and throughput SLAs of the data's consumers.
- Their primary audience is other engineers, mainly the others in this list.
Data Scientist
- The Data Scientist's skillset centers around using their mathematical proficiency and command of the scientific process to produce the statistical models that become machine learning services.
- A successful Data Scientist uses any and all available algorithms and techniques to produce the optimal model for a given problem.
- Their primary audience is machine learning engineers.
Machine Learning Engineer
- The Machine Learning Engineer's skillset centers around using strong software engineering principles to bring the models produced by Data Scientists into production.
- A successful Machine Learning Engineer deploys a variety of statistical models with consistency, reliability, security, and efficiency.
- Their primary audience is the consumers of machine learning models within the organization, typically application developers.
Machine Learning Ops Engineer
- The Machine Learning Ops Engineer's skillset centers around building tools and rules that enforce a high standard of quality for all machine learning systems in production.
- A successful Machine Learning Ops Engineer has enforced practices and provided tools that make machine learning production systems safe to deploy, enable meaningful iteration on their performance, and increase the productivity of Data Scientists and Machine Learning Engineers.

One might take from this discussion that we are attempting to employ only "unicorns": that we have amalgamated five roles into one and expect a single person to have the capability and the time management to do all five jobs simultaneously. This is not the case. Going back over the five roles above, we can see how Paystone organizes these responsibilities so that no one person is overburdened with either the volume of work or the expectation of an impossible skillset.

Data Analysis as a Function of Business Intelligence¶

We are very fortunate at Paystone to have a mature Business Intelligence team. Business Intelligence as a discipline is about delivering insights from an organization's data to its internal teams, so that data-driven decisions can be made to effectively grow the business. This aligns with the definition of Data Analysis we gave above. They are the most qualified people in the organization to solve these problems.

In the absence of dedicated Business Intelligence engineers and analysts, these responsibilities would fall on the shoulders of the next most qualified data wranglers, which would be the Machine Learning Engineers. This is the case for many organizations, but fortunately, Paystone is not one of them.

Because of this, we can effectively eliminate data analysis from the list of responsibilities for Machine Learning Engineers.

Data Engineering via the Customer Data Platform¶

The problem that data engineering solves is essentially that it is very difficult to get access to all of the relevant data for a given problem in one place.

A unified point of access for all of Paystone's data is the defining characteristic of the Customer Data Platform.

The development and maintenance of the Customer Data Platform is not the responsibility of one team; it is a collective effort receiving contributions from engineers across the organization. As such, it does not mandate having a Data Engineer on board. The systems that govern the ingestion of data, and the interfaces by which interested parties consume data from the "CDP", are at the core of Paystone's engineering.

Data Science and Machine Learning Engineering as a Unified Lifecycle¶

Creating a customer benefit is never the result of the efforts of a single engineer, or a single team of engineers. Along the way to creating the benefit, there are meeting points between teams. These points exist at the boundaries between the skill sets of the teams. Once a team reaches a point where they no longer possess the skills to continue carrying a customer benefit forward, they must hand off the benefit to another team.

A goal of any successful organization is to reduce the amount of friction at these meeting points, to increase the velocity of delivery for customer benefits. The minimal amount of friction, logically, is zero, which occurs when there are zero handoffs in the delivery of a benefit. For example, the organization of Paystone engineers into Product Groups has reduced handoffs by ensuring a collective skill set among Product Groups that covers a large surface area.

However, some things require handoffs. For example, design is a skill set somewhat orthogonal to engineering, but it is a necessary starting point for many engineering projects. Because of this, many tools and specifications have been developed to lend structure to this handoff process. There are common formats and a common language at the intersection of these two disciplines. This structure has greatly improved the efficiency of collaboration and reduced information lost in the cracks.

Treating machine learning as a [second party](/concepts/second_party), for all its benefits, also creates a meeting point between parties. This is why we have made an effort to structure the handoff by templating the specification.

Are the skill sets required by Data Science -- experimenting with and developing statistical models -- and Machine Learning Engineering -- deploying machine learning services in production -- such that they require separate owners, with a meeting point between them? No. Why?

Because strong software engineering fundamentals are expected of all Machine Learning Engineers. This is not the case in all organizations; often, Data Scientists are expected to "live in notebooks", where their work begins and ends with the experimentation process, and Machine Learning Engineers begin with these artifacts. In some disciplines where extreme specialization is necessary, such as robotics or embedded systems, this separation may be justified. Paystone is not in such a position, and we can leverage that fact to eliminate a handoff. At Paystone, Machine Learning Engineers own the entire lifecycle, from experimentation to monitoring in production. This has the added effect of giving each engineer an increased sense of ownership and autonomy. It also allows stronger interaction between the information obtained at each stage in the lifecycle. When the same engineer owns model architecture and deployment, lessons from deployment can easily inform future architecture, and vice versa, without a cumbersome communication medium.

This does place something of a burden on our hiring process; it is admittedly rare to find an engineer fluent in machine learning concepts who also possesses strong fundamentals. However, Paystone has always had a high bar for its engineers, so this is in line with our culture. We do our best to help our engineers succeed in this environment by creating a helpful development environment and maintaining good software engineering practices as a team. This happens through tools and documentation.

MLOps Projects as Siblings of ML Projects¶

It is a core principle of Paystone engineering that the current role of an engineer should never restrict them from pursuing work that they are interested in. Given the requirements we just established for Machine Learning Engineers, it is logical to assume that many of these engineers will at points show interest in the Ops side of Machine Learning. These requirements also imply that they will be well qualified for this work. Therefore, there is no reason to prevent any Machine Learning Engineer from taking on this work. Given its breadth of impact, both the process and code should be reviewed by a senior member of the team.

Because there is complete overlap between the available engineers for both ML and MLOps projects, they should be prioritized jointly.

Expected Common Skills¶

There are certain universal skills of software engineers that we place special emphasis on for Machine Learning Engineers. These skills are not unique to machine learning, but they are the ones we look for and emphasize in professional development for all Machine Learning Engineers.

Software Engineering Fundamentals¶

We have established above why strong software engineering fundamentals are required for our definition of a Machine Learning Engineer. What exactly are strong fundamentals, and how do they come into play in machine learning?

A strong software engineer writes code that can be maintained by anybody on their team. This means writing clear, easy to follow logic that is well factored and well commented where necessary. Simple counterexamples of this would be a single function that has multiple separate branching points with deeply nested logic, a convoluted one-line data transformation containing too many steps for a single line, or a strange-but-necessary pattern that goes uncommented so that no future reader can understand why the pattern was implemented. Our Style Guide covers many of these things, and the basics of readability are covered by tooling, but having these principles ingrained makes for a strong software engineer.
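As a small illustration of the one-line counterexample, the sketch below contrasts a dense single-expression transformation with a factored equivalent. The record fields (`merchant`, `amount`, `status`) and both function names are hypothetical, invented only for this example; they are not taken from Paystone's codebase or Style Guide.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical records; the field names are assumptions for illustration only.
TRANSACTIONS = [
    {"merchant": "a", "amount": "12.50", "status": "settled"},
    {"merchant": "b", "amount": "3.00", "status": "refunded"},
    {"merchant": "a", "amount": "7.25", "status": "settled"},
]


def totals_one_liner(rows: list[dict[str, str]]) -> dict[str, float]:
    # Hard to maintain: filtering, parsing, grouping, and summing in one expression.
    return {m: sum(float(r["amount"]) for r in rows if r["merchant"] == m and r["status"] == "settled") for m in {r["merchant"] for r in rows if r["status"] == "settled"}}


def totals_factored(rows: list[dict[str, str]]) -> dict[str, float]:
    # Same result, with one step per statement and a single pass over the rows.
    totals: defaultdict[str, float] = defaultdict(float)
    for row in rows:
        if row["status"] != "settled":
            continue  # Only settled transactions contribute to the total.
        totals[row["merchant"]] += float(row["amount"])
    return dict(totals)
```

Both functions return the same mapping for the sample data; the difference is only in how easily a teammate can read, review, and change each one.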

While familiarity with statically typed Python is not a hiring requirement, it is a necessity for working in the machine learning codebase. It is certainly a skill that can be learned; the earliest of Paystone's machine learning engineers began with no familiarity with static type checkers. However, developing a command of this aspect of software demands patience and rigour. It is very easy, and can be very tempting, to escape the clutches of a static type checker by using relaxed types. For example, our tooling does not restrict the use of the type "Any", because there are situations where it is needed. However, "Any" can also be used to effectively switch off type checking wherever the types become complex to keep track of. A strong software engineer is not satisfied with escape hatches; they deepen their understanding of the type ecosystem and the problem at hand, even refactoring if necessary, until the code carries the most detailed and accurate types available.
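To make the escape hatch concrete, here is a minimal sketch contrasting an `Any`-typed function with a precisely typed one. The `Prediction` shape and both function names are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

from typing import Any, TypedDict


class Prediction(TypedDict):
    # Hypothetical shape of a model output, assumed for illustration.
    label: str
    score: float


def top_label_loose(predictions: Any) -> Any:
    # With Any, a static type checker accepts every call to this function
    # and every use of its result, so mistakes surface only at runtime.
    return max(predictions, key=lambda p: p["score"])["label"]


def top_label_strict(predictions: list[Prediction]) -> str:
    # With precise types, the checker can verify both callers and the body:
    # a misspelled key or a non-list argument is reported before the code runs.
    best = max(predictions, key=lambda p: p["score"])
    return best["label"]
```

The two functions behave identically at runtime. The strict version simply gives the type checker enough information to catch a call such as `top_label_strict("oops")` during analysis rather than in production.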

Data structures are also very important to machine learning code. Using the wrong data structure can have a substantial effect on the code's runtime. This is because in machine learning, in both training and serving, you are dealing with large amounts of data at a time, meaning the same operations are run many times over. What would be a relatively small increase in latency in the context of a simple scalar transformation can be multiplied thousands of times over when dealing with an entire dataset. The most common example of this is "vectorization". Vectorization refers to using the subset of the API of scientific computing libraries in Python whose implementation is contained in lower-level languages like C and Fortran. These methods are highly optimized and typically run far faster than an equivalent loop written in pure Python. Understanding how to "vectorize code" -- how to take a loop in Python and represent it as a combination of transformations from this set of methods -- can often lead to speedups in the 100x range for machine learning code.
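As a minimal sketch of what vectorizing a loop looks like, the example below standardizes a set of values twice: once with pure-Python loops and once with NumPy array operations. NumPy is assumed here as a representative scientific computing library, and the function names are hypothetical.

```python
from __future__ import annotations

import numpy as np


def standardize_loop(values: list[float]) -> list[float]:
    # Pure-Python version: every arithmetic step runs in the interpreter.
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


def standardize_vectorized(values: np.ndarray) -> np.ndarray:
    # Vectorized version: the subtraction, division, mean, and standard
    # deviation each run as a single call into NumPy's compiled routines.
    # ndarray.std() uses the population formula (ddof=0) by default,
    # matching the loop above.
    return (values - values.mean()) / values.std()
```

The two functions agree numerically, which can be checked with `np.allclose`. The actual speedup depends on array size, dtype, and hardware, so it is worth measuring (for example with `timeit`) rather than assuming a fixed factor.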

This is not an exhaustive list of the qualities that make up fundamentally sound software engineers in machine learning, but it serves as a representative sample of the complete picture.

Conceptual Modeling¶

This does not refer to the ability to architect and train statistical models. Conceptual modeling is a softer, but arguably far more valuable, skill.

To conceptually model a problem is to:

- Understand the downstream effects of a solution to the problem, at least one but preferably many steps ahead. This means being able to evaluate a solution not just in terms of its immediate effects, but its global impact on a system.
- Understand the upstream processes that create the environment in which the problem exists. This means taking a wide-angle perspective of a system and identifying the role of each actor in that system.
- Decompose the upstream processes into their mutable properties, their immutable properties, and the actions that can be taken to change the mutable properties. This skill is invoked in engineering machine learning features, as well as creating services with actionable outputs.
- Establish the correct metrics by which to measure the performance of a system according to different subjective understandings of performance.

A strong conceptual model of a problem naturally surfaces actionable opportunities within the system. When one understands the dependencies between processes, the high leverage processes tend to become obvious. When one also understands the makeup of these processes, identifying the high leverage process consequently identifies the particular actions that can be taken to alter the desired metric.

This is all very abstract so far, so let's take a concrete example that invokes all that we have talked about. Let's talk about YouTube's video recommendation system.

A machine learning engineer with only the most basic conceptual modeling abilities might rather quickly arrive at a formulation for this problem. They see it as a simple ranking problem: given a set of videos, the goal is to assign each video a score such that the user is more likely to click on videos with a higher score.

Conversely, a machine learning engineer possessing the conceptual modeling skills we require at Paystone would look at the problem differently. They would demonstrate the four qualities we listed above:

- They understand that a system which automatically recommends content to people deeply impacts the livelihood of the content creator, the enjoyment of the user, the time spent by the user on the platform (and therefore the company's ad revenue), and even potentially the mindset of the user over time, just to name a few things.
- They understand the scenarios in which a user might arrive at a point where they receive recommendations from the algorithm, the kinds of users that are most impacted by the recommendation algorithm, and the ways in which the existence of the recommendation algorithm influences the creation of the content available to it.
- They can decompose users (gross) and content into their mutable properties and their immutable properties, and they understand what actions they are able to take in order to affect the mutable properties. An example could be the thumbnail used for a piece of content, assuming it is mutable, where they understand what the interface for changing that thumbnail looks like from the system's perspective.
- They intimately understand YouTube's role in the lives of people and in the company's success, and can distill this understanding into a set of measurements for the system that accurately and proportionately reflect the interests of both parties.

Using this complete understanding of the problem, they can develop a system which goes beyond simply optimizing a mathematical metric. They can develop a system that contributes significantly to the growth of the platform and which even improves people's lives.

The value of this skillset cannot be overstated. One who masters mathematical modeling can perform their job well. One who masters conceptual modeling can help transform a product.

Communication¶

Because machine learning is a specialization requiring an amount of research and experimentation arguably unmatched by any other software engineering specialization, there are multiple points at which strong communication skills become necessary.

First, the amount of new knowledge that is accumulated during the machine learning lifecycle is substantial. Most problems that machine learning solves have virtually unknown solution spaces, and the discovery process here is long and winding. There are certainly many failures along the way, and often the solutions arrived at invoke many complex ideas. Were all that new knowledge to remain inside the head of the engineer that created the service, we as an organization would be at a massive risk of losing proprietary knowledge. Proprietary knowledge is a critical competitive advantage. For this reason, clear and rigorous documentation that can be readily understood by the rest of the team even on into the future is a requirement of all machine learning development.

Second, the requirements of machine learning services are often loosely defined and abstract. Because of this, there must be a symbiotic relationship between Machine Learning Engineers, product leaders, and other engineers wherein the expertise of all three is shared amongst the group. It is the responsibility of the Machine Learning Engineer in this setting to clearly communicate the capabilities of machine learning to the rest of the team, because often these are unknown to others. Strong communication skills are critical here for two reasons: they prevent us from pursuing a fruitless course by misunderstanding the boundaries of machine learning, and they allow the Machine Learning Engineer to surface opportunities that others in the group did not realize were available. In these situations, the burden is on that engineer to clearly communicate the opportunity.

Third, machine learning really is a team sport. There have been and will continue to be many times where an engineer is stuck learning a new concept, or finding the right way to model a problem, or has found something new that they really want to discuss with the group. These are all opportunities for the team to benefit from each other's work personally and become better individual contributors, as well as to solve problems that cannot be solved individually. This will be helped greatly by each engineer possessing the soft skill of effective communication.

Unique Skills¶

Some skills that are required of Machine Learning Engineers are unique to machine learning. Because these things are not expected broadly across the engineering team, they become the primary value add of having specialized engineers in machine learning.

We'll skip the obvious first skill: a command of machine learning algorithms and techniques.

Statistical Validation¶

It is a common refrain that it is very easy to lie with statistics. Those without a strong statistics background can often do so unintentionally. Statistics can be a minefield, with potential for silent errors in many places. A Machine Learning Engineer who is strong in this area uses statistics to accurately uncover the truth in situations where it is not obvious.

This skill is unique to Machine Learning Engineers not only because of their background and knowledge, but also because of the tools available within the machine learning ecosystem. Because so much of the machine learning lifecycle is founded on statistical validation, tools focusing on this capability are plentiful, and readily accessible to Machine Learning Engineers in a way that they are not for others.

There is a strong coupling between statistical validation and the experiment lifecycle. A project which requires statistical validation necessarily invokes the experiment lifecycle, because this lifecycle is what codifies statistical validation.
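One common form of statistical validation is asking whether an observed difference between two variants could plausibly be noise. The sketch below uses a two-sided permutation test on the difference in means. It is a generic technique chosen for illustration, not a description of Paystone's experiment lifecycle, and the function name and parameters are hypothetical.

```python
from __future__ import annotations

import random


def permutation_p_value(
    a: list[float],
    b: list[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test for a difference in means between a and b."""
    rng = random.Random(seed)  # Seeded so the result is reproducible.
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffle them and recompute the difference in means.
        rng.shuffle(pooled)
        left, right = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed - 1e-12:  # Small tolerance for float rounding.
            hits += 1
    # Add-one smoothing keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value suggests the difference is unlikely to arise from random labelling alone. Deciding which quantity to compare, and what threshold to act on, is still a judgment that belongs to the engineer.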

Metric Selection¶

As was touched on under Conceptual Modeling, selecting the right suite of metrics to measure what is actually important for a solution is a subtle art. It requires an intuition about the true meaning of data and the relationships between the real-world entities that data captures. This sort of mindset is developed most effectively through years of experience developing systems that interact with the world, which is what Machine Learning Engineers do.

This skill goes hand in hand with statistical validation. To statistically validate a solution is to apply metrics to measure its effectiveness, and analyze those results. Selecting the correct metrics is the first step of that process.
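A minimal sketch of why metric choice matters: on an imbalanced binary problem, a model that never predicts the positive class can look strong on accuracy while being useless on recall. The labels below are synthetic and the helper names are hypothetical.

```python
from __future__ import annotations


def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    # Fraction of predictions that match the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true: list[int], y_pred: list[int]) -> float:
    # Fraction of actual positives that the model identified.
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Synthetic data: 5 positives out of 100 examples.
Y_TRUE = [1] * 5 + [0] * 95
# A degenerate "model" that always predicts the negative class.
ALWAYS_NEGATIVE = [0] * 100
```

Here `accuracy(Y_TRUE, ALWAYS_NEGATIVE)` is 0.95 while `recall(Y_TRUE, ALWAYS_NEGATIVE)` is 0.0. If the positives are what the business cares about, accuracy alone would validate a system that delivers no value.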

Responsibilities¶

Given all of the above, as well as what we know of our position on machine learning as a second-party, how do we know when a given project falls under the responsibility of the Machine Learning Engineers?

Internal Responsibilities¶

To get the obvious out of the way so that the rest of the discussion can be more focused, there are some clear-cut cases here.

A service that is required to support internal functionality of the machine learning team is the responsibility of that team.
- An example of this might be an automated messaging service that communicates the status of long-running training jobs to the machine learning team.
- This follows from the logic of machine learning as a second-party, where we have established a culture in which the Machine Learning Engineers have formed what is essentially a small organization within Paystone. Under this framework, they are logically responsible for the internals of their organization.
The machine learning team may choose to expose any internal services which would have utility as part of their "public API", or their services consumable by Paystone products.
- There may be intermediate outcomes as part of the experiment lifecycle which have enough utility to other consumers that they justify the investment required to expose them publicly.
- For example, machine learning features are valuable data transformations for building services, but they are also often valuable data transformations for the application to leverage. So, it is profitable to view them as part of the collective public API that the machine learning team exposes.

With these cases out of the way, we return to the question from the restricted perspective of projects that originate from the roadmap of the Paystone product line.

External Responsibilities¶

Fortunately, the preceding section on unique skills gives us nearly all of the background we need in order to formulate an answer to this question. Let's go through the logic of this answer one step at a time:

- One of the unique skills of a Machine Learning Engineer is statistical validation.
- Statistical validation requires metric selection, which is another unique skill of Machine Learning Engineers.
- A process which requires statistical validation requires experimentation.
- A process which requires experimentation submits to the experiment lifecycle.
- The experiment lifecycle ends with the creation of a machine learning service.
- Therefore: if a service requires statistical validation, then Machine Learning Engineers are uniquely qualified to build it via the experimentation process.

We can use this line of reasoning, each step of which has been justified over the course of this and other articles, to make the following reduction.

The answer to this question:

Should this project be the responsibility of the Machine Learning Engineers?

Is equivalent to the answer to this question:

Does this project require statistical validation?