Machine Learning as a Second-Party¶
Defining Second-Party¶
What do we mean by “second-party”?
First-party services are those which we create and maintain ourselves that directly drive customer benefits. Our many microservices, such as the Person Service, are examples of this.
Third-party services are those which we invoke to help enhance our products, but we have not created them, nor do we maintain them. An example of this could be our integration with Jasper AI, which we use to help generate marketing content.
So what are machine learning services? Are they services we create and maintain ourselves that directly drive customer benefits? Or are they those which we invoke to help enhance our products, but which we do not create nor maintain?
As an organization, we do not focus on creating machine learning features in our products. We focus on creating features in our products that deliver impactful solutions to customer problems. Some of these features have dependencies on machine learning services, and it is in these situations that we employ machine learning. However, in many of these cases we have decided it best to invest in bespoke machine learning solutions, which requires committing our own engineering resources.
This is how we arrive at the term “second-party”. Machine learning services are second-party because they are created and maintained within the walls of Paystone, but their purpose is only to enhance the first-party services that directly create customer benefits. Machine Learning Engineers develop machine learning services, and the Paystone product line “plugs into” these services as though they came from the outside world.
Implications¶
Taking this view of the role of machine learning within the organization changes a number of things about its development. It changes the processes, the outcomes, and the interactions between machine learning engineers and the broader engineering team. For the remainder of this article, we’ll discuss the practical implications of viewing machine learning as a second-party service.
Developed in Response to the Needs of a Feature¶
We’ve established that machine learning is a solution that we employ in certain situations to provide customer benefits. Items on the product backlog never prescribe solutions. Therefore, items on the product backlog should never prescribe machine learning services.
How, then, does machine learning work come about? Building on what we stated in the Purpose of ML, we should always first evaluate whether the lift that a machine learning solution provides is worth its cost of development. This work can and should be done by machine learning engineers. They are provided the specification of a capability, and their goal is to uncover the best tool to provide that capability. When the engineer concludes that machine learning is the optimal solution, their goal becomes to create (or even identify) a service to provide it. When the engineer concludes that machine learning is unnecessary, they communicate that result and the work is handed off.
Given this framework, machine learning service development begins only after the requirements of a new product feature, or an enhancement to an existing one, have been evaluated against the full set of solution classes (deterministic algorithms, heuristics, third-party tools, existing machine learning services, and a new machine learning service) and it is determined that a new machine learning service is required.
Requires Specified Communication¶
Once the needs of a feature have been evaluated by a machine learning engineer and the conclusion is reached that a new service is required, there is a necessary hand-off step. The product has essentially contracted the services of our machine learning team to help provide a component of a customer benefit. Contracting the services of a team naturally carries a communication component.
The product leader contracting machine learning services has the most intimate knowledge of the desired impact, the ideal use case, the potential pitfalls and edge cases, and the expected behaviour of the new service. It is very important that all of this information is communicated clearly to the machine learning engineer who is tasked with creating the service. Machine learning engineers certainly have some understanding of the broader picture of the product, as product and engineering work very closely together. But because machine learning engineers never sit directly within a product group, nor spend the majority of their time thinking about the product roadmap, it is unreasonable to expect them to be able to answer these questions entirely on their own.
Hand-offs can be expensive, especially when done incorrectly, and should be avoided where possible, but in this case one is necessary. Therefore, we do our best to make sure it happens as efficiently as possible. For this reason, we've crafted a Machine Learning Service Specification Template. We've made the use of this template a part of our experiment lifecycle to ensure that its value is maximized.
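As a rough illustration of the kind of information such a specification captures, here is a minimal sketch. The field names are hypothetical and drawn from the concerns listed above (desired impact, ideal use case, edge cases, expected behaviour); the real template is internal and may differ.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a service specification's fields; the actual
# Machine Learning Service Specification Template may differ.
@dataclass
class MLServiceSpec:
    feature_name: str          # the product feature contracting the service
    desired_impact: str        # the customer benefit the feature drives
    ideal_use_case: str        # the primary scenario the service must handle
    expected_behaviour: str    # what a correct response looks like
    edge_cases: list[str] = field(default_factory=list)

# Example hand-off from a product leader (all values invented):
spec = MLServiceSpec(
    feature_name="review-reply-suggestions",
    desired_impact="reduce merchant time spent responding to reviews",
    ideal_use_case="short English-language customer reviews",
    expected_behaviour="three candidate replies ranked by relevance",
    edge_cases=["empty review text", "non-English reviews"],
)
```

Making the spec a structured object rather than free-form prose gives both parties a checklist: a hand-off is incomplete until every field has an answer.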
Autonomy¶
The tech stack and processes of application development are optimized for application development. The reality of machine learning is that while it overlaps significantly with traditional software engineering, the differences are so deep within the principles of development that they require an entirely different environment to succeed.
To prove that statement, we can offer a few examples.
The most obvious difference is the language of choice. Python has come to dominate the machine learning space and there is no question at this point that it is the optimal language for developing machine learning services. With that comes a set of tools for development and deployment that is distinct from that of application development. This would create friction were it to exist in the same space, say in the same repository within a version control system.
A primary difference in process is experimentation. Experimentation is at the heart of machine learning development. The process of writing large amounts of code in order to simply validate an idea, code which has no effect on the production system, is largely foreign in traditional software engineering. Also foreign is the need for experiment tracking, so that both failures and successes can be studied and learned from. Testing practices are augmented with statistical validation as well, although this difference can become less pronounced at the architectural level with careful design.
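To make the experiment-tracking point concrete, here is a minimal sketch of what tracking buys you: every run, failed or successful, is recorded and can be compared later. This is a toy illustration using a JSON-lines file; in practice a dedicated tracking tool would fill this role, and all names here are invented.

```python
import json
import time
from pathlib import Path

def log_experiment(log_file: Path, params: dict, metrics: dict) -> None:
    """Append one experiment run (parameters plus resulting metrics) to a
    JSON-lines file so failures and successes alike can be revisited."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with log_file.open("a") as f:
        f.write(json.dumps(record) + "\n")

log = Path("experiments.jsonl")
log.unlink(missing_ok=True)  # start fresh for this toy example

# Two hypothetical runs with made-up metrics:
log_experiment(log, {"model": "logistic", "C": 1.0}, {"f1": 0.71})
log_experiment(log, {"model": "xgboost", "depth": 6}, {"f1": 0.78})

runs = [json.loads(line) for line in log.read_text().splitlines()]
best = max(runs, key=lambda r: r["metrics"]["f1"])
```

Even this much structure means a "failed" experiment is never wasted: its parameters and metrics remain queryable when the next idea comes along.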
Deployment processes can often differ as well, and this stems from a type of software bug unique to machine learning: mathematical bugs. Best practices for deployment in a machine learning context are constantly evolving, but a common theme is that the amount of rigorous evaluation required to implement continuous integration is significantly higher than with traditional software. This is because even subtle changes in the environment of a machine learning service can cause silent mathematical bugs to appear in production. In view of this, changes must be carried out with proper experimental processes, which again includes tracking and statistical validation.
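One way to read "statistical validation" in this context is that a candidate model should only replace the incumbent when the evidence of improvement is strong, not merely when a point estimate looks better. The sketch below illustrates the idea with a simple bootstrap over per-example correctness on a shared holdout set; the data, threshold, and function names are all illustrative, not a prescribed process.

```python
import random

def bootstrap_improvement(baseline_correct, candidate_correct,
                          n_boot=2000, seed=0):
    """Estimate the probability that a candidate model truly outperforms the
    baseline, by bootstrapping the per-example accuracy difference."""
    rng = random.Random(seed)
    n = len(baseline_correct)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(candidate_correct[i] - baseline_correct[i] for i in idx) / n
        if diff > 0:
            wins += 1
    return wins / n_boot

# Toy per-example correctness (1 = correct) on a shared holdout set:
baseline  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20
candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0] * 20

p_better = bootstrap_improvement(baseline, candidate)
ship = p_better > 0.95  # promote only when evidence of improvement is strong
```

A diff that would sail through a traditional code review can still fail a check like this, which is precisely why deployment in a machine learning context carries the extra evaluation burden described above.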
All of these things could reasonably be argued to be simply “add-ons” to traditional software engineering practices, and there may well be a solution to these problems that incorporates machine learning into the broader development environment. It is certainly a goal of the “MLOps” community to continue to assimilate the practices of software engineering and to make machine learning less of an edge case. The challenge is that when we look at where we are along that timeline, things are still very much evolving, to the point that every organization is answering the question of how to develop machine learning software independently for themselves. While we are still in this phase, the machine learning engineers have a responsibility to provide this organization’s answer to these questions, which is why they must be given the autonomy to do so.
Certainly, there is much to be learned from the engineering practices of the broader organization, and this knowledge transfer is wholeheartedly welcomed. Knowledge of software engineering practices can be transferred through documentation, training, mentoring and even code pairing, without tying the two systems together at the code level. To operate this way is to avoid a number of unnecessary headaches, if at the expense of some duplicated efforts.
One-Way Data Access¶
No third-party service that Paystone chose to employ would be given write access to Paystone's data platform. That would be an obvious security risk. Conversely, any first-party service would likely require write access to provide any customer benefit, so much so that this is a core capability shared by all first-party services. Where do second-party systems fall?
From the perspective of security, the fact that the creators and maintainers of second-party services are employed by Paystone means that there is no additional risk in granting them write access. However, that is not the only concern.
Writing data is arguably the most complex process in software architecture. Thousands of lines of code and documentation cover this process, from serialization to validation to broadcasting the writing event. This is undoubtedly one of the key responsibilities of the application, and to circumvent these meticulously crafted processes by opening up a separate "gate" for second-party services to push data through seems fraught with danger. More ink could be spilled to justify why this would be a bad idea, but it seems obvious at this point: there should be one, and only one, means of writing data to Paystone's data platform.
On the other hand, reading data is comparatively straightforward. A read-only interface to a data source is generally a safe instrument, one whose potential pitfalls and concerns lie almost exclusively on the side of the consumer.
For this reason, we firmly establish that as a second-party service, machine learning has read-only access to Paystone's data. Machine learning services can and should employ their own read-only interfaces to Paystone's data platform, but these interfaces should go no further, and role-based access control should prevent them from doing so.
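A minimal sketch of what such a read-only interface looks like at the code level, assuming a simple key-value view of the data platform (the class, method names, and data are all invented for illustration). The real enforcement belongs in platform-side role-based access control; the wrapper simply never exposes a write path to its callers.

```python
class ReadOnlyDataAccess:
    """Sketch of a read-only interface to a data platform. Platform-side
    RBAC is the real guarantee; this wrapper exposes only read methods."""

    def __init__(self, store: dict):
        self._store = store  # stands in for a real data-platform connection

    def get(self, key: str):
        """Fetch a single record by key."""
        return self._store.get(key)

    def query(self, predicate):
        """Return all records whose value satisfies the predicate."""
        return {k: v for k, v in self._store.items() if predicate(v)}

# Invented example data:
platform = {"merchant-1": {"reviews": 42}, "merchant-2": {"reviews": 7}}
access = ReadOnlyDataAccess(platform)

busy = access.query(lambda v: v["reviews"] > 10)
# No put/delete exists on the interface, and RBAC on the platform side
# should reject writes even if one were added by mistake.
```

Defence in depth is the point: the interface makes writes impossible by construction, and access control makes them impossible by policy.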
When machine learning services create data -- typically in the form of API responses containing predictions -- they do so without knowledge of the subscribers. This places solely in the hands of the application the responsibility of converting these responses into product features and customer benefits.
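To illustrate what "without knowledge of the subscribers" means in practice, here is a hypothetical prediction response shape. Every field name and value is invented; the point is only that the response describes the prediction itself and nothing about who consumes it, leaving the application free to map it onto product features.

```python
from dataclasses import dataclass, asdict

# Hypothetical response shape: no subscriber, feature, or product
# identifiers appear anywhere in it.
@dataclass(frozen=True)
class PredictionResponse:
    model_version: str
    prediction: float
    confidence: float

def predict(features: list[float]) -> PredictionResponse:
    """Stand-in for a real model; returns an averaged score for illustration."""
    score = sum(features) / len(features)
    return PredictionResponse(model_version="2024-01",
                              prediction=score,
                              confidence=0.9)

resp = predict([0.2, 0.4, 0.6])
payload = asdict(resp)  # what the API would serialize and return
```

Because the payload carries no hints about its consumers, the responsibility for turning it into a customer benefit sits entirely with the subscribing application, exactly as the text above prescribes.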
Startup Mentality¶
This is the softest of the implications, but worth mentioning nonetheless.
By framing machine learning as being halfway to an independent startup, we give the team the mentality that Paystone is a client — their most important, and only, client. This helps during prioritization and roadmapping to reinforce that the most important thing to the team is the success of their client, of Paystone.
Practically, this means a few things. It means that work that serves only the team itself can be justified only if its downstream benefits to the client are significant. It means that the APIs the team exposes should be as user-friendly as possible, not relying on back-and-forth communication between the two parties. It means that the team must continue to give Paystone reasons to justify the relationship; that they work with us because they want to, not because they have to.