Link Prediction¶
In network theory, link prediction is the problem of predicting the existence of an unobserved relationship between two nodes in a graph. The input that the predictor is given to consider is a sub-graph, meaning a subset of the observed nodes and relationships. Prediction routines vary, and may include both machine learning models and heuristic-based rules engines.
Link prediction is widely used in graph databases. It can be used to predict friendships in a social network, co-authorship of papers in a citation network, or even interactions between proteins in a biological network, to name a few use-cases. In some applications, it has a temporal element: given the current state of a sub-graph -- the set of nodes and the set of relationships -- it can be used to predict the set of relationships at a certain time in the future. For example, a temporal social network is one which tracks friendships over time. If at the current time, Persons A B and C are all friends with certain properties and certain other friendships, what is the most likely state of that friend circle in one month?
Temporal or static, model-based or heuristic-based, predicted links create new opportunities for leveraging the power of a highly-connected data schema. In this article, we will explore the value of these connections, explain how the service works under the hood, and describe how consuming applications can interact with the service.
By the end of the article, readers should clearly understand why this service is something we have invested in as an organization.
The Value of Connected Data¶
The Customer Data Platform (CDP) is a unified interface to three databases: transactional, analytical, and graph. Each of these databases is optimized towards a particular view of the data:
- The transactional database provides rapid read and write access to small numbers of records.
- This database is row oriented: it is best at reading single rows at a time.
- The analytical database supports performant reads of large numbers of records from across many tables.
- This database is column oriented: it is best at reading single columns at a time.
- The graph database enables traversing many consecutive connections between entity types.
- This database can be viewed as being value oriented: it is best at reading entities which share common values.
The value of a graph database increases exponentially with the connectedness of the source data schema. This is true both of interconnected and intraconnected entities, and perhaps even more of the latter. To demonstrate this, consider the following example.
A Person is connected to many events in the customer journey: Calls, Bookings, Visits, Reviews, Recommendations, Payments. In a transactional or analytical system, gathering all of these entities for a particular customer journey would not be very efficient. Querying for the second-order connections -- the entities which each journey event is connected to -- is even less efficient. Queries of densely interconnected entity types are exactly what a graph database is designed to support.
Further, a Person is likely connected to other Person entities, such as through Recommendations. This creates a cycle in the schema which becomes extremely difficult to leverage in a relational database, whether transactional or analytical: the number of joins required to collect all Person-to-Person connections is dependent on the values in the data. Queries with this property are nearly impossible to construct. Conversely, this intraconnectedness is again exactly the kind of query that a graph database performs efficiently.
Having shown that connections in a schema increase the leverage provided by a graph database, it is natural to wonder: can we intentionally create more connections in our schema? If we can do that, we increase the utility of a graph database. Indeed, the creation of novel connections in a graph database that do not appear in the source schema is the function of link prediction. Some link types will naturally have more utility than others, but in general, they all increase the overall connectivity of the graph.
Anecdotally, it appears that many of the most valuable companies in the world are beginning to leverage graph databases. However, it is still a very young, rapidly emerging technology. By installing it in our data platform, we have gained a competitive advantage. By developing this service, we increase its utility. Therefore, by developing this service, we increase our competitive advantage.
How It Works¶
Link Prediction at Paystone is still in its infancy. There are many model-driven techniques that will become the focus of future additions to the service, but for now, we prioritize the heuristic-driven approach.
A heuristic-based predicted link is the result of a graph query implementing a function over existing links in the graph, possibly with some post-processing. Critically, heuristic link prediction queries cannot use the properties of nodes, since this is an inefficient querying mechanism, and latency is a priority. Because of this, the development of a new predicted link may require the addition of new relationships in the graph schema which encode relationships between node property values. This means updating the graph ingestion process, which is documented outside of the machine learning hub.
As an example of a class of heuristic link prediction computations, a predicted link may be the union or intersection of a set of existing link types. Concretely, we might consider a predicted link type "IS" between a Person and a Person, which is the intersection of the link types "SAME_NAME" and "SAME_EMAIL".
Predicted links have weights corresponding to their probabilities. A link predicted with probability 1 is the result of a heuristic which we as an organization have agreed is equivalent to objective truth. Many other predicted links will have probabilities less than 1, indicating the likelihood that they represent real-world truth. As we begin to develop model-based predicted links, these in-between probabilities will become more prevalent.
How To Consume This Service¶
For some applications, the links themselves will be the data of interest. In these cases, querying the service directly is obviously sufficient. For other applications, the value of this data will be in the ability to use the connections as part of a larger graph query. In these cases, the response from the service can be converted into virtual nodes and relationships, for use in the larger query.
In either case, the general structure of requests and responses for the service is as follows:
- Each link type corresponds to a 3-part path in the API of the format
/[source_node_type]/[relationship_name]/[target_node_type]
.- For example, the relationship "IS" between "Person" and "Person" corresponds to
/person/is/person
.
- For example, the relationship "IS" between "Person" and "Person" corresponds to
- Each link type has two request types:
- A
GET
request with the source entity ID in theentity_id
query parameter. - A
POST
request with a list of source entity IDs in theentity_ids
field of the request body.
- A
- For each request type, the response is the same.
- The
predicted_links
field contains a list of objects. - Each object has a
root_entity_id
and a list oflinked_entity_ids
. - Each linked entity ID has both an ID and a probability, whose interpretation is discussed above under How It Works.
- A
synthesized_entity_id
combines the IDs of all linked entities into a single, new identifier; the rules for the generation of the identifier differ based on the link type.
- The