Problem Description¶
Customer Match is a feature that lets you upload reviews from your customers and match them to your existing customer database. This allows you to identify which customers have left reviews and use that information to improve your customer experience.
Any machine-learning driven solution to this problem must deal with a severe class imbalance. For the average company, each review can be matched against roughly 1,000 candidate customers. This means that for every positive example (the candidate that matches the review), we have roughly 999 negative examples (candidates that do not match). For larger companies, this ratio grows to 1 in 100,000 or more. Many models struggle to learn from such a skewed dataset in isolation, so we must design our solution to be robust to this imbalance.
Additionally, at the time of writing, our feedback process for Customer Match tends to produce a non-negligible number of false positives and false negatives: clients occasionally mark correct matches as incorrect, and vice versa. Feedback can only ever be collected on reviews that were not auto-matched, which means any dataset that relies on feedback reflects the difficulties of the model that performed the given match. Care must be taken to ensure that any solution does not overfit to this feedback.
Experiments¶
Version 1¶
Solution Statement¶
Given the highly imbalanced nature of the problem (finding one correct candidate in a pool of thousands), we use a combination of several systems to generate good results. The systems used in this iteration are:
- A proposal system to supply potential candidates for a given review.
- A ranked rules engine that provides some simple rules to rank likely matches.
- A very fast, high-recall classification model to rule out obvious non-matches.
- An F-Score balanced classification model to sort the remaining candidates.
Data¶
Collection and Generation¶
The training data is collected from CONFIRMED reviews. These are reviews in which the client has explicitly selected the correct customer. For every review, we generate a series of [(x1, x2), y] arrays where:
- x1 is a review
- x2 is a candidate proposal
- y = 1 when the match is correct, 0 otherwise
False pairs are generated randomly at a realistic ratio to true pairs (1000:1). Experimentation revealed that balancing the dataset prior to training led to poor performance on this task.
Test data is generated as a 20% split of the total dataset. 5-fold cross-validation on the remaining 80% is used for hyperparameter tuning.
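As an illustration, pair generation and the train/test split might look roughly like the sketch below. The function and variable names (`build_pairs`, `confirmed_reviews`, `candidate_pool`) are hypothetical, and the random sampling shown is a simplification of the actual generation process.

```python
import random
from sklearn.model_selection import train_test_split

NEGATIVES_PER_POSITIVE = 1000  # realistic candidate-pool ratio, per the section above

def build_pairs(confirmed_reviews, candidate_pool):
    """Build [(review, candidate), label] pairs from CONFIRMED reviews.

    `confirmed_reviews` is assumed to be an iterable of (review, true_customer)
    tuples and `candidate_pool` a list of customers from the same company.
    """
    pairs = []
    for review, true_customer in confirmed_reviews:
        pairs.append(((review, true_customer), 1))
        # Sample non-matching candidates at the realistic 1000:1 ratio rather than
        # balancing the classes, per the experimentation note above.
        negatives = random.sample(
            [c for c in candidate_pool if c is not true_customer],
            k=min(NEGATIVES_PER_POSITIVE, len(candidate_pool) - 1),
        )
        pairs.extend(((review, c), 0) for c in negatives)
    return pairs

# 80/20 split; the 80% portion is later used for 5-fold cross-validation.
# train_pairs, test_pairs = train_test_split(pairs, test_size=0.2, random_state=42)
```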
Features¶
The following features are used in the modeling process. For brevity, assume all pairs of (x, x) represent (Review x, Candidate x).
- Partial Levenshtein Ratio - best-substring edit distance between word pairs over the length of the longer word.
    - (first name, first name)
    - (last name, last name)
    - (whole name, whole name)
    - (whole name, email address)
- Metaphone Levenshtein Ratio - total Levenshtein distance between a phonetic representation of word pairs as generated by the metaphone algorithm over the length of the longer phonetic representation.
    - (first name, first name)
    - (last name, last name)
    - (whole name, whole name)
    - (whole name, email address)
- Initial Matching
    - (last name, last name)
    - (whole name, whole name)
- PersonActivity created events aggregated over the following periods. PersonActivity data represents certain customer journey behaviours that a candidate has exhibited prior to the review. These include such items as PaymentCreated, ClickCreated, VisitUpdated, etc. The periods calculated are:
    - # Within the Last Year
    - # Within the Last 30 Days
    - # Within the Last 7 Days
    - # Within the Last Day
    - # Within the Last 6 Hours
    - # Within the Last 2 Hours
In cases where any member of a distance measure pair is missing, that feature is set to 0 (maximum difference). This discourages the model from matching on missing data.
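To make the distance features concrete, a minimal sketch is shown below. It assumes `rapidfuzz` and `jellyfish` as stand-ins for whatever string-matching libraries are actually used, and it expresses both ratios as similarities in [0, 1] so that the missing-data convention (0 = maximum difference) applies uniformly; the production normalization may differ.

```python
import jellyfish            # metaphone + Levenshtein distance (assumed library choice)
from rapidfuzz import fuzz  # partial-ratio implementation (assumed library choice)

def partial_levenshtein_ratio(a, b):
    """Best-substring edit similarity between two strings, scaled to [0, 1]."""
    if not a or not b:
        return 0.0          # missing data -> maximum difference, per the note above
    return fuzz.partial_ratio(a, b) / 100.0

def metaphone_levenshtein_ratio(a, b):
    """Levenshtein similarity of the metaphone encodings, scaled to [0, 1]."""
    if not a or not b:
        return 0.0
    ma, mb = jellyfish.metaphone(a), jellyfish.metaphone(b)
    if not ma or not mb:
        return 0.0
    dist = jellyfish.levenshtein_distance(ma, mb)
    return 1.0 - dist / max(len(ma), len(mb))

def initials_match(a, b):
    """1 when the leading initials of both strings agree, 0 otherwise or when data is missing."""
    if not a or not b:
        return 0
    return int(a.strip()[0].lower() == b.strip()[0].lower())
```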
Models¶
Proposal Mechanism¶
For a candidate to be proposed as a potential match, it must meet the following basic conditions:
1. The candidate belongs to the same company as the review.
2. The candidate was created before the review.
All candidates meeting these conditions are passed into the next phase.
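As a sketch, and assuming hypothetical `company_id` and `created_at` fields on both entities, the proposal filter amounts to:

```python
def propose_candidates(review, customers):
    """Return every customer eligible to be matched against the given review."""
    return [
        c for c in customers
        if c.company_id == review.company_id    # 1. same company as the review
        and c.created_at < review.created_at    # 2. created before the review
    ]
```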
Ranked Rules Engine¶
There are 4 ranks available in the rules engine to sort candidates by likelihood. In descending order of importance:
1. The Review E-Mail matches the Candidate E-Mail. This is automatically considered to be a perfect match, and we immediately return the result.
2. The Review Name matches the Candidate Name, and the first and last name are present and have at least 2 characters in both the Review entity and Candidate entity. This is often considered a perfect match, but we defer to downstream models to tiebreak if necessary.
3. No rules match - this rank defers candidate selection to downstream models.
4. Initials non-match. This rank is only assigned if either the Review author name or Candidate entity name consists only of initials. An example might be:
* Review Name: MJ, Candidate Name: Don Jeeter
In situations where one of the choices is only initials, and the other one does not have matching initials, the match likelihood is reduced. Downstream models will de-prioritize candidates assigned this rank. If the initials do match, the candidate is assigned a rank of 3.
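A condensed sketch of the rank assignment is shown below. The field names and the two initials helpers are hypothetical; the production rules engine is assumed to express the same logic, not necessarily in this form.

```python
def _initials_only(name):
    """Heuristic: a name such as "MJ" or "M. J." that carries only initials."""
    tokens = name.replace(".", " ").split()
    if not tokens:
        return False
    return all(len(t) == 1 for t in tokens) or (len(tokens) == 1 and len(tokens[0]) <= 2)

def _initials(name):
    tokens = name.replace(".", " ").split()
    if len(tokens) == 1:                       # e.g. "MJ" -> "mj"
        return tokens[0].lower()
    return "".join(t[0] for t in tokens).lower()

def rules_engine_rank(review, candidate):
    """Assign a rank from 1 (best) to 4 (worst), mirroring the rules above.

    `review` and `candidate` are hypothetical dicts with "email", "first_name",
    "last_name", and "name" keys.
    """
    if review.get("email") and review["email"].lower() == (candidate.get("email") or "").lower():
        return 1                               # exact e-mail match: returned immediately upstream

    names = [review.get("first_name"), review.get("last_name"),
             candidate.get("first_name"), candidate.get("last_name")]
    if all(n and len(n) >= 2 for n in names) \
            and review["first_name"].lower() == candidate["first_name"].lower() \
            and review["last_name"].lower() == candidate["last_name"].lower():
        return 2                               # full-name match: tiebroken by downstream models

    if _initials_only(review.get("name", "")) or _initials_only(candidate.get("name", "")):
        if _initials(review.get("name", "")) != _initials(candidate.get("name", "")):
            return 4                           # initials disagree: de-prioritised downstream

    return 3                                   # no rule fired: defer entirely to the models
```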
Stage 1 Model¶
The first stage model is a very fast, high-recall Lasso regression model, designed to quickly rule out obvious non-matches. Features used by this stage are:
1. Partial Levenshtein Ratio
2. A PCA-reduced version of the PersonActivity features (10 components)
The Metaphone Levenshtein Ratio was not used at this stage because the metaphone algorithm for phonetic extraction is very compute-intensive; it is reserved for candidates that remain after filtering.
The model was tuned using a 5-fold cross-validation process on the training set with a focus on preserving high recall so correct candidates were not lost (roughly 99.5% recall achieved).
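A plausible Stage 1 pipeline, assuming a scikit-learn implementation and reading the Lasso model as an L1-regularized logistic classifier (one reasonable interpretation for a classification filter), could look like this; the column names are hypothetical:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical column layout; the real feature matrix is not specified above.
name_cols = ["partial_lev_first", "partial_lev_last", "partial_lev_whole", "partial_lev_email"]
event_types = ["PaymentCreated", "ClickCreated", "VisitUpdated"]          # etc.
windows = ["1y", "30d", "7d", "1d", "6h", "2h"]
activity_cols = [f"{e}_{w}" for e in event_types for w in windows]        # aggregated counts

stage1 = Pipeline([
    ("features", ColumnTransformer([
        ("name", "passthrough", name_cols),
        ("activity_pca", PCA(n_components=10), activity_cols),            # PersonActivity -> 10 components
    ])),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", class_weight="balanced")),
])

# Tune for recall so that true candidates are almost never filtered out.
search = GridSearchCV(stage1, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, scoring="recall", cv=5)
# search.fit(X_train, y_train)
```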
Stage 2 Model¶
The Stage 2 model is an F-Score-balanced Boosted Tree Ensemble. It was trained only on samples that were not filtered out by a fully trained Stage 1 model, reducing the imbalance to roughly 1:30. During experimentation, this approach proved superior to training on the full dataset: it allowed the model to focus on the most difficult cases and produced a much more normal distribution of match probabilities, which is helpful for sorting candidates by predicted class likelihood. Training on the full dataset, by contrast, produced a strongly bimodal distribution in which most candidates were assigned a probability of 0.0 or 1.0.
The model was tuned with 5-fold cross-validation, using F1-score as the primary evaluation metric. It is trained on a concatenation of all features discussed in the previous section, with no feature reduction or extraction.
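A sketch of the Stage 2 training setup is shown below, with scikit-learn's GradientBoostingClassifier standing in for whatever boosted tree implementation is actually used; `stage1_keep_mask` and the parameter grid are hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stage 2 sees only the pairs that survived Stage 1 filtering (~1:30 imbalance).
# X_stage2, y_stage2 = X_train[stage1_keep_mask], y_train[stage1_keep_mask]

stage2_search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.05, 0.1],
    },
    scoring="f1",          # F-Score-balanced tuning, per the section above
    cv=5,
)
# stage2_search.fit(X_stage2, y_stage2)
# match_probability = stage2_search.predict_proba(X_candidates)[:, 1]
```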
Perfect Match Threshold¶
The perfect match threshold determines whether a candidate is treated as a "perfect match". The term does not mean the match is absolutely certain, but that we are confident enough to auto-match without asking the client for input. If the top-ranked candidate has a Stage 2 probability of 0.5 or higher, it is auto-matched. Otherwise, the top 3 candidates are presented to the client as suggested matches.
In cases where stage 1 filters out all candidates and the rules engine does not give us a certain match, we never return a perfect match.
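In code, the decision reduces to something like the following sketch (the candidate list and field names are hypothetical):

```python
PERFECT_MATCH_THRESHOLD = 0.5  # Stage 2 probability above which we auto-match

def resolve(ranked_candidates):
    """Return either an auto-match or up to three suggestions.

    `ranked_candidates` is a hypothetical list of (candidate, stage2_probability)
    tuples, already sorted best-first by the combined ranking.
    """
    if ranked_candidates and ranked_candidates[0][1] >= PERFECT_MATCH_THRESHOLD:
        return {"perfect_match": ranked_candidates[0][0]}
    return {"suggestions": [c for c, _ in ranked_candidates[:3]]}
```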
Serving Logic¶
When a request comes in, the following steps are taken:
1. The candidate proposal system is used to query for a list of potential candidates.
2. The ranked rules engine is used to rank the candidates. If the top-ranked candidate is a rank-1 match, it is returned immediately.
3. The Stage 1 model pipeline is used to filter out obvious non-matches. Stage 1 scores are kept as a tiebreaker in the event that multiple candidates share the top rank.
4. The remaining candidates are passed to the Stage 2 model pipeline. If all candidates were filtered out, this step is skipped.
5. Predictions from the rules engine, Stage 1, and Stage 2 are combined to produce a final ranking of candidates. The ranking prioritizes the rules engine rank, followed by Stage 2 likelihood, followed by Stage 1 likelihood, so the rules engine ranking always takes precedence.
6. The perfect match threshold system is used to determine whether the top-ranked candidate is a perfect match. If it is, it is returned immediately. Otherwise, the top 3 candidates are returned to the client.
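The combined ordering in step 5 can be summarized as a single sort key, sketched below with hypothetical field names:

```python
def final_ranking(candidates):
    """Order candidates by rules-engine rank, then Stage 2, then Stage 1 score.

    `candidates` is a hypothetical list of dicts carrying "rule_rank" (1 is best),
    "stage2_prob", and "stage1_prob" produced by the earlier steps.
    """
    return sorted(
        candidates,
        key=lambda c: (c["rule_rank"], -c["stage2_prob"], -c["stage1_prob"]),
    )
```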
A diagram of this process is shown below:
Results¶
The entire end-to-end system was tested using the extracted test dataset. The results are shown below:
- 81% perfect match precision and recall
- 30% candidate top-3 accuracy on imperfect matches
Notes:
- If we wish to prefer either precision or recall, we can prioritize one by adjusting the perfect match threshold.
- For imperfect matches, the proposed candidate set often does not contain the correct customer, which explains the low top-3 accuracy.
Potential Future Directions¶
There are several areas where this system could be improved in the future. These include:
1. Data reliability. We discovered that our selection method for training data was not ideal: clients were not always engaging with our feedback mechanism in an appropriate way, which resulted in a fairly large number of false positives in our training data. This could be addressed by improving the feedback mechanism, or by manually curating a large number of training samples.
2. Better journey data. We use several time-sliced aggregation counts across journey features in our training data, but the exact ordering of these events is not captured, and event inter-relationships are ignored (though presumably some of this information is learned). Other representations of journey data, such as event sequences or a graph representation of the journey, could be used to improve the model.
3. Content extraction. We did not use review content in our model, but it could improve our results. For example, we could extract employee names and determine whether that employee is associated with a candidate, or extract product names and determine whether the candidate has engaged with that product.