
Bridging Between Worlds: A Collaborative Story of Data and Decision Science



Photo by Modestas Urbonas on Unsplash

It’s very clear that collaboration and diverse opinions are essential drivers for innovation. After all, it’s no coincidence “Collaboration” is one of PayPal’s core values and part of our mission statement. In this article, we wanted to share the story of how a three-month collaboration between a data scientist and a risk decision scientist yielded amazing results that were only possible thanks to the two joining forces.

In early 2021, Adam Inzelberg joined the Horizon team in the Global Data Sciences (GDS) organization as part of his first rotation in PayPal's Technology Leadership Program (TLP). TLP is a two-year rotational program across diverse roles, departments, and sites throughout PayPal, aimed at enhancing leadership skills and broadening technological and cultural experience. Ten PayPal employees are selected every year.

Before joining TLP, Adam spent years developing and managing risk strategies across multiple domains in PayPal. During this time, he gained knowledge of common fraud stories, a business understanding of the balance between fraud prevention and a good user experience, and experience in smart decision making.

Adam was assigned to work on a proof-of-concept project with Ilan Voronel, a senior data scientist on the Horizon team, which builds horizontal data science infrastructure and solutions serving all data science teams in PayPal, including Risk, Product, Marketing, and others. Among other things, Ilan has vast experience with large-scale clustering solutions, which was crucial for this type of project.

What were we trying to solve?

Fraud prevention is in constant evolution. As our prevention tools become more sophisticated, fraudsters develop new and innovative attacks of matching complexity. We need to stay ahead of the curve to maintain the strong wall of defense against fraud that PayPal is known for.

A problem we often see in this field is scaled attacks on our systems, where users create groups of accounts with shared characteristics and account behavior. While our solutions do catch most of them, there are still tricky accounts that fly under our radar. In this project, we wanted to see whether new, untested methods could yield results that had not been seen until now at PayPal.

After many discussions and brainstorming sessions, we came up with 2 methods:

1. Supervised Graph Linking Based Clustering

2. Unsupervised Clustering Approach

Method 1 — Supervised Graph Linking Based Clustering — Adam Inzelberg

This new method leverages an in-house graph-linking-based solution that allows us to connect accounts based on various parameters. It is a robust solution with very high accuracy that is used across multiple domains in the company. In its current state, one of its limitations is the number of hops it supports, that is, how many link layers it can build for a given account.

The amount of data we receive for each additional hop grows exponentially, to the point where after the second hop the data is extremely large and cannot be evaluated in a normal timeframe.

As an example, let's say each account in our database is linked to a few hundred other accounts, and we evaluate just two accounts. The first hop will return a few hundred accounts, but the second hop could already reach a few hundred thousand. The third hop will be enormous, and we obviously have more than two accounts to evaluate, so the approach does not scale.
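To make that growth concrete, here is a minimal sketch; the fan-out of 300 links per account is an illustrative assumption, not a real system figure:

```python
# Illustrative only: assume each account is linked to ~300 other accounts on average.
FAN_OUT = 300

for hop in range(1, 4):
    # Rough upper bound on accounts reachable at exactly this hop from a single seed account.
    reachable = FAN_OUT ** hop
    print(f"hop {hop}: up to ~{reachable:,} linked accounts")

# hop 1: up to ~300 linked accounts
# hop 2: up to ~90,000 linked accounts
# hop 3: up to ~27,000,000 linked accounts
```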

Therefore, we needed to find a way to reach N hops (N > 2) while keeping computation times reasonable, so the method could be leveraged in a real-world setting. The way we managed to significantly reduce the computation time, without compromising on the number of hops, was simply to filter out insignificant data: in this case, all accounts we don't believe to be fraudulent.

To predict bad account links, we built an XGBoost classifier using the graph-linking features, as well as other soft-linking features known to indicate fraudulent activity. We trained the model on two different, distant timeframes using an in-house-built tag. From there, we received a long list of account pairs predicted to be bad. This reduced the amount of data substantially while keeping the vast majority of our target population.
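A minimal sketch of such a pair-level classifier is shown below; the file paths, column names, `is_bad_link` tag, and score threshold are all hypothetical placeholders for the in-house features and tagging, not the actual setup:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import average_precision_score

# Hypothetical training data: one row per account pair, with graph-linking and
# soft-linking features plus an in-house "bad link" tag.
train = pd.read_parquet("pairs_timeframe_1.parquet")  # first timeframe (placeholder path)
valid = pd.read_parquet("pairs_timeframe_2.parquet")  # second, distant timeframe

feature_cols = [c for c in train.columns if c not in ("account_a", "account_b", "is_bad_link")]

model = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=20,  # bad links are rare, so up-weight the positive class
    n_jobs=-1,
)
model.fit(train[feature_cols], train["is_bad_link"])

# Score the second timeframe and keep only the pairs predicted to be bad links;
# this filtering is what makes deeper hops computationally feasible.
valid["score"] = model.predict_proba(valid[feature_cols])[:, 1]
print("PR-AUC:", average_precision_score(valid["is_bad_link"], valid["score"]))
bad_pairs = valid.loc[valid["score"] > 0.9, ["account_a", "account_b"]]
```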

Finally, we could build our clusters using a recursive methodology. Here is an example of how it works:

We have a list of accounts, each linked to other accounts. Let's say Account A is linked to B and C, Account B is linked to D and E, and Account E is linked to F and G.

In this case, we get a family tree of seven accounts linked by three layers (hops).

So, let's generalize the process: we start by drilling down to the deepest account (the "highest letter" in the example above), located in layer N. Then, we collect all linked accounts from the bottom of this family tree upward. We will always have a final layer N, as we use a date limit as the stopping mechanism.

The result is one big cluster of accounts with a high probability of being linked and fraudulent.
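Here is a minimal sketch of this recursive expansion over the predicted-bad pairs, using the lettered example above; note that a simple hop limit stands in for the date-based stopping mechanism described in the text:

```python
from collections import defaultdict

# Hypothetical edge list: account pairs the classifier predicted to be bad links.
bad_pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("E", "F"), ("E", "G")]

links = defaultdict(set)
for a, b in bad_pairs:
    links[a].add(b)

def build_cluster(seed, max_hops):
    """Recursively collect every account reachable from `seed` within `max_hops` link layers."""
    cluster = {seed}

    def expand(account, hops_left):
        if hops_left == 0:
            return
        for linked in links[account]:
            if linked not in cluster:
                cluster.add(linked)
                expand(linked, hops_left - 1)

    expand(seed, max_hops)
    return cluster

print(sorted(build_cluster("A", max_hops=3)))
# ['A', 'B', 'C', 'D', 'E', 'F', 'G'] -- the seven-account family tree from the example
```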

In the real world, our clusters are much more complicated, consisting of many more layers. This method was found to be relatively quick, and it can handle the amount of data we have in our systems. We will share the results in the last section of this article.

Figure 1. N-Hop recursive clustering

Method 2 — Unsupervised Approach — Ilan Voronel

Here we wanted to try a moonshot solution and go for more innovative, out-of-the-box thinking. Rather than the supervised methods commonly used in fraud prevention solutions, we wanted to see whether an unsupervised approach could yield great results.

In this solution, we used two unsupervised models: PCA for dimensionality reduction and DBSCAN for clustering the points into groups.

In the first phase, after cleaning the data, we transformed it into numerical-only features as input for the dimensionality reduction model, PCA. We started with ~100 features and reduced them to a three-dimensional vector per account.
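A minimal sketch of this phase, assuming the cleaned, numerical-only account features are already available in a DataFrame (the file path and column count are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical cleaned, numerical-only account features (~100 columns, one row per account).
features = pd.read_parquet("account_features.parquet")  # placeholder path

# Scale the features first so no single feature dominates the principal components.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=3)
vectors_3d = pca.fit_transform(scaled)  # shape: (n_accounts, 3)
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```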

In the second phase, we tried to cluster these three-dimensional vectors into groups. Since we didn't know how many groups we had or what their sizes were, we used the DBSCAN algorithm. DBSCAN is very useful because of how it builds clusters: it does not require a predefined number of clusters, and it leaves data points that don't belong to any dense region unclustered (as noise) rather than forcing them into a cluster. In our case, this was extremely useful, as we didn't know which data points were relevant for cluster creation.
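Continuing the sketch above, clustering the three-dimensional vectors with DBSCAN might look as follows; `eps` and `min_samples` are illustrative values that would need tuning to the data:

```python
from sklearn.cluster import DBSCAN

# eps and min_samples are illustrative; in practice they are tuned to the data density.
dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(vectors_3d)

# DBSCAN marks points that do not belong to any dense region with the label -1 (noise),
# so accounts that are not useful for cluster creation are simply left unclustered.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {(labels == -1).sum()} noise points")
```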

Now we had cluster assignments for the data points. First, we evaluated the clusters using silhouette and purity scores. Silhouette measures the tightness and separation of the clusters, while purity tells us how homogeneous each cluster is with respect to the tagging we had.
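As a sketch of that evaluation, the silhouette score comes directly from scikit-learn, while purity can be computed as the share of the majority tag within each cluster; the tag file and column name below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score

# Assumed: an in-house 0/1 fraud tag per account, aligned with the feature rows
# (hypothetical file and column name).
tags = pd.read_parquet("account_tags.parquet")["is_fraud"].to_numpy()

# Silhouette: tightness and separation of the clusters (noise points excluded).
mask = labels != -1
print("silhouette:", silhouette_score(vectors_3d[mask], labels[mask]))

# Purity: for each cluster, count the majority tag and divide by the total number of points.
def purity(cluster_labels, point_tags):
    total = 0
    for cluster in np.unique(cluster_labels):
        cluster_tags = point_tags[cluster_labels == cluster]
        total += np.bincount(cluster_tags).max()  # size of the majority tag in this cluster
    return total / len(point_tags)

print("purity:", purity(labels[mask], tags[mask]))
```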

Once we reached a good enough result, we started creating cluster-level features. Here we tried to simplify things and just did simple aggregations for each of the clusters, such as sum, average, median, etc.
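A minimal sketch of such cluster-level aggregations, continuing the example above and joining the aggregates back to each account:

```python
# Attach the DBSCAN cluster id to each account's feature row.
features["cluster_id"] = labels

# Simple per-cluster aggregations (sum, mean, median) plus the cluster size.
cluster_features = (
    features[features["cluster_id"] != -1]
    .groupby("cluster_id")
    .agg(["sum", "mean", "median"])
)
cluster_features.columns = ["_".join(col) + "_cluster" for col in cluster_features.columns]
cluster_features["cluster_size"] = features.groupby("cluster_id").size()

# Each account now carries both its own features and its cluster's aggregated features;
# noise accounts (cluster_id == -1) simply get missing values for the cluster columns.
enriched = features.join(cluster_features, on="cluster_id")
```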

As a final stage, we trained a GBT classifier using account- and cluster-level features to detect bad accounts.
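A minimal sketch of this final stage, using scikit-learn's gradient-boosted trees as a stand-in for the GBT implementation actually used (the hyperparameters and the `tags` array are assumptions carried over from the sketches above):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Account-level plus cluster-level features; fill the missing cluster columns of noise accounts.
X = enriched.drop(columns=["cluster_id"]).fillna(0)
y = tags  # assumed 0/1 fraud tag per account

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

gbt = GradientBoostingClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
gbt.fit(X_train, y_train)
print("test accuracy:", gbt.score(X_test, y_test))
```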

Figure 2. 3D unsupervised clustering using DBSCAN

Results and Summary

Before diving into the results, we think this project proved that diverse experience, knowledge, and opinions are invaluable. Throughout the project, Ilan used his experience to guide which data science models we should use and how to leverage PayPal's integral big data systems, and he made sure we were following the right practices in developing and measuring our methods. Adam used his experience to indicate which data points were crucial for detecting fraud, which tags we should use, and which metrics we should evaluate to report our success from a financial standpoint.

The results of this project are extremely promising. Both methodologies proved able to find clusters of fraudulent accounts with very high accuracy, precision, and recall. They will be used moving forward to strengthen our line of defense and maintain our high standard of customer experience.

This work builds an appetite for further collaborations. What's next? Marketing and data science? Risk and product? The combinations are endless. At the end of the day, such initiatives trickle down to our customers and improve their experience, and since we are committed to a customer-first approach, pursuing collaborations like these is essential.

