
How Machine Learning helped power VTS Tenant Network Services

source link: https://buildingvts.com/how-machine-learning-helped-vts-build-tenant-network-services-6e4857ca8a4b


(Special shoutout to Carlo Bailey, who was the original author and data scientist for this initiative)

VTS’ mission is to be the number one commercial real estate platform where industry professionals operate and make data-driven decisions. Today, we have over 12 billion square feet of real estate on our platform, with thousands of the top landlords and brokers using it to streamline their operations. As such, our VTS Lease product has millions of tenants entered by thousands of landlords across properties worldwide.

With this data in hand, we can help our customers understand how real estate demand is evolving and highlight the existing tenant relationships they may already have. These two insights were fundamental in building out VTS’s recently launched Tenant Network Services on VTS Lease.

Achieving that required overcoming a significant hurdle: resolving tenant records from the hundreds of different technologies we integrate with. Here’s an example:

  • One landlord might have a potential lease with a company called Acme, Inc., leasing property in Queens.
  • Another landlord might have a potential lease with a company called Acme.com in a property in Manhattan, entered by a different broker.

Are these companies the same? Do they represent the same demand? Our job is to determine whether these two Acme Inc variations refer to the same company. This process is often known as entity resolution, which involves matching records across disparate datasets that refer to the same entity. This blog post will outline our solution to automate this process using machine learning.

What is Entity Resolution?

Entity resolution is a common problem in the data world. It involves reconciling entities across multiple data sources that share some common identifier (e.g., the same name but with minor spelling differences, location, URL, etc.). In the example below, we have three different records for Amazon, Inc. The entity resolution process would help us identify that these three versions refer to the same company (even though, technically, Amazon Digital Services is a subsidiary of Amazon, Inc.).

[Figure: three different tenant records that all refer to Amazon, Inc.]

There are many names for reconciling entities (e.g., record linkage, deduplication, fuzzy matching, named entity linking, etc.). There are also many varying approaches to entity resolution, each with its advantages and disadvantages:

1) Rules-based

This is the most straightforward approach to entity resolution, where links are created across records using some pre-established business logic. Records are matched via a deterministic record linkage procedure if all or some identifiers (above a certain threshold) are identical.

Deterministic record linkage is a good option when a common identifier exists or when there are several high-quality representative identifiers (e.g., name, industry, and location when identifying a company).
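
As a rough illustration, here is a minimal sketch of deterministic, rules-based matching in Python. The field names and the two-of-three threshold are purely illustrative, not VTS's actual business logic.

```python
# Minimal sketch of deterministic record linkage (hypothetical fields and threshold).
def normalize(value: str) -> str:
    return value.strip().lower()

def rules_based_match(record_a: dict, record_b: dict, threshold: int = 2) -> bool:
    """Declare a match when at least `threshold` identifiers are identical."""
    identifiers = ["name", "industry", "city"]  # illustrative identifiers only
    hits = sum(
        normalize(record_a.get(field, "")) == normalize(record_b.get(field, "")) != ""
        for field in identifiers
    )
    return hits >= threshold

# Example: same name and city, but differently labeled industries.
a = {"name": "Acme, Inc.", "industry": "Technology", "city": "Queens"}
b = {"name": "acme, inc.", "industry": "Software", "city": "Queens"}
print(rules_based_match(a, b))  # True: 2 of the 3 identifiers agree
```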

2) Fuzzy matching or approximate string matching

This method seeks to find pairs of text that match approximately rather than exactly. This is a common problem within computer science, with many well-known algorithms to solve it.

One of the most widely used measures is Levenshtein distance, which counts the minimum number of single-character edits it would take to turn one string into another. This metric could augment a rules-based approach to establish weights or thresholds to decide when two records are a match. Algorithms range from those that focus on formal similarity (e.g., Levenshtein distance) to those that focus on semantic relationships (see diagram below).

[Diagram: string matching algorithms, ranging from those focused on formal similarity (e.g., Levenshtein distance) to those focused on semantic relationships]
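
To make the edit-distance idea concrete, here is a small pure-Python Levenshtein implementation, along with a normalized similarity score that could feed a matching threshold. The strings and the 0.85 cutoff are arbitrary examples, not tuned production values.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(s: str, t: str) -> float:
    """Normalize edit distance into a 0-1 similarity score."""
    if not s and not t:
        return 1.0
    return 1 - levenshtein(s, t) / max(len(s), len(t))

print(levenshtein("acme inc", "acme.com"))          # 4 edits
print(similarity("acme inc", "acme.com") >= 0.85)   # False under this example threshold
```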

3) Machine learning & natural language processing (NLP)

More recently, various machine learning approaches have been applied to entity resolution. These involve training a model on historically verified human matches to produce a confidence score that two records refer to the same entity.

Our Approach

Recognizing the advantages and pitfalls of each classic entity resolution technique, we took a hybrid approach. Given that we needed our system to be (1) highly accurate, (2) scalable (as we’re handling millions of records), (3) fast, and (4) tolerant of variable data quality, we chose to combine machine learning, fuzzy matching, and business logic. Our methodology can be broken down into four steps:

[Diagram: the four steps of our pipeline: tokenization, blocking, feature extraction and fuzzy matching, and machine learning]

Step 1: Tokenization

We first convert the record’s strings (whether tenant name, industry, market, URL, etc.) into a normalized sequence of text. This is a common NLP task that involves converting strings to lowercase characters, splitting words on empty space, lemmatizing, converting to N-grams, etc., to make the text more machine-readable.
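
A rough sketch of what this normalization might look like, using only the Python standard library; the specific cleaning rules (punctuation stripping, the list of legal suffixes) are illustrative stand-ins rather than our production logic.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)                     # drop punctuation
    text = re.sub(r"\b(inc|llc|ltd|corp|co)\b", " ", text)   # drop common suffixes (illustrative list)
    return re.sub(r"\s+", " ", text).strip()

def trigrams(text: str) -> list[str]:
    """Character trigrams, used later in the blocking step."""
    padded = f"  {text} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(normalize("Acme, Inc."))   # "acme"
print(trigrams("acme"))          # ['  a', ' ac', 'acm', 'cme', 'me ']
```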

Step 2: Blocking

In reality, VTS is attempting to match thousands of tenant records with millions of existing companies. This leads to billions of possible pairwise comparisons between records to find a match, which is computationally infeasible! The blocking step reduces the number of comparisons by using cosine similarity to keep only records within the same “block” (i.e., keeping the N most similar records). We split tenant strings into high-dimensional trigram vectors and compute the pairwise cosine similarity between vectors.

[Diagram: blocking tenant records by trigram-vector cosine similarity]
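
Here is a minimal sketch of the blocking step. The use of scikit-learn is our own assumption for illustration; the idea is simply character-trigram vectors compared with cosine similarity, keeping the N most similar candidates per tenant.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tenants = ["acme inc", "acme.com", "globex corporation"]
companies = ["Acme, Inc.", "Globex Corp", "Initech LLC", "ACME Incorporated"]

# Character trigram vectors, fit over both sets so they share a vocabulary.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3), lowercase=True)
vectorizer.fit(tenants + companies)
tenant_vecs = vectorizer.transform(tenants)
company_vecs = vectorizer.transform(companies)

# Pairwise cosine similarity: one row per tenant, one column per candidate company.
sims = cosine_similarity(tenant_vecs, company_vecs)

# Keep only the N most similar companies per tenant as its "block".
N = 2
for i, tenant in enumerate(tenants):
    block = np.argsort(sims[i])[::-1][:N]
    print(tenant, "->", [companies[j] for j in block])
```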

Step 3: Feature extraction and fuzzy matching

Once similar records are identified, we run a series of string matching algorithms to identify tenant records that have highly similar names, matching industries, similar locations, and other attributes. There is a lot of nuance in this step, as attributes like industry are often not named consistently (e.g., technology and software could refer to the same industry). Therefore, we use fuzzy matching algorithms and language models to identify semantic similarity.
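
As an illustration of the kinds of pairwise features such a step might produce, here is a small sketch using the standard library's difflib; the specific features, the industry buckets, and the field names are hypothetical stand-ins rather than the exact signals we use.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Approximate string similarity between two tenant names (0-1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative mapping of industry labels to a coarser canonical bucket.
INDUSTRY_BUCKETS = {"technology": "tech", "software": "tech", "saas": "tech"}

def _bucket(label: str) -> str:
    return INDUSTRY_BUCKETS.get(label.lower(), label.lower())

def pair_features(a: dict, b: dict) -> dict:
    """Features describing how alike two tenant records are."""
    return {
        "name_sim": name_similarity(a["name"], b["name"]),
        "industry_match": float(_bucket(a["industry"]) == _bucket(b["industry"])),
        "same_market": float(a["market"].lower() == b["market"].lower()),
    }

a = {"name": "Acme, Inc.", "industry": "Technology", "market": "New York"}
b = {"name": "Acme.com", "industry": "Software", "market": "New York"}
print(pair_features(a, b))  # name_sim ~0.56, industry_match 1.0, same_market 1.0
```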

Step 4: Machine learning

Finally, we use thousands of human-verified tenant matches (shoutout to our Data Ops team!) to train a machine learning classifier to predict the likelihood that two records are a match. Given the amount of data VTS has in our system, we were able to produce a model with high precision and recall.
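
A heavily simplified sketch of that final step, reusing the pair features from above; the choice of scikit-learn's logistic regression and the toy labeled rows are assumptions for illustration, whereas the real training set is thousands of human-verified matches.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is [name_sim, industry_match, same_market] for a
# candidate pair, labeled 1 if humans verified the two records as the same tenant.
X_train = np.array([
    [0.95, 1.0, 1.0],
    [0.90, 1.0, 0.0],
    [0.85, 0.0, 1.0],
    [0.40, 0.0, 1.0],
    [0.30, 0.0, 0.0],
    [0.20, 1.0, 0.0],
])
y_train = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Confidence score that a new candidate pair (e.g., Acme, Inc. vs. Acme.com) is a true match.
candidate = np.array([[0.56, 1.0, 1.0]])
print(clf.predict_proba(candidate)[0, 1])
```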

Leveraging AWS Sagemaker

Given the amount of data in our system and the need for a globally scalable service, we leveraged AWS Sagemaker to build an automated machine learning pipeline that distributes the tasks above across multiple machines in the cloud. Check out our AWS blog post with Provectus to learn how we scaled our machine learning infrastructure to bring this tenant linking model into production.
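
For a flavor of what that orchestration can look like, here is a hypothetical sketch of running the linking steps as a SageMaker Processing job; the script name, S3 paths, IAM role, and instance settings are all placeholders, and the actual architecture is described in the linked AWS post.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Placeholder role, paths, and instance settings; not VTS's actual configuration.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=2,  # fan the pairwise comparisons out across machines
)

processor.run(
    code="tenant_linking.py",  # hypothetical script with the tokenize/block/match/classify steps
    inputs=[ProcessingInput(
        source="s3://example-bucket/tenant-records/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)
```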

Final Takeaways

While we faced some operational and technical hurdles along the way, the tenant linking pipeline is now being used by stakeholders across the company and has immediately impacted our business.

  • Having clean, deduplicated records of tenants provides landlords with immense insight into the operations of their physical spaces and a complete view of their tenant relationships (check out Tenant Network Services to learn more!)
  • The automated pipeline increased the velocity of linking tenants across data sources, traditionally a somewhat manual process involving hundreds of contractors. With this tool, our data operations team can quickly link tenants with very little oversight, allowing us to scale as our business grows.

Are you interested in tackling problems like this? Consider joining the VTS Data Science team! We are looking for talented data scientists to join us. Feel free to check out our careers page for openings.

