
How Machine Learning helped power VTS Tenant Network Services

source link: https://buildingvts.com/how-machine-learning-helped-vts-build-tenant-network-services-6e4857ca8a4b


(Special shoutout to Carlo Bailey, who was the original author and data scientist for this initiative)

VTS’ mission is to be the number one commercial real estate platform where industry professionals operate and make data-driven decisions. Today, we have over 12 billion square feet of real estate on our platform, with thousands of the top landlords and brokers using it to streamline their operations. As such, our VTS Lease product has millions of tenants entered by thousands of landlords across properties worldwide.

With this data in hand, we can help our customers understand how real estate demand is evolving and highlight the existing tenant relationships they may already have. These two insights were fundamental in building out VTS’s recently launched Tenant Network Services on VTS Lease.

Achieving that required overcoming a significant hurdle: resolving tenant records from the hundreds of different technologies we integrate with. Here’s an example:

  • One landlord might have a potential lease with a company called Acme, Inc., leasing property in Queens.
  • Another landlord might have a potential lease with a company called Acme.com in a property in Manhattan, entered by a different broker.

Are these companies the same? Do they represent the same demand? Our job is to determine whether these two Acme Inc variations refer to the same company. This process is often known as entity resolution, which involves matching records across disparate datasets that refer to the same entity. This blog post will outline our solution to automate this process using machine learning.

What is Entity Resolution?

Entity resolution is a common problem in the data world. It involves reconciling entities across multiple data sources that share some common identifier (e.g., the same name but with minor spelling differences, location, URL, etc.). In the example below, we have three different records for Amazon, Inc. The entity resolution process would help us identify that these three versions refer to the same company (even though, technically, Amazon Digital Services is a subsidiary of Amazon, Inc.).

[Figure: three different tenant records that all refer to Amazon, Inc.]

There are many names for reconciling entities (e.g., record linkage, deduplication, fuzzy matching, named entity linking, etc.). There are also many varying approaches to entity resolution, each with its advantages and disadvantages:

1) Rules-based

This is the most straightforward approach to entity resolution, where links are created across records using some pre-established business logic. Records are matched via a deterministic record linkage procedure if all or some identifiers (above a certain threshold) are identical.

Deterministic record linkage is a good option when a common identifier exists or when there are several high-quality representative identifiers (e.g., name, industry, and location when identifying a company).
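
As a rough illustration, here is a minimal sketch of deterministic, rules-based matching in Python. The field names and the two-of-three threshold are purely illustrative, not VTS's actual business logic.

```python
# Minimal sketch of deterministic record linkage (hypothetical fields and threshold).
def normalize(value: str) -> str:
    return value.strip().lower()

def rules_based_match(record_a: dict, record_b: dict, threshold: int = 2) -> bool:
    """Declare a match when at least `threshold` identifiers are identical."""
    identifiers = ["name", "industry", "city"]  # illustrative identifiers only
    hits = sum(
        normalize(record_a.get(field, "")) == normalize(record_b.get(field, "")) != ""
        for field in identifiers
    )
    return hits >= threshold

# Example: same name and city, but differently labeled industries.
a = {"name": "Acme, Inc.", "industry": "Technology", "city": "Queens"}
b = {"name": "acme, inc.", "industry": "Software", "city": "Queens"}
print(rules_based_match(a, b))  # True: 2 of the 3 identifiers agree
```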

2) Fuzzy matching or approximate string matching

This method seeks to find pairs of text that match approximately rather than exactly. This is a common problem within computer science, with many well-known algorithms to solve it.

One of the most widely used measures is Levenshtein distance, which counts the minimum number of single-character edits it would take to turn one string into another. This metric could augment a rules-based approach to establish weights or thresholds to decide when two records are a match. Algorithms range from those that focus on formal similarity (e.g., Levenshtein distance) to those that focus on semantic relationships (see diagram below).

[Diagram: string matching algorithms, ranging from those focused on formal similarity (e.g., Levenshtein distance) to those focused on semantic relationships]
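
To make the edit-distance idea concrete, here is a small pure-Python Levenshtein implementation, along with a normalized similarity score that could feed a matching threshold. The strings and the 0.85 cutoff are arbitrary examples, not tuned production values.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(s: str, t: str) -> float:
    """Normalize edit distance into a 0-1 similarity score."""
    if not s and not t:
        return 1.0
    return 1 - levenshtein(s, t) / max(len(s), len(t))

print(levenshtein("acme inc", "acme.com"))          # 4 edits
print(similarity("acme inc", "acme.com") >= 0.85)   # False under this example threshold
```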

3) Machine learning & natural language processing (NLP)

More recently, various machine learning approaches have been applied to entity resolution. These involve training a model on historically verified human matches to produce a confidence score that two records refer to the same entity.

Our Approach

Recognizing the advantages and pitfalls of each classic entity resolution technique, we took a hybrid approach. Given that we needed our system to be (1) highly accurate, (2) scalable (as we’re handling millions of records), (3) fast, and (4) tolerant of variable data quality, we chose to combine machine learning, fuzzy matching, and business logic. Our methodology can be broken down into four steps:

[Diagram: the four steps of our pipeline: tokenization, blocking, feature extraction and fuzzy matching, and machine learning]

Step 1: Tokenization

We first convert the record’s strings (whether tenant name, industry, market, URL, etc.) into a normalized sequence of text. This is a common NLP task that involves converting strings to lowercase characters, splitting words on empty space, lemmatizing, converting to N-grams, etc., to make the text more machine-readable.
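
A rough sketch of what this normalization might look like, using only the Python standard library; the specific cleaning rules (punctuation stripping, the list of legal suffixes) are illustrative stand-ins rather than our production logic.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)                     # drop punctuation
    text = re.sub(r"\b(inc|llc|ltd|corp|co)\b", " ", text)   # drop common suffixes (illustrative list)
    return re.sub(r"\s+", " ", text).strip()

def trigrams(text: str) -> list[str]:
    """Character trigrams, used later in the blocking step."""
    padded = f"  {text} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(normalize("Acme, Inc."))   # "acme"
print(trigrams("acme"))          # ['  a', ' ac', 'acm', 'cme', 'me ']
```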

Step 2: Blocking

In reality, VTS is attempting to match thousands of tenant records with millions of existing companies. This leads to billions of possible pairwise comparisons between records to find a match, which is computationally infeasible! The blocking step reduces the number of comparisons by using cosine similarity to keep only records within the same “block” (i.e., keeping the N most similar records). We split tenant strings into high-dimensional trigram vectors and compute the pairwise cosine similarity between vectors.

[Diagram: blocking tenant records by trigram-vector cosine similarity]
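
Here is a minimal sketch of the blocking step. The use of scikit-learn is our own assumption for illustration; the idea is simply character-trigram vectors compared with cosine similarity, keeping the N most similar candidates per tenant.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tenants = ["acme inc", "acme.com", "globex corporation"]
companies = ["Acme, Inc.", "Globex Corp", "Initech LLC", "ACME Incorporated"]

# Character trigram vectors, fit over both sets so they share a vocabulary.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3), lowercase=True)
vectorizer.fit(tenants + companies)
tenant_vecs = vectorizer.transform(tenants)
company_vecs = vectorizer.transform(companies)

# Pairwise cosine similarity: one row per tenant, one column per candidate company.
sims = cosine_similarity(tenant_vecs, company_vecs)

# Keep only the N most similar companies per tenant as its "block".
N = 2
for i, tenant in enumerate(tenants):
    block = np.argsort(sims[i])[::-1][:N]
    print(tenant, "->", [companies[j] for j in block])
```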

Step 3: Feature extraction and fuzzy matching

Once similar records are identified, we run a series of string matching algorithms to identify tenant records that have highly similar names, matching industries, similar locations, and other attributes. There is a lot of nuance in this step, as attributes like industry are often not named consistently (e.g., technology and software could refer to the same industry). Therefore, we use fuzzy matching algorithms and language models to identify semantic similarity.
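
As an illustration of the kinds of pairwise features such a step might produce, here is a small sketch using the standard library's difflib; the specific features, the industry buckets, and the field names are hypothetical stand-ins rather than the exact signals we use.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Approximate string similarity between two tenant names (0-1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative mapping of industry labels to a coarser canonical bucket.
INDUSTRY_BUCKETS = {"technology": "tech", "software": "tech", "saas": "tech"}

def _bucket(label: str) -> str:
    return INDUSTRY_BUCKETS.get(label.lower(), label.lower())

def pair_features(a: dict, b: dict) -> dict:
    """Features describing how alike two tenant records are."""
    return {
        "name_sim": name_similarity(a["name"], b["name"]),
        "industry_match": float(_bucket(a["industry"]) == _bucket(b["industry"])),
        "same_market": float(a["market"].lower() == b["market"].lower()),
    }

a = {"name": "Acme, Inc.", "industry": "Technology", "market": "New York"}
b = {"name": "Acme.com", "industry": "Software", "market": "New York"}
print(pair_features(a, b))  # name_sim ~0.56, industry_match 1.0, same_market 1.0
```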

Step 4: Machine learning

Finally, we use thousands of human-verified tenant matches (shoutout to our Data Ops team!) to train a machine learning classifier to predict the likelihood that two records are a match. Given the amount of data VTS has in our system, we were able to produce a model with high precision and recall.
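
A heavily simplified sketch of that final step, reusing the pair features from above; the choice of scikit-learn's logistic regression and the toy labeled rows are assumptions for illustration, whereas the real training set is thousands of human-verified matches.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is [name_sim, industry_match, same_market] for a
# candidate pair, labeled 1 if humans verified the two records as the same tenant.
X_train = np.array([
    [0.95, 1.0, 1.0],
    [0.90, 1.0, 0.0],
    [0.85, 0.0, 1.0],
    [0.40, 0.0, 1.0],
    [0.30, 0.0, 0.0],
    [0.20, 1.0, 0.0],
])
y_train = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Confidence score that a new candidate pair (e.g., Acme, Inc. vs. Acme.com) is a true match.
candidate = np.array([[0.56, 1.0, 1.0]])
print(clf.predict_proba(candidate)[0, 1])
```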

Leveraging AWS Sagemaker

Given the amount of data in our system and the need for a globally scalable service, we leveraged AWS Sagemaker to build an automated machine learning pipeline that distributes the tasks above across multiple machines in the cloud. Check out our AWS blog post with Provectus to learn how we scaled our machine learning infrastructure to bring this tenant linking model into production.
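
For a flavor of what that orchestration can look like, here is a hypothetical sketch of running the linking steps as a SageMaker Processing job; the script name, S3 paths, IAM role, and instance settings are all placeholders, and the actual architecture is described in the linked AWS post.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Placeholder role, paths, and instance settings; not VTS's actual configuration.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=2,  # fan the pairwise comparisons out across machines
)

processor.run(
    code="tenant_linking.py",  # hypothetical script with the tokenize/block/match/classify steps
    inputs=[ProcessingInput(
        source="s3://example-bucket/tenant-records/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)
```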

Final Takeaways

While we faced some operational and technical hurdles along the way, the tenant linking pipeline is now being used by stakeholders across the company and has immediately impacted our business.

  • Having clean, deduplicated records of tenants provides landlords with immense insight into the operations of their physical spaces and a complete view of their tenant relationships (check out Tenant Network Services to learn more!)
  • The automated pipeline increased the velocity of linking tenants across data sources, traditionally a somewhat manual process involving hundreds of contractors. With this tool, our data operations team can quickly link tenants with very little oversight, allowing us to scale as our business grows.

Are you interested in tackling problems like this? Consider joining the VTS Data Science team! We are looking for talented data scientists to join us. Feel free to check out our careers page for openings.

