Part 2: Supercharging search for ecommerce solutions with Algolia and MongoDB — Proposed solution and design

Jul 14th 2022 engineering


We invited our friends at Starschema to write about an example of using Algolia in combination with MongoDB. We hope that you enjoy this four-part series by Full Stack Engineer Soma Osvay.

If you’d like to look back or skip ahead, here are the other links:

Part 1 – Use-case, architecture, and current challenges

Part 3 – Data pipeline implementation

Part 4 – Frontend implementation and conclusion


When we discussed the challenges of integrating a third-party indexing system into the product, our engineers instantly brought up three potential problems:

  • How do we maintain data integrity and data readiness across multiple data providers?
  • How do we ensure that our application’s performance isn’t affected by the introduction of a third-party system?
  • How do we maintain our existing security and access control rules on the third-party system?

Until now, we've had a single source-of-truth database (the Listings database) where all the listings are stored. When introducing Algolia into the ecosystem, we have to prioritize keeping it up to date with that database. Any inconsistency between these systems could have a serious impact on our site's UX. We wouldn't want to end up with a situation where a search result:

  • throws a 404 Not Found error when clicked
  • is not up-to-date with the listing itself (i.e. it shows a different title, description, etc.)
  • does not show up for an existing real-estate listing

All of these scenarios would result in a loss of confidence in our service and translate directly into lost revenue. It is absolutely essential that we both perform an initial load of our existing dataset into Algolia and keep Algolia up to date with all future changes to that dataset.

Our backend application is already under heavy load. It is scaled horizontally using Kubernetes, but we want to avoid a large increase in operating costs caused by additional traffic to our servers. When designing a solution, we have to offload as much search traffic as possible to Algolia.

We also want to make sure that we don't compromise our application's security and access control. Our application currently doesn't require a logged-in session to query listings, so this is less critical, but when a user is logged in, it would be nice to pass their identity along to Algolia so it can be used to personalize search results and refine our internal reports.
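
For illustration only, and assuming the Algolia Python client, a search could carry the logged-in user's identity as a `userToken` (the JavaScript client exposes the same parameter on the frontend). The index name, API key, and user id below are placeholders, and Personalization has to be configured on the Algolia side before `enablePersonalization` has any effect:

```python
from algoliasearch.search_client import SearchClient

# Placeholders: a real search-only key and index name would come from config.
client = SearchClient.create("<app-id>", "<search-only-api-key>")
index = client.init_index("listings")

results = index.search(
    "2 bedroom apartment",
    {
        "userToken": "user-42",         # hypothetical id of the logged-in user
        "enablePersonalization": True,  # only meaningful once Personalization is set up
    },
)
print([hit["name"] for hit in results["hits"]])
```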

Design possibilities

We can break this down into three tasks:

  1. Index all of our existing listing data inside Algolia. To do this, we'll create a Python script inside a Jupyter Notebook to develop the data-loading logic, parts of which we'll be able to reuse later. If you're wondering why I chose a Jupyter notebook, it's because it allows fast iterations, partial executions, and a simple commenting system, all of which help a lot when prototyping and doing code reviews. I'll implement this in the third article in this series; a rough sketch of what the loading script could look like follows this list.
  2. Update the Algolia index regularly with any listing changes.
  3. Create the ability to search Algolia directly from the frontend so we don’t need to touch any legacy backend code.
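
To make task 1 concrete, here is a minimal sketch of what the initial-load script could look like, assuming the Algolia Python API client (`SearchClient`) and pymongo. The connection string, credentials, index name (`listings`), and the field selection in `to_record()` are illustrative placeholders, not the final implementation:

```python
from pymongo import MongoClient
from algoliasearch.search_client import SearchClient

# Placeholders -- swap in real connection details before running.
MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>/"
ALGOLIA_APP_ID = "<app-id>"
ALGOLIA_ADMIN_KEY = "<admin-api-key>"

collection = MongoClient(MONGO_URI)["sample_airbnb"]["listingsAndReviews"]
index = SearchClient.create(ALGOLIA_APP_ID, ALGOLIA_ADMIN_KEY).init_index("listings")

def to_record(doc):
    # Keep only the fields we want searchable and map Mongo's _id to Algolia's objectID.
    return {
        "objectID": str(doc["_id"]),
        "name": doc.get("name"),
        "description": doc.get("description"),
        "property_type": doc.get("property_type"),
    }

batch = []
for doc in collection.find({}):
    batch.append(to_record(doc))
    if len(batch) == 1000:
        index.save_objects(batch)  # batched upserts keep the number of API calls down
        batch = []
if batch:
    index.save_objects(batch)
```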

The updated architecture diagram would look like the following:

Diagram of the new architecture

Let’s take a look at some of the advantages and disadvantages of the different paths we could take on task #2. Here are some of our options:

  • a. Create more Python scripts that run as part of our existing ETL processes (which already sync data into Mongo). This could be a good option because the Algolia index is updated at the same time as the MongoDB data, keeping them in sync. It also runs as a separate task in our existing ETL process, so it can be monitored and maintained easily. On the other hand, if the database load task succeeds and the Algolia task fails, we'll have inconsistencies between our datasets that could require manual correction, which puts a large burden on our team.
  • b. Create Python scripts that sync data from Mongo to Algolia independently from our other ETL workflows. This lets us maintain and monitor Algolia independently as it refreshes its data based on the Mongo database regularly. This could put some extra strain on the data platform team, though, as it has to be hosted & maintained separately.
  • c. Use MongoDB Triggers. The idea here is that whenever a record is added, removed, or edited in MongoDB, the change is synced directly into Algolia through a Database Trigger firing a Function that calls the Algolia REST API. This automatically keeps our index up to date with whatever happens in MongoDB, without the need to implement third-party solutions. This plan isn't without cons, though: MongoDB operations can take a long time to execute, so performance could become an issue, and triggers can fail, so we'd still have to monitor them manually in the MongoDB interface. A sketch of this change-driven approach follows this list.
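
For illustration, here is a rough sketch of option c's change-driven idea expressed in Python with a MongoDB change stream rather than an actual Atlas Trigger (a Trigger itself would run a JavaScript Function calling the same Algolia API). The credentials, index name, and field mapping are placeholders, and resume tokens, batching, and error handling are omitted:

```python
from pymongo import MongoClient
from algoliasearch.search_client import SearchClient

collection = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")["sample_airbnb"]["listingsAndReviews"]
index = SearchClient.create("<app-id>", "<admin-api-key>").init_index("listings")

# Watch the collection for writes; "updateLookup" returns the full document on updates.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        op = change["operationType"]
        if op in ("insert", "update", "replace"):
            doc = change["fullDocument"]
            # Re-index the whole document, mapping Mongo's _id to Algolia's objectID.
            index.save_object({
                "objectID": str(doc["_id"]),
                "name": doc.get("name"),
                "description": doc.get("description"),
            })
        elif op == "delete":
            # Remove the record so deleted listings never surface in search results.
            index.delete_object(str(change["documentKey"]["_id"]))
```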

Whatever option we end up choosing, we’ll implement it in the third post of this series as well.

Lastly, part four will focus on creating a small web-based frontend to query the Algolia index. I want to be able to show our frontend developers a working solution with basic code so they can evaluate the time and effort required to integrate it into our existing frontend application.

Dataset & technologies used for implementation

To keep the implementation simple, I will use a dataset that is publicly available for MongoDB and is similar to my production data. There are multiple reasons behind this:

  • I want my scripts to be generic enough to be used for more than one of our applications in the future. The best way to build this in from the beginning is to design and develop the scripts against something other than my final dataset. This way, I can test their adaptability simply by running them against my production data later.
  • We’re also interested in community feedback here! I’m using a public dataset so you can give this a shot and tell us about your experience.
  • Our MongoDB instance and all of our ETLs are behind VPNs, and I am working from home. Since our VPN is slow, I don't want the load times to give me unrealistic performance numbers when moving the data from Mongo to Algolia.

I decided to go with MongoDB's official Sample AirBnB Listings Dataset, as it is fairly close to our existing data structure. I'm also going to use MongoDB Atlas to host my sample database, as well as a free Algolia account to store the records. I may be an expert in Python already (which is why we're using Jupyter notebooks), but I'm no expert in the frontend languages of HTML, CSS, and JavaScript, so this will be a great opportunity to test whether Algolia's SDKs are as simple as they're made out to be.
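
As a quick sanity check that the sample data is reachable, something like the following can be run once the Sample AirBnB dataset is loaded into the Atlas cluster. The database and collection names (`sample_airbnb`, `listingsAndReviews`) come from MongoDB's sample data, while the connection string is a placeholder:

```python
from pymongo import MongoClient

# Placeholder connection string; sample_airbnb/listingsAndReviews come from Atlas's sample data.
mongo = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
listings = mongo["sample_airbnb"]["listingsAndReviews"]

print(listings.estimated_document_count())  # how many sample listings we have to index
print(listings.find_one({}, {"name": 1, "description": 1, "property_type": 1}))
```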

In the first article of this series, I talked about our use-case, architecture and the search challenges we are facing.

In the third article of this series, I will implement the data ingestion into Algolia and figure out how to keep that data up-to-date.

In the fourth article of this series, I will implement a sample frontend so we can evaluate the product from the user’s perspective and give the developers a head-start if they choose to go with this option.

