Part 3: Supercharging search for ecommerce solutions with Algolia and MongoDB — Data pipeline implementation

Jul 14th 2022 engineering

We invited our friends at Starschema to write about an example of using Algolia in combination with MongoDB. We hope that you enjoy this four-part series by Full Stack Engineer Soma Osvay.

If you’d like to look back or skip ahead, here are the other links:

Part 1 – Use-case, architecture, and current challenges

Part 2 – Proposed solution and design

Part 4 – Frontend implementation and conclusion

Just a note before I get started: you can follow along with the implementation here.

In the last article, we analyzed our data pipeline architecture and left an open question about the way that we will run our Python scripts to load the Algolia index. There were 3 options:

Write Python scripts embedded in our ETL processes to update the Algolia index and MongoDB at the same time.
Host Python scripts that pull data from Mongo to Algolia completely independently from our existing ETL workflow.
Use MongoDB Triggers & Functions to update the Algolia index right after MongoDB updates.

After discussions with our engineering team, I decided to go with the first option, because we already have an established and sophisticated way of running our current data preparation pipeline with a lot of existing scripts to clean, aggregate, and format our data before we load it into our database. Adding an extra script here won’t take much effort, and all the maintenance and monitoring tools are readily available. After deciding on the architecture steps, I decided to make a single script that both performs the initial data load into Algolia and keeps the index up-to-date, instead of a script for each of those actions.

Thankfully, Algolia supports this kind of use-case by exposing a replace_all_objects method that actually creates a new temporary index first and then swaps it out with the live one once it’s done building. That makes for a near-instant transition between the old and the refreshed index without any downtime or data inconsistency.
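To make the "build a temporary index, then swap" idea concrete, here is a toy in-memory model of it. It is only a sketch of the pattern described above: FakeSearchService and its method are hypothetical stand-ins that I made up for illustration, and they do not reflect how the Algolia client is implemented internally.

```python
# Toy in-memory model of "build a temporary index, then swap it in".
# FakeSearchService is hypothetical and exists only for illustration.
class FakeSearchService:
    def __init__(self):
        self.indices = {}  # index name -> {objectID: record}

    def replace_all_objects(self, index_name, records):
        tmp_name = f"{index_name}_tmp"
        # 1. Build the complete new index under a temporary name.
        #    Lookups against index_name still see the old data at this point.
        self.indices[tmp_name] = {str(r["objectID"]): r for r in records}
        # 2. Swap: move the temporary index over the live one in a single step.
        self.indices[index_name] = self.indices.pop(tmp_name)


service = FakeSearchService()
service.replace_all_objects("listings", [{"objectID": 1, "name": "old"}])
service.replace_all_objects("listings", [{"objectID": 2, "name": "new"}])
live_ids = sorted(service.indices["listings"])
```

After the second call, only the new record is left in the live index and no temporary index remains, which is the property that makes a full daily refresh safe.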

Step 0. Planning

Before starting to implement my Python script, I had to register for a free Algolia account and create a sample dataset in MongoDB Atlas that I could use to fill my index.

I chose to go with the default Airbnb dataset that comes with Atlas out-of-the-box, because its format and use-case are very similar to my real-life data. I also made the sample dataset publicly hosted for anybody who is following along or would like to experiment:

Host: algolialistingstest.vswcm0y.mongodb.net
Username: ReadOnly
Password: AlgoliaTest
Database: sample_airbnb
Collection: listingsAndReviews

I decided to implement the script using a Jupyter Notebook, because it lets me test pieces of my code independently, annotate my code with Markdown, play around and model the data structure iteratively, and export the created Python code as a script file easily. It’s very versatile and interactive, and I generally love to use it. I’m hosting it on Google Colab, so I can share the code very easily without anybody having to install an on-premise Jupyter environment. You can find the implemented script here. We’re using the implemented script to:

Connect to Algolia using the Algolia Python API and validate the connection.
Connect to the MongoDB instance and retrieve sample data.
Prepare the Algolia index.
Load the dataset into Algolia from the MongoDB instance and replace the existing index.

Step 1. Connect to Algolia

The first step is generating an API key:

Register for a free Algolia account, or log in to your existing account.
After signing in, an Algolia application will automatically be created for you. You can either use the default unnamed application or create a new application.
Go to your API Keys section of your application and retrieve your Application ID and Admin API Key. You will need to use both of them when connecting your Algolia account from the Python code below.

We’ll need to install the Algolia Python client first, but afterwards, here’s what our connection code looks like:

# The Application ID of your Algolia Application
algolia_app_id = "[your_algolia_app_id_here]"
# The Admin API Key of your Algolia Application
algolia_admin_key = "[your_algolia_admin_key_here]"

# Define the Algolia Client and Index that we will use for this test
from algoliasearch.search_client import SearchClient

algolia_client = SearchClient.create(algolia_app_id, algolia_admin_key)
algolia_index = algolia_client.init_index("test_index")

# Test the index that we just created. We do this as part of the function, because these variables are not needed later
def test_algolia_index(index):
    # Clear the index, in case it contains any records
    index.clear_objects()
    # Create a sample record
    record = {"objectID": 1, "name": "test_record"}
    # Save it to the index
    index.save_object(record).wait()
    # Search the index for 'test_record'
    search = index.search("test_record")
    # Clear all items again to clear our test record
    index.clear_objects()
    # Verify that the first hit is our object
    if len(search["hits"]) == 1 and search["hits"][0]["objectID"] == "1":
        print("Algolia index test successful")
    else:
        raise Exception("Algolia test failed")

# Call our test function
test_algolia_index(algolia_index)
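Since the Admin API Key grants full access to the application, you may prefer not to paste it into a shared notebook. Below is a minimal sketch of reading both values from environment variables instead. The variable names ALGOLIA_APP_ID and ALGOLIA_ADMIN_KEY and the helper load_algolia_credentials are my own assumptions, not an Algolia convention.

```python
import os


def load_algolia_credentials(env=os.environ):
    """Return (app_id, admin_key) from the environment, or raise if one is missing."""
    # ALGOLIA_APP_ID and ALGOLIA_ADMIN_KEY are assumed names, not an Algolia convention
    names = ("ALGOLIA_APP_ID", "ALGOLIA_ADMIN_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

With this in place, the two assignments at the top of the connection code could become `algolia_app_id, algolia_admin_key = load_algolia_credentials()`.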

Step 2. Connect to Mongo and get data

First, install PyMongo, a Python MongoDB client, and then use this code to connect to our sample MongoDB database and read the sample data. Note that we’re only getting 5000 items so that we don’t overwhelm our free tier usage:

# Define MongoDB connection parameters
# Change these values if you are running your own MongoDB instance
db_host = "algolialistingstest.vswcm0y.mongodb.net"
db_name = "sample_airbnb"
db_user = "ReadOnly"
db_password = "AlgoliaTest"
collection_name = "listingsAndReviews"

connection_string = f"mongodb+srv://{db_user}:{db_password}@{db_host}/{db_name}?retryWrites=true&w=majority"

# Connect to MongoDB and get the MongoDB Database and Collection instances
from pymongo import MongoClient

# Create MongoDB Client
mongo_client = MongoClient(connection_string)
# Get database instance
mongo_database = mongo_client[db_name]
# Get collection instance
mongo_collection = mongo_database[collection_name]
# Retrieve the first 5000 records from the collection.
# limit() stops the cursor after 5000 documents instead of iterating over the whole collection
mongo_query = mongo_collection.find().limit(5000)
initial_items = list(mongo_query)
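If you want to reduce the amount of data transferred, one option is to ask MongoDB for only the top-level fields that the transformation in the next step reads. The sketch below assumes that field list; NEEDED_FIELDS, build_projection and fetch_listings are hypothetical names of mine, while find(filter, projection) and limit() are regular PyMongo calls.

```python
# Top-level fields that the transformation step reads (an assumption based on this article)
NEEDED_FIELDS = [
    "name", "space", "description", "neighborhood_overview", "transit",
    "property_type", "address", "accommodates", "bedrooms", "beds",
    "number_of_reviews", "bathrooms", "price", "weekly_price",
    "security_deposit", "cleaning_fee", "images", "review_scores",
]


def build_projection(fields):
    # MongoDB inclusion projection: {"field": 1, ...}; _id is returned by default
    return {field: 1 for field in fields}


def fetch_listings(collection, limit=5000):
    # Pass an empty filter plus the projection, and cap the number of documents
    cursor = collection.find({}, build_projection(NEEDED_FIELDS)).limit(limit)
    return list(cursor)
```

Calling `fetch_listings(mongo_collection)` would then return the same kind of list as initial_items, just with the unused attributes left out.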

Step 3. Transform our data into a form that suits Algolia

The objects in our MongoDB sample dataset contain many attributes, some of which are irrelevant to our Algolia index. We only keep those that are required either for searching or ranking.

The _id property will be kept, as it will be the Algolia object ID as well.
These properties will be kept either for searching, faceting, or displaying: name, space, description, neighborhood_overview, transit, property_type, address, accommodates, bedrooms, beds, number_of_reviews, bathrooms, price, weekly_price, security_deposit, cleaning_fee, images.
The review_scores attribute on the Airbnb entry will be transformed to a scores property, which will contain the number of stars that is given to the listing.
A _geoloc property will be added to the object based on fields in the original address object. This will be used for geosearching.
The following properties will be stripped completely since Algolia doesn’t need them: summary, listings_url, notes, access, interaction, house_rules, room_type, bed_type, minimum_nights, maximum_nights, cancellation_policy, last_scraped, calendar_last_scraped, first_review, last_review, amenities, extra_people, guests_included, host, availability, review_scores, reviews.

Here is this transformation code:

# We first define a function that shortens long texts. The sample data has some records with very long descriptions, which are irrelevant to our use-case and take up a lot of space to display
def strip_long_text(obj, trailWithDot):
    if isinstance(obj, str):
        # Cut the text to 350 characters, then drop everything after the last full stop (.) in what remains
        ret = obj[:350].rsplit(".", 1)[0]
        if trailWithDot and len(ret) > 0 and not ret.endswith("."):
            # Append (rather than assign) the trailing full stop so the text itself is kept
            ret += "."
        return ret
    else:
        return obj

# We define a function to validate number values coming from MongoDB. MongoDB can store numbers in Decimal128 format, which Algolia does not accept as a number (only as a string). This function:
# 1. Converts numbers to float from Decimal128
# 2. Introduces a maximum value for these numbers, as some values in MongoDB are outliers and incorrectly filled out and it gives range filters an unreal max value.
def validate_number(num, maxValue):
    if num is None:
        return num
    else:
        val = float(str(num))
        if val > maxValue:
            return maxValue
        return val

def prepare_algolia_object(mongo_object):
    # Create an instance of the Algolia object to index, and set its objectID based on the _id of the mongo object
    r = {}
    r["objectID"] = mongo_object["_id"]
    # prepare the string attributes
    for string_property in [
        ["name", True],
        ["space", True],
        ["description", True],
        ["neighborhood_overview", True],
        ["transit", True],
        ["address", False],
        ["property_type", False],
    ]:
        if string_property[0] in mongo_object:
            r[string_property[0]] = strip_long_text(
                mongo_object[string_property[0]], string_property[1]
            )

    # prepare the integer properties
    for num_property in [
        ["accommodates", 100],
        ["bedrooms", 20],
        ["beds", 100],
        ["number_of_reviews", 1000000],
        ["bathrooms", 100],
        ["price", 1000],
        ["weekly_price", 1000],
        ["security_deposit", 1000],
        ["cleaning_fee", 1000],
    ]:
        if num_property[0] in mongo_object:
            r[num_property[0]] = validate_number(
                mongo_object[num_property[0]], num_property[1]
            )

    # set rating if any
    if (
        "review_scores" in mongo_object
        and "review_scores_rating" in mongo_object["review_scores"]
    ):
        stars = round(mongo_object["review_scores"]["review_scores_rating"] / 20, 0)
        r["scores"] = {
            "stars": stars,
            "has_one": stars >= 1,
            "has_two": stars >= 2,
            "has_three": stars >= 3,
            "has_four": stars >= 4,
            "has_five": stars >= 5,
        }
    # set images
    if "images" in mongo_object:
        r["images"] = mongo_object["images"]
    # set GeoLocation if any
    if "address" in mongo_object:
        if "location" in mongo_object["address"]:
            if mongo_object["address"]["location"]["type"] == "Point":
                r["_geoloc"] = {
                    "lng": mongo_object["address"]["location"]["coordinates"][0],
                    "lat": mongo_object["address"]["location"]["coordinates"][1],
                }
    return r
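One detail of the rating conversion is worth knowing about: Python's built-in round() rounds exact halves to the nearest even number, so a rating that lands exactly on a half star does not always round up. The sketch below compares the expression used above with a round-half-up alternative. It assumes review_scores_rating is on a 0 to 100 scale, as the division by 20 implies; stars_builtin and stars_half_up are names I made up for the comparison.

```python
import math


def stars_builtin(rating):
    # Same expression as in prepare_algolia_object
    return round(rating / 20, 0)


def stars_half_up(rating):
    # Alternative: always round an exact half upwards
    return float(math.floor(rating / 20 + 0.5))


for rating in (50, 70, 90):
    print(rating, stars_builtin(rating), stars_half_up(rating))
# 50 2.0 3.0
# 70 4.0 4.0
# 90 4.0 5.0
```

Whether a rating of 90 should count as four or five stars is a product decision; the point is only that the has_five flag depends on which rounding rule you pick.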

Step 4. Define our index properties

Now let’s tell Algolia what to do with the properties we’ve given it. We’ll set [attributesToRetrieve](<https://www.algolia.com/doc/api-reference/api-parameters/attributesToRetrieve/>), the attributes that Algolia will return per search result for display in our UI, to an array of these properties: summary, description, space, neighborhood, transit, address, number_of_reviews, scores, price, cleaning_fee, property_type, accommodates, bedrooms, beds, bathrooms, security_deposit, images.picture_url, _geoloc. Our [attributesForFaceting](<https://www.algolia.com/doc/api-reference/api-parameters/attributesForFaceting/>) array will contain property_type, address.country, scores.stars, price, and cleaning_fee.

We’ll also set [searchableAttributes](<https://www.algolia.com/doc/api-reference/api-parameters/searchableAttributes/>), the attributes that Algolia searches when it processes a query. Algolia won’t waste time looking outside of this list for potential search matches, so it speeds up the query, and it lets us set the priority order from highest to lowest. Attributes listed on the same line share the same priority:

(top priority attributes) name, address.street, address.suburb
address.market, address.country
description (this will be an unordered attribute)
space (another unordered attribute)
neighborhood_overview (another unordered attribute)
(least priority) transit

We will also update the default ranking logic for our index:

(top priority) geo – providing search results close-by is the top priority for us
typo
words
filters
proximity
attribute
exact
(least priority) custom

We’re also updating our index to ignore plurals, so that singular and plural forms of a word match each other. You might not think about this much, but your users definitely will when a search doesn’t behave the way they expect. You can find other great resources and settings on the Official Algolia API Reference page. Here’s what our code for this looks like:

algolia_index.set_settings(
    {
        "searchableAttributes": [
            "name,address.street,address.suburb",
            "address.market,address.country",
            "unordered(description)",
            "unordered(space)",
            "unordered(neighborhood_overview)",
            "transit",
        ],
        "attributesForFaceting": [
            "property_type",
            "searchable(address.country)",
            "scores.stars",
            "price",
            "cleaning_fee",
        ],
        "attributesToRetrieve": [
            "images.picture_url",
            "summary",
            "description",
            "space",
            "neighborhood",
            "transit",
            "address",
            "number_of_reviews",
            "scores",
            "price",
            "cleaning_fee",
            "property_type",
            "accommodates",
            "bedrooms",
            "beds",
            "bathrooms",
            "security_deposit",
            "_geoloc",
        ],
        "ranking": [
            "geo",
            "typo",
            "words",
            "filters",
            "proximity",
            "attribute",
            "exact",
            "custom",
        ],
        "ignorePlurals": True,
    }
)
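To show how these settings come together at query time, here is a hedged sketch that builds the parameters for a geo search combined with a facet filter and a numeric filter. aroundLatLng, aroundRadius and filters are standard Algolia search parameters; build_search_params is a hypothetical helper of mine, and the coordinates, property type and price limit are made-up example values.

```python
def build_search_params(lat, lng, radius_m, property_type=None, max_price=None):
    """Build a search parameter dict; this helper is hypothetical, not part of the client."""
    params = {
        "aroundLatLng": f"{lat}, {lng}",  # centre of the geo search
        "aroundRadius": radius_m,  # radius in metres
    }
    filters = []
    if property_type is not None:
        # Facet filter; property_type is declared in attributesForFaceting
        filters.append(f'property_type:"{property_type}"')
    if max_price is not None:
        # Numeric filter on the price attribute
        filters.append(f"price <= {max_price}")
    if filters:
        params["filters"] = " AND ".join(filters)
    return params


params = build_search_params(40.7128, -74.0060, 5000, "Apartment", 200)
# With the client from Step 1: algolia_index.search("loft", params)
```

Because geo is the first ranking criterion in our settings, results for a query like this should be ordered primarily by distance from the given point.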

Step 5. Load the dataset into Algolia from MongoDB

This short piece of code loads the dataset into the Algolia index, replacing the existing index so there are no out-of-date records.

# Prepare the Algolia objects
algolia_objects = list(map(prepare_algolia_object, initial_items))
algolia_index.replace_all_objects(algolia_objects, {"safe": True}).wait()
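Before running the upload on a new dataset, it can be useful to check that every prepared record can be serialized to JSON, since values such as Decimal128 or datetime objects would otherwise only fail during the upload. The following is a minimal sketch of such a check; find_unserializable is a hypothetical helper, and Python's Decimal stands in for bson's Decimal128 so that the example runs without PyMongo.

```python
import json
from decimal import Decimal


def find_unserializable(records):
    """Return (objectID, attribute) pairs whose values json.dumps rejects."""
    problems = []
    for record in records:
        for key, value in record.items():
            try:
                json.dumps(value)
            except TypeError:
                problems.append((record.get("objectID"), key))
    return problems


# Decimal stands in for bson's Decimal128 so the sketch runs without PyMongo
sample = [
    {"objectID": "1", "price": 80.0},
    {"objectID": "2", "price": Decimal("80.00")},
]
print(find_unserializable(sample))  # [('2', 'price')]
```

Running `find_unserializable(algolia_objects)` and getting an empty list back would suggest that the transformation converted every problematic value.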

Script evaluation & performance

Overall, I found that loading an Algolia index from Python is quite a straightforward task, even though my Python skills are a little rusty. Most of my time actually went into preparing the Airbnb listing objects and transforming them into what I wanted inside Algolia. This would probably have been much simpler if I were working with our own datasets, as there wouldn’t have been as much transformation needed.

I learned that Algolia exposes a wonderful Python API: it’s simpler to use than I expected and has great documentation that guided me through the entire process, step-by-step. The code required to prepare and load the index is minimal, and it felt intuitive to me. It also performed well when loading the index: it needed just under 5 seconds to load and replace the entire index with 5000 records, even when run from a resource-limited, cloud-hosted server. When I ran it on some of our high-speed servers with a fast Internet connection, it took only about 2 seconds. Our production dataset is much larger (about 40k records), but our standard pipelines that prepare the listings data run for over an hour every day, so I am confident that the Algolia load will not noticeably affect our overall run time. So far, its simplicity and speed have far outweighed any drawbacks.

In the first article of this series, I talked about our use-case, architecture and the search challenges we are facing.

In the second article of this series, I covered the design specification of the PoC and talked about the implementation possibilities.

In the fourth article of this series, I will implement a sample frontend so we can evaluate the product from the user’s perspective and give the developers a head-start if they choose to go with this option.
