
Search indexing best practices for top performance (with code samples)


Every search interface relies on a fast back-end indexing process that keeps its search results as up to date as possible. But search indexing is only one side of the coin; the other is the real-time speed of a high-quality, relevant search engine.

For all search engines, the search request is the highest priority, with indexing a (very) close second. There are several reasons for this, but the most important is a business argument: every search is a potential game changer, a path to a conversion. Any slow or dropped search request, or irrelevant result, is a potential financial or business loss.

To achieve maximum speed & relevance, a search engine must:

  • Prioritize search requests over indexing requests
  • Structure its indexes so that queries execute in real-time (milliseconds), with the best relevance 

As a result, updating an index takes a little extra time. But if you follow a few indexing best practices, you can even things out.

“All well and good,” say the full stack and back-end developers. “I understand the priority of search. But I want to know more about my data. How do I get my data onto your servers? Can it handle my use cases? Does it accept any kind of data? Is it simple, secure, fast?” 

In a recent article on indexing, we explored a variety of advanced use cases and focused on two search indexing essentials: fast updates and wide applicability. Now it’s time to dig into the code and explain some speed-enhancing algorithms and indexing best practices that ensure you get the highest indexing speed for any search use case.

There are two primary areas to focus on here:

  • The wide applicability of our indexes
  • The high performance of our indexing API to update your data

The wide applicability of our search indexing

To understand indexing on its own terms, we need to decouple it from search and outline the most popular indexing scenarios:

Indexing for search

A well-structured index provides the foundation for a fast and fully-featured customer-facing search interface, with great relevance. In fact, indexing is so important to search & relevance that it needs to be designed and implemented with as much care and dedication as the front end.

Indexing to create a company-wide, multi-purpose, searchable data layer

Multiple indexes can form a single touch point for all back-office data. When put together in a certain way, your indexes can create a company-wide searchable data layer that lies between your back-office and all front ends used internally (employees) or externally (customers, partners).

Indexing as a “matchmaker” – the collaborative indexing use case

The “matchmaker” scenario is when Company X builds an Algolia index and makes it available to external data providers. In this scenario, Company X builds a collaborative website, such as a marketplace or streaming platform, where it displays the products/media of multiple vendors, partners, and contributors. To accomplish this, Company X exposes its Algolia index to these external data providers, allowing them to send data once they understand the format.
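For example, each vendor’s records can carry an attribute identifying the contributor, so the marketplace can filter, facet on, and attribute results. Here’s a minimal sketch using the same Python client as the examples later in this article; the index name, API key, and vendor attribute are illustrative, not a prescribed format:

#python
from algoliasearch import algoliasearch

# Company X's application; a vendor would use a key restricted to this index
client = algoliasearch.Client('YourApplicationID', 'VendorAPIKey')
marketplace_index = client.init_index('marketplace')

# Hypothetical shared record format agreed on with external data providers
marketplace_index.save_objects([
  {
    'objectID': 'acme-sku-1042',  # vendor-prefixed IDs avoid collisions
    'vendor': 'acme',             # identifies the contributor for filtering
    'item': 'Grape Bubble Gum',
    'price': 3.49,
  }
])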

Here’s the main difference between the first two scenarios:

  • A single search interface requires at least one index, which should be structured with that interface in mind.
  • In the company-wide data layer scenario, it’s different: you need to generalize the structure of your index(es). The data that makes up this multi-purpose layer needs to be structured to (a) accept feeds from widely different back-office applications, and (b) serve multiple use cases and interfaces, whether user-facing or system-to-system (see the sketch below).
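As an illustration, a record in such a shared layer might pair a few generic attributes, used by every interface, with a payload of source-specific fields. The attribute names here are hypothetical, not a prescribed schema:

#python
# Hypothetical generalized record for a company-wide data layer.
# Generic attributes (source, type, title) serve every interface;
# "payload" carries source-specific fields each front end can use as needed.
record = {
  'objectID': 'crm-contact-981',
  'source': 'crm',        # which back-office system produced the record
  'type': 'contact',      # lets one index serve several entity types
  'title': 'Jane Doe',    # generic display/search field for any UI
  'payload': {'account': 'ACME', 'region': 'EMEA'},
}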

What about indexing performance?

The wide applicability of our indexing wouldn’t be possible, nor would it survive a competitive digital business environment, if it weren’t performant in every situation. We offer high indexing speed out of the box, but sustaining it hinges on following indexing best practices. That’s what this article is about.

Just a word about what we mean by “out-of-the-box high performance”. Our indexing comes with the following technologies:

  • A search engine using advanced indexing techniques
  • High-performance bare-metal servers
  • A globally available cluster-based cloud infrastructure, with low latency and server redundancy (i.e., no server downtime)
  • An API with a retry method to ensure (contractually) 99.99% availability 

Best practices for fast indexing performance (with code snippets)

The most important indexing practice is to run a batching algorithm that updates multiple records in one indexing operation, in a regular and timely manner. This is true for all use cases. 

Why do we recommend batching? Because every indexing request carries a small performance cost: each one triggers a small “reindexing” of your index, which can take up to a second, or longer if the index is very large. Sending hundreds of indexing requests, one record at a time, can therefore build up an indexing queue that slows down the entire indexing process. To mitigate this, limit the load on the server by sending fewer, larger requests.

Taking all that into account, here are the 3 most important indexing best practices (pretty standard fare for data updates):

  1. Batching updates instead of sending updates one record at a time
  2. Incremental updates instead of full (re)indexing
  3. Partial indexing (updating only changed attributes)

1 – Batch indexing instead of updating one record at a time

One common mistake is to send one record at a time. If your back-end data changes constantly, it would be wrong to push each change as it occurs. As noted above, bottlenecks appear when hundreds of indexing requests queue up waiting to be processed.

Instead, as a best practice, use batch indexing: send each change to a temporary cache, then push that cache to Algolia at regular intervals, for example, every 5 minutes, or every 30 minutes for larger indexes. Never flush more often than once a minute, or you’ll recreate the bottleneck.

This code example builds a new index, saving the records in chunks of 10,000 per request with the save_objects method of Algolia’s Python API client.

#python
import json
from algoliasearch import algoliasearch

# Initialize the API client and the target index
client = algoliasearch.Client('YourApplicationID', 'YourAdminAPIKey')
algolia_index = client.init_index('bubble_gum')

# Load the records, then send them in chunks of 10,000 per indexing request
with open('bubble_gum.json') as f:
  records = json.load(f)

chunk_size = 10000
for i in range(0, len(records), chunk_size):
  algolia_index.save_objects(records[i:i + chunk_size])

See how our API has automated the batching process.
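If your data changes continuously, the cache-and-flush pattern described above can be as simple as the sketch below. The buffer and flush interval here are illustrative, not part of the API:

#python
import time
from algoliasearch import algoliasearch

client = algoliasearch.Client('YourApplicationID', 'YourAdminAPIKey')
algolia_index = client.init_index('bubble_gum')

FLUSH_INTERVAL = 300   # seconds; flush every 5 minutes, never below 60
buffer = []            # temporary cache of pending record changes
last_flush = time.time()

def queue_change(record):
  """Collect a changed record instead of indexing it immediately."""
  buffer.append(record)
  flush_if_due()

def flush_if_due():
  """Send the whole cache as a single batched indexing request."""
  global last_flush
  if buffer and time.time() - last_flush >= FLUSH_INTERVAL:
    algolia_index.save_objects(buffer)  # one request for many records
    buffer.clear()
    last_flush = time.time()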

2 – Incremental updates instead of full indexing

Building on the previous practice: even when batching, you don’t want to resend your entire dataset in every batch. To keep indexing requests small, perform incremental updates, sending only the records that are new or have changed.

This code adds a new Bubble Gum series.

#python
algolia_index.save_objects([
  {"objectID": "myID1", "item": "Classic Bubble Gum", "price": 3.99},
  {"objectID": "myID2", "item": "Raspberry Bubble Gum", "price": 3.99},
  {"objectID": "myID3", "item": "Cherry Bubble Gum", "price": 3.99},
  {"objectID": "myID4", "item": "Blueberry Bubble Gum", "price": 3.99},
  {"objectID": "myID5", "item": "Mulberry Bubble Gum", "price": 3.99},
  {"objectID": "myID6", "item": "Lemon Bubble Gum", "price": 3.99}
])

Note: It’s a good idea to do a full reindex of all records every night or at least weekly.

Check out our complete incremental updating solution.

3 – Partial indexing (updating only changed attributes)

To lower the indexing traffic even more, you’ll want to send only the attributes that have changed, not the whole record. For this, you’ll use a partial indexing strategy.

This code updates only the price of some of the bubble gums, touching no other attribute. Note that it uses partial_update_objects, which sends just the changed attributes, whereas save_objects replaces the whole record.

#python
algolia_index.partial_update_objects([
  {'objectID': 'myID1', 'price': 4.99},
  {'objectID': 'myID3', 'price': 4.99},
  {'objectID': 'myID6', 'price': 2.99}
])
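Partial updates can also use Algolia’s built-in operations (such as Increment and Decrement) to modify a numeric value server-side, without fetching the record first. For example, to knock a dollar off a price:

#python
algolia_index.partial_update_objects([
  {'objectID': 'myID2', 'price': {'_operation': 'Decrement', 'value': 1}}
])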

Check out our complete partial-indexing solution.

Further reading

Our first article on indexing presented a high-level overview of standard and advanced indexing use cases. This article walked you through indexing best practices and the implementation details of a standard indexing process. Our next article discusses how to optimize indexing in advanced use cases.

Our remaining articles will provide front & back end code for some of the advanced indexing use cases we discussed, starting with real-time pricing.

To get started with indexing, you can upload your data for free, or get a customized demo from our search experts today.

