
Why weights are often counterproductive in ranking

source link: https://www.algolia.com/blog/engineering/why-weights-are-often-counterproductive-in-ranking/

Search is a very complex problem

Search is a complex problem that is hard to customize to a particular use case — even search experts need a lot of iteration to configure a search engine.

Search is composed of several steps that all require different configurations. It is usually split into three steps:

  • Step 1: Retrieve all the hits potentially relevant to the query
  • Step 2: Rank those potential results so that the best ones come first
  • Step 3: Rerank (dynamic ranking) using external signals, typically with a learning-to-rank approach

In most modern search engines, the first two steps are executed concurrently for performance reasons, but each has a distinct goal: first, expand the search to the largest possible result set (i.e., optimize for recall), then order results from most to least relevant (i.e., optimize for precision).

Independently of the technology used, both the retrieval and the ranking processes are difficult and involve many parameters.

In this article, we explain where weights (also called boosts) are used as a solution to these challenges in search. Weights can feel like an intuitive solution, which is why they’ve been used for a long time. However, they can actually be dangerous and often counterproductive.

Challenges in the retrieval phase

The main challenge of the retrieval phase is to ensure that all the potentially relevant records are found. This challenge takes a different form depending on the technology used.

  • In a keyword-based search engine, the retrieval-phase challenge is to ensure that all the potential hits are scanned. The engine needs to identify all the approximations that may be relevant: all the synonyms, all the typos (including complex ones such as a concatenation or a split of two keywords), partial matches (where only a subset of the query words are found), etc.
  • In a typical vector-based search engine using LLM-based embeddings, the retrieval-phase challenge is to keep acceptable performance, because every hit has a non-zero similarity score with the query. To make it even more difficult, the scores for relevant vs. irrelevant items can vary greatly per query. Most engines use a graph representation (HNSW being the most popular) to scan only the results whose vectors are close to the query. Most engines also find that bi-encoders produce only rough approximations, and there is no single threshold score that works for relevance. Thus it is now commonplace to include a re-ranker, such as a cross-encoder, as a merge step between keyword and vector hits. Cross-encoders retain more of the attention relationships that make transformers work and can therefore make much better relevance decisions, even when the hits come from keyword retrieval. However, they need to process the query and hits together, so unlike bi-encoders they are not efficient for retrieval. A minimal sketch of this two-stage pattern follows this list.
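As a rough, self-contained illustration of this two-stage pattern: candidates are first retrieved by vector similarity, then the small candidate set is reranked by a joint query-document scorer. The embed() and cross_encoder_score() functions below are hypothetical stand-ins for real models, not any specific library.

```python
import numpy as np

# Hypothetical stand-ins for real models: a bi-encoder that embeds text
# independently (fast, approximate) and a cross-encoder that scores a
# (query, document) pair jointly (slow, more precise).
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def cross_encoder_score(query: str, doc: str) -> float:
    # A real cross-encoder attends over query and document together;
    # here we fake it with token overlap so the sketch stays self-contained.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

documents = [
    "13 inch laptop with aluminium body",
    "backpack with a padded laptop pocket",
    "wireless mouse",
    "mechanical keyboard",
]
query = "laptop 13"

# Stage 1: vector retrieval. Every document gets a non-zero similarity,
# so there is no universal threshold; we just keep the top-k candidates.
doc_vectors = np.stack([embed(d) for d in documents])
similarities = doc_vectors @ embed(query)
top_k = np.argsort(-similarities)[:3]

# Stage 2: rerank only the small candidate set with the more expensive scorer.
reranked = sorted(top_k, key=lambda i: cross_encoder_score(query, documents[i]),
                  reverse=True)
for i in reranked:
    print(f"{cross_encoder_score(query, documents[i]):.2f}  {documents[i]}")
```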

Today, more and more search engines rely on both keyword search and semantic search, meaning that both challenges need to be addressed.

Challenges in the ranking phase

In the ranking phase, the challenge is to merge all the signals together into one final way to order the results. In particular, three categories of signals are merged together in the ranking:

  • Textual signals: the information coming from the search engine technology when comparing the query and the retrieved record. This can be the output of a textual score in a keyword engine, such as a BM25F score, or the similarity score between the query vector and the text vector.
  • Business signals: business-related information that helps order results independently of the textual relevance between the query and the record. In particular, it can use the popularity of the item, its availability, items promoted for advertisement, contribution to the business objective, etc.
  • User signals: the information that is specific to the user. It contains some notion of availability (which records the user can access), preferences, geography, etc.
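To make these categories concrete, here is a minimal, hypothetical sketch of the signals one retrieved record might carry. The field names are illustrative, not a real engine schema.

```python
from dataclasses import dataclass

@dataclass
class RankingSignals:
    # Textual: how well the record matches the query (e.g. BM25F or vector similarity)
    textual_score: float
    # Business: query-independent value of the record for the business
    popularity: float
    in_stock: bool
    promoted: bool
    margin: float
    # User: signals specific to the person searching
    user_can_access: bool
    matches_user_preferences: float
    distance_km: float

signals = RankingSignals(
    textual_score=0.82, popularity=0.40, in_stock=True, promoted=False,
    margin=0.15, user_can_access=True, matches_user_preferences=0.7,
    distance_km=12.0,
)
```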

Mixing those signals is very complex, and the reflex is often to merge them with weights set by the business.

Weights inside the Textual Score

In the retrieval phase, the query information can be found inside multiple attributes (also called fields) of a record. Not all attributes have the same importance, which is why most engines ask the customer to set a score (or a boost) on every attribute to translate its business importance. Setting those scores is not an easy task and often leads to relevance problems: there is an infinite number of possible configurations, and many of them produce very similar results.

From the business point of view, attributes do not all carry the same value. For example, in an ecommerce store, matching the query “laptop 13” in the “category” attribute is not the same as matching it in the “description” attribute (a backpack whose description mentions a laptop pocket is not a relevant result for the “laptop” query).

The problem with setting the weights manually is that you never know whether you have the best configuration, and the best configuration actually varies across query types, so there is no single solution. You are limited by the number of configurations you can test, each of them requiring a test period. Even worse, it is common to iterate on the weights based on the observation of a small set of queries, without checking the global impact of the change in advance.

Instead of setting those weights manually, it is better to give a “hint” to an AI algorithm that will optimize the weights automatically and continuously (because the data also change over time). Such a hint is usually a list of attributes ordered from the most important to the least important, potentially with ties. For example, (“Category”, “Brand”, “Color”) > “Name” > “Description” when the business knows that “Category” and “Brand” are clean data reviewed by the business, while Name and Description are more generic.
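As a rough illustration of such a hint (this is a generic sketch, not any specific engine’s configuration API), records can be compared by the best tier in which the query matches, with finer scores only breaking ties inside a tier.

```python
# Hypothetical attribute "hint": tiers ordered from most to least important,
# with equal importance inside a tier.
ATTRIBUTE_TIERS = [
    ("category", "brand", "color"),   # curated, business-reviewed data
    ("name",),
    ("description",),
]

def best_matching_tier(query: str, record: dict) -> int:
    """Return the index of the highest tier where any attribute matches the query."""
    words = query.lower().split()
    for tier_index, attributes in enumerate(ATTRIBUTE_TIERS):
        for attribute in attributes:
            text = str(record.get(attribute, "")).lower()
            if any(word in text for word in words):
                return tier_index
    return len(ATTRIBUTE_TIERS)          # no match at all: worst tier

records = [
    {"name": "Travel backpack", "description": "padded laptop pocket", "category": "bags"},
    {"name": "UltraBook 13", "description": "thin and light", "category": "laptop"},
]

# Sort by tier first; a learned per-record score would only break ties within a tier.
ranked = sorted(records, key=lambda r: best_matching_tier("laptop", r))
print([r["name"] for r in ranked])      # the laptop outranks the backpack
```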

A weight to merge the Textual signals and the Business signals

Another common usage of weights is to merge the textual signals and the business signals: score = σ × TextualScore + (1 − σ) × BusinessScore. For example, a weight of 0.5 gives the same importance to the textual score and to the business score when computing the final ordering. This seems very intuitive but creates a lot of relevancy issues. To illustrate the problem, let’s take an extreme case where you want to sort by a business signal alone. In this case, you totally ignore the textual signals and sort by one business criterion (sorting by increasing price, for example). Because the search engine is designed to consider all potential hits, even the ones that are far from the query, you end up with results that have little to do with the query simply because they are cheap. This is a problem you find on a lot of websites, often forcing the customer to filter the query to get something a bit more relevant.
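A minimal sketch of this failure mode, with made-up items and scores (purely illustrative): a barely relevant but very cheap item can outrank a genuinely relevant one.

```python
# Illustrative only: made-up scores for candidates retrieved for the query "laptop 13".
# "textual" is the match quality, "business" is a cheapness score (cheaper = higher).
candidates = [
    {"name": "UltraBook 13",          "textual": 0.95, "business": 0.20},
    {"name": "Laptop sleeve 13 inch", "textual": 0.60, "business": 0.70},
    {"name": "USB cable",             "textual": 0.05, "business": 0.99},
]

sigma = 0.5  # equal weight to textual and business signals
for c in candidates:
    c["merged"] = sigma * c["textual"] + (1 - sigma) * c["business"]

# A cheap accessory outranks the actual laptop, and the irrelevant cable
# scores almost as high as the laptop.
for c in sorted(candidates, key=lambda c: -c["merged"]):
    print(f'{c["merged"]:.2f}  {c["name"]}')
```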

The fundamental problem is that the business score is merged in the same way for all results, while some of them are very far from the query and should be eliminated. For the sort, one classical solution is to filter the result set and only keep the good textual scores before applying the sort. In practice, most marketplaces use such an approach (you can detect it when the number of results for the “sort by” is lower than for the initial search).

For the general problem of merging the Textual Signals and the Business Signals, we have exactly the same problem as in the extreme case of a “sort by”. To be appropriate, it requires defining several buckets of textual relevance and then applying the sort inside each bucket.
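Here is a small sketch of that bucketed approach, with illustrative thresholds, scores, and prices: results are first grouped into coarse textual-relevance buckets, and the price sort only reorders items within a bucket.

```python
# Illustrative candidates for the query "laptop 13"; numbers are made up.
candidates = [
    {"name": "UltraBook 13",          "textual": 0.95, "price": 1299},
    {"name": "NoteBook Air 13",       "textual": 0.90, "price": 999},
    {"name": "Laptop sleeve 13 inch", "textual": 0.60, "price": 29},
    {"name": "USB cable",             "textual": 0.05, "price": 5},
]

def relevance_bucket(textual_score: float) -> int:
    """Coarse textual-relevance tiers: 0 = strong match, 1 = partial, 2 = weak."""
    if textual_score >= 0.8:
        return 0
    if textual_score >= 0.4:
        return 1
    return 2

# Sort by bucket first, then by increasing price inside each bucket.
ranked = sorted(candidates, key=lambda c: (relevance_bucket(c["textual"]), c["price"]))
for c in ranked:
    print(f'bucket={relevance_bucket(c["textual"])}  ${c["price"]:>5}  {c["name"]}')
```

With this ordering, the cheap but irrelevant items can no longer jump above the strong matches; they only compete with each other at the bottom of the list.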

Similar to the problem of weights inside the Textual Score, the best approach is to define the business score and let an algorithm optimize the merging for you based on user behavior.

Weights between different signals

Ranking is a complex merge problem between many signals. It is tempting to merge all of them with weights, at least at a high level between the Textual Signals, Business Signals, and User Signals. In practice, setting those weights manually is an order of magnitude more complex than in the previous two examples and is a recipe for failure.

Depending on the context (the query, the section of the website, etc.), the merge can be different in order to provide relevant results to the user. The search engine can still use weights internally, with a machine learning algorithm to set them; that is fine. But setting those weights manually never delivers the best results.

Weights are dangerous

Weights are more often part of the problem than part of the solution when they are set manually. The next time you are asked to manually configure a weight, think about the danger of such a setting and whether there is a way to configure it automatically. Note that LTR (Learning to Rank) algorithms often use boosted trees and other non-linear ways of finding dependencies and relationships between the ranking signals. Thus the optimal way of setting the weights ends up being different for basically every query. For the sort example above, such an algorithm could learn that the relevance cutoff should increase when a hard sort, like one on the price attribute, is applied.
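For illustration, here is a hedged sketch of learning-to-rank with boosted trees using LightGBM’s LGBMRanker on synthetic data. The features and labels are invented; a real setup would train on judged or click-derived relevance.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Features per (query, document) pair: e.g. textual score, popularity,
# price rank, and a flag indicating whether a hard sort was applied.
n_queries, docs_per_query = 200, 10
X = rng.random((n_queries * docs_per_query, 4))
# Synthetic relevance labels (0-3) loosely driven by the first feature.
y = np.clip((X[:, 0] * 4 + rng.normal(0, 0.5, len(X))).astype(int), 0, 3)
group = [docs_per_query] * n_queries          # documents grouped by query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)

# Score the candidates of one new query and order them by predicted relevance.
candidates = rng.random((10, 4))
order = np.argsort(-ranker.predict(candidates))
print(order)
```

Because the trees model non-linear interactions between features, such a model can, for example, learn to demand a higher textual score when the hard-sort flag is set, rather than applying one global weight everywhere.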

If you want to learn how Algolia is tackling this challenge, watch this space to learn more.

