
The past, present, and future of semantic search


Nov 4th 2022 | ai

Technologists who design search solutions are beginning to hear a lot about semantic search. But what is semantic search? And what does state-of-the-art semantic search even look like?

There's no single technology behind semantic search. Like the term "AI," it's a marketing label that can mean almost anything related to machine learning. In 1999, Tim Berners-Lee was one of the first to introduce the idea of the semantic web. Since then, the term "semantic search" has referred to many different technologies used in query processing.

In this post, I’ll explain semantic search and describe some of the major technologies in which it is used to deepen query processing. I’ll look at where these technologies come from, how they work, and where they’re headed. 

Spoiler alert: whether you're adding search to an e-commerce site or an enterprise intranet, semantic search technologies in conjunction with more traditional keyword search deliver the most complete and relevant results.

A brief history of query understanding

Keyword search and statistical ranking: starting 1970s

Keyword search has been around for a long time and works much like the index at the back of a book. A keyword search engine creates an index of all words across all documents and delivers results based on simple matching algorithms. 

To improve search relevance and result ranking, search engines introduced word statistics such as TF-IDF and BM25. Statistical ranking weighs a word's term frequency (TF), how often it appears within a document, against its inverse document frequency (IDF), how rare it is across all documents. For example, stop words like "the," "and," and "or" show up frequently everywhere, whereas words like "toothbrush" or "water" are far less common. A rare word that appears often in a document is a strong signal that the document is relevant to a query containing that word.
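To make this concrete, here is a minimal, hand-rolled sketch of the classic TF-IDF score over a toy corpus (the documents are invented, and real engines use more refined variants such as BM25):

    import math

    # Toy corpus: each document is a list of tokens (illustrative data).
    docs = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "the new electric toothbrush".split(),
    ]

    def tf_idf(term, doc, corpus):
        tf = doc.count(term) / len(doc)           # term frequency within this document
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(len(corpus) / df)          # rare terms get a higher weight
        return tf * idf

    print(tf_idf("the", docs[0], docs))           # 0.0: appears in every document
    print(tf_idf("toothbrush", docs[2], docs))    # ~0.27: rare, so it scores higher

A stop word that appears in every document gets an IDF of zero and contributes nothing to the score, which is exactly the behavior described above.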

Frequency-based statistics were very rudimentary and relied on exact matches. Keyword search algorithms built with Lucene APIs still rely on these statistical formulas today across a wide range of applications; they are extremely simple to implement and fast. However, to improve accuracy, customers must create synonym libraries, add rules, use additional metadata or keywords, or resort to other workarounds.

Introduction of NLP: starting 1980s

Statistical ranking was useful, but not enough; there were too many cases where the words in a document did not precisely match the query: singular vs. plural terms, verb inflections (present vs. past tense, present participles, etc.), agglutinative or compound languages, and so forth.

This led to the development of natural language processing (NLP) functions to help manage the complexity of languages. Some of these processes include:

  • Stemming: Stemming is the process of converting words into their base forms by removing prefixes and suffixes. This reduces resource usage and speeds up matching. For example, "change" and "changing" are both converted to the root form "chang".
  • Lemmatization: Similar to stemming, lemmatization brings words into their base (or root) form, but it does so by considering the context and morphological basis of each word. For example, "changed" is converted to "change" and "is" to "be". Note that stemming and lemmatization both serve to reduce words to their base forms, so most projects do one or the other (see the sketch after this list).
  • Word segmentation: In English and many Latin-based languages, the space is a good approximation of a word divider (or word delimiter), although this concept has limits because of the variability in how each language combines and separates word parts. For example, many English compound nouns are variably written (ice box = ice-box = icebox). However, the space is not found in all written scripts, and without it, word segmentation becomes a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
  • Speech tagging: Speech tagging, also called part-of-speech (PoS) tagging, is a way of classifying words as nouns, verbs, adjectives, etc., to more accurately process a query. It looks at the relationships between the words in a sentence to improve accuracy by more clearly identifying the meaning of the sentence.
[Figure: part-of-speech tagging example. Image via Medium]

  • Entity extraction: Entity extraction is another NLP technique, one that has become particularly important for voice search. As the name suggests, entity extraction identifies the different elements of a query (people, places, dates, frequencies, quantities, etc.) to help a machine "understand" the information it contains. It is a very good way to overcome simple keyword search limitations but, like the ontologies and knowledge graphs discussed below, it only works on specific domains and queries.
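As a minimal illustration of these steps, here is a sketch using the NLTK library (model downloads are required on first use, and exact outputs can vary slightly by version):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time model downloads (uncomment on first run):
    # nltk.download("punkt"); nltk.download("wordnet")
    # nltk.download("averaged_perceptron_tagger")

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Stemming crudely strips suffixes; lemmatization uses vocabulary and context.
    print(stemmer.stem("changing"))                  # "chang"
    print(lemmatizer.lemmatize("changed", pos="v"))  # "change"
    print(lemmatizer.lemmatize("is", pos="v"))       # "be"

    # Part-of-speech tagging labels each token in a sentence.
    tokens = nltk.word_tokenize("The quick brown fox jumps")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ')]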

Ontologies and knowledge graphs: starting 2005

Another method for developing a better, semantic understanding of a query was the use of ontologies and knowledge graphs. A knowledge graph represents the relationships between different elements (concepts, objects, events), while an ontology defines each of the elements and their properties.

Together, this semantic approach attempted to represent different concepts and the connections between them. Google, for example, used a knowledge graph not only to match the words in a search query, but also to look for the entities the query described. It was a way to get around the limitations of keyword search.

In practice, however, a knowledge graph and ontology approach is very hard to scale or port to different subjects, and subjects go out of date quickly: sports teams, world leaders, even product attributes. The knowledge graph and ontology you build for one domain won't easily transfer to the next. While you can build a highly robust solution for one subject, it may fail completely for a different subject that requires a different area of expertise. Only a few big companies, including Google, were able to build a knowledge graph automatically; most other companies had to build them manually.
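To see why this approach is powerful but labor-intensive, here is a minimal sketch of a hand-curated knowledge graph stored as subject-predicate-object triples (the entities and relations are invented for illustration):

    # A tiny hand-curated knowledge graph: (subject, predicate, object) triples.
    # Every fact must be authored and maintained by hand, which is why the
    # approach is so hard to scale and keep up to date.
    triples = [
        ("Adidas", "is_a", "brand"),
        ("Adidas", "sells", "running shoes"),
        ("running shoes", "used_for", "running"),
    ]

    def related(entity):
        """Return every fact that mentions the entity, in either position."""
        return [t for t in triples if entity in (t[0], t[2])]

    print(related("Adidas"))
    # [('Adidas', 'is_a', 'brand'), ('Adidas', 'sells', 'running shoes')]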

Autocomplete: starting 2004

Autocomplete is a very useful semantic search tool for helping customers find results faster. The most popular example is Google, which released autocomplete at the end of 2004.

Autocomplete attempts to anticipate search terms as customers type their query. It also offers contextual suggestions, helps users avoid typos, and filters content based on a user's location or preferences. The suggestions are generated by machine learning and natural language processing models that identify, match, and predict the outcome of the unfinished query, starting from a simple prefix string.

For autocomplete to work effectively, a search engine must have a lot of data to work with across all sessions, and it must also be able to anticipate search terms for each user based on their behavior, previous searches, geolocation, and other attributes.
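At the core of autocomplete is prefix matching over a corpus of popular queries. Here is a minimal sketch using a sorted list and binary search (the query log is invented; real systems layer popularity ranking and personalization on top):

    import bisect

    # Hypothetical query log, kept sorted so binary search can locate prefixes.
    popular_queries = sorted([
        "rain jacket", "running shoes", "running shorts", "ski boots", "ski jacket",
    ])

    def suggest(prefix, limit=3):
        """Return up to `limit` popular queries that start with `prefix`."""
        i = bisect.bisect_left(popular_queries, prefix)
        matches = []
        while i < len(popular_queries) and popular_queries[i].startswith(prefix):
            matches.append(popular_queries[i])
            if len(matches) == limit:
                break
            i += 1
        return matches

    print(suggest("run"))  # ['running shoes', 'running shorts']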

Predictive autocomplete has now become an expected feature for any modern, competitive search engine. 

AI ranking: starting 2007

Early keyword probability models like BM25 built relevance on term frequency, as discussed above. AI ranking took a big step forward by incorporating user feedback to further refine relevance. One example of this is reinforcement learning. The basic idea of reinforcement learning is quite simple: use feedback to reinforce (strengthen) positive outcomes. Instead of making large changes infrequently, reinforcement learning makes frequent, incremental changes. There are many upsides to this, such as continuously improving results and faster surfacing of other promising results. Additionally, poorly performing results tend to fall away quickly through rolling experimentation.
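Here is a minimal sketch of that feedback loop: each result keeps a running estimate of its click-through rate that is nudged a little after every impression (the learning rate and reward scheme are illustrative assumptions, not a production ranking system):

    from collections import defaultdict

    # Running relevance estimate per document, updated a little on every impression.
    score = defaultdict(float)

    def record_feedback(doc_id, clicked, lr=0.1):
        reward = 1.0 if clicked else 0.0
        # Incremental update: move the estimate slightly toward the observed outcome.
        score[doc_id] += lr * (reward - score[doc_id])

    # Simulated feedback: document "a" gets clicked often, document "b" never.
    for _ in range(20):
        record_feedback("a", clicked=True)
        record_feedback("b", clicked=False)

    ranked = sorted(score, key=score.get, reverse=True)
    print(ranked)  # ['a', 'b']: the well-performing result rises to the top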

Like autocomplete, reinforcement learning needs a lot of data to return meaningful results; it's a poor solution without significant historical performance data. Furthermore, reinforcement learning tends to be very good for search result ranking, but it does not help identify records; it still relies on keywords and linguistic resources to find matching records.

This is where vectors come into play.

Vector search: starting 2013

Vector representation of text is a very old idea. Its theoretical roots go back to the 1950s, and there were several key advances over the decades. We've also seen great innovation starting in 2013: new models based on neural networks leveraging large training sets (in particular BERT, released by Google in 2018) have set the standard.

[Figure: a three-dimensional vector space. Image via Google. The diagram shows vector dimensions along simple axes; in practice, there can be thousands of dimensions in use.]

What is vector search? At its simplest, it's a way to find related objects that share similar characteristics. Matching is accomplished by machine learning models that detect semantic relationships between objects in an index. Vectors can have thousands of dimensions, but to simplify, we can visualize them in a three-dimensional diagram (above). Vector search can capture relationships between words, and similar vectors cluster together. Words like "king," "queen," and "royalty" will cluster together, as will words like "run," "trot," and "canter."

Almost any object can be embedded and vectorized: text, images, video, music, etc. Early vector models used words as dimensions; every distinct word was a dimension and the value was the count of that word, which was overly simple. That changed with the advent of latent semantic analysis (LSA) and latent semantic indexing (LSI), which analyzed the relationship between documents and the terms they contain by reducing the number of dimensions. Today, newer AI models powered by vector engines are able to quickly retrieve information in a high-dimensional space.
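Nearness in that space is typically measured with cosine similarity. Here is a minimal sketch with made-up three-dimensional embeddings (real models learn hundreds or thousands of dimensions from data):

    import numpy as np

    # Invented 3-D "embeddings"; real models learn these values from data.
    embedding = {
        "snow":   np.array([0.9, 0.8, 0.1]),
        "skiing": np.array([0.8, 0.9, 0.2]),
        "guitar": np.array([0.1, 0.2, 0.9]),
    }

    def cosine(a, b):
        """Cosine similarity: near 1.0 means same direction, near 0.0 unrelated."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(embedding["snow"], embedding["skiing"]))  # ~0.99: related concepts
    print(cosine(embedding["snow"], embedding["guitar"]))  # ~0.30: far apart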

This has been a game changer. Newer vector-based solutions can now know that "snow," "cold," and "skiing" are related ideas. This advance has made some of the other technologies mentioned above, such as entity extraction, ontologies, and knowledge graphs, largely obsolete.

So, why don't vectors power all searches? For two reasons, primarily. One is that they're slow and expensive to scale. The other is that vector search doesn't return the same quality of results as keyword search does in several important use cases.

Online, consumers expect instant search results (Amazon and Google have both done studies on the negative effects of even 100 milliseconds of lag on consumer behavior). You can speed up and scale vector delivery, but it's expensive, and it will never match the speed of keyword search.

Vectors also don't provide the same quality of relevance as keyword search for some queries. Keyword search still works better than vectors on single-word queries and exact brand-match queries, while vectors tend to work better on multi-word queries, concept searches, questions, and other more complex query types. For example, when you query for "Adidas" on a keyword engine, by default you will only see the Adidas brand. The default behavior in a vector engine would be to return all shoe brands for the "Adidas" query (e.g., Nike, Puma, and Adidas) because they are all in the same conceptual space. For such queries, keyword search still provides better, more explainable, and more tunable results.

How can you get the best of both worlds? That’s where hybrid search comes in.

Hybrid Search: 2022 and beyond

Hybrid search is a new method that combines a full-text keyword search engine and a vector search engine behind a single API.

There is tremendous complexity in running both keyword and vector engines at the same time for the same query. Some companies have opted to sidestep the complexity by running these processes sequentially: they run a keyword search and then, if a certain relevance threshold isn't met, run a vector search. There are a lot of poor tradeoffs to doing this in speed, accuracy, and the limited ability to train each model.

True hybrid search is different. By combining full-text keyword search and vector search into a single query, customers can get more accurate results fast. Of course, for vector search to work as fast as keyword search, the engine must scale in performance without adding insane costs. For most vector engines today, this is not possible.
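One common way to merge the two result sets is reciprocal rank fusion (RRF); the sketch below is a generic illustration of score blending, not a description of any particular engine's implementation:

    from collections import defaultdict

    def reciprocal_rank_fusion(rankings, k=60):
        """Blend several ranked lists; k dampens the influence of lower ranks."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc in enumerate(ranking):
                scores[doc] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical result lists for the query "adidas running shoes".
    keyword_results = ["adidas-ultraboost", "adidas-samba", "adidas-gazelle"]
    vector_results  = ["adidas-ultraboost", "nike-pegasus", "puma-velocity"]

    print(reciprocal_rank_fusion([keyword_results, vector_results]))
    # 'adidas-ultraboost' wins because it ranks highly in both lists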

This scaling problem is where Neuralsearch™, a new technology acquired from Search.io, is able to help. Neuralsearch is one of the only true hybrid search services that delivers single-digit millisecond query times no matter the scale or query throughput.

This has been achieved by using hashing technology that reduces vectors to a tenth of their original size and requires no specialized hardware or GPUs to scale. I won't go into detail here on how hashing works (we have an entire blog post on vectors vs. hashes, and a generic sketch follows the list below), but I will say that we've seen some incredible results that are…

  • On par with the fastest keyword search available today
  • Noticeably more accurate due to utilizing both semantic and keyword retrieval together
  • Much easier to retrain exceptions thanks to a one-click teaching module we’re adding to the UI — more on this soon!
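As general background on vector hashing (this is the classic random-hyperplane locality-sensitive hashing idea, not a description of Neuralsearch's proprietary method), here is a sketch of how a dense vector can be compressed into a compact binary hash whose Hamming distance approximates vector similarity:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, BITS = 128, 16                    # illustrative sizes
    planes = rng.normal(size=(BITS, DIM))  # random hyperplanes define the hash

    def lsh_hash(vector):
        """Compress a dense vector into BITS sign bits (a binary hash)."""
        return planes @ vector > 0

    def hamming(h1, h2):
        """Count differing bits; a small distance implies similar vectors."""
        return int(np.count_nonzero(h1 != h2))

    a = rng.normal(size=DIM)
    b = a + 0.1 * rng.normal(size=DIM)        # a near-duplicate of a
    c = rng.normal(size=DIM)                  # an unrelated vector

    print(hamming(lsh_hash(a), lsh_hash(b)))  # small: near-duplicates mostly agree
    print(hamming(lsh_hash(a), lsh_hash(c)))  # larger: unrelated vectors disagree

Comparing short binary hashes is far cheaper than computing distances between full floating-point vectors, which is what makes this family of techniques fast at scale.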

The next decade

How search will look in another 10 years is hard to predict, but it will almost certainly consist of hybrid search capabilities, a combination of full-text keyword and vector search technologies that is more accurate than either technology alone.

If you are a technology leader tasked with implementing semantic search, I hope I have helped you to understand some of the major milestones in semantic technology, where we are today, and what state of the art will look like in the future.

