API with NestJS #83. Text search with tsvector and raw SQL

source link: https://wanago.io/2022/11/14/api-nestjs-text-search-tsvector-sql/

November 14, 2022

Searching through the contents of a database is a very common feature. In one of the previous articles, we learned how to implement it in a simple way using pattern matching.

Today we take it a step further and learn about the data types explicitly designed for full-text search.

Text Search Types

PostgreSQL provides two data types that help us implement full-text search. They allow us to search through a collection of texts and find the ones that best match a given query.

tsvector

The tsvector column stores the text in a format optimized for search. To parse a string into the tsvector format, we need the to_tsvector function.

SELECT to_tsvector('english', 'The quick brown fox quickly jumps over the lazy dog');

When we look at the result of the above query, we notice a set of optimizations. One of the most apparent is grouping duplicates. Thanks to using the English dictionary, PostgreSQL noticed that “quick” and “quickly” are two variants of the same word.

Also, the to_tsvector function filters out stop words. These are very common words that appear in almost every sentence and carry little value when searching through text. Since we used the English dictionary in the above example, PostgreSQL filtered out the words “the” and “over”.
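To make the two optimizations above more tangible, here is a toy TypeScript sketch of what such a transformation could look like. It is not how PostgreSQL implements to_tsvector: the stop-word list is a tiny made-up sample, and naiveStem is a hypothetical stand-in for a real stemmer that only strips a trailing "ly" or "s".

```typescript
// Tiny sample stop-word list (an assumption; PostgreSQL's English list is much longer).
const STOP_WORDS = new Set<string>(['the', 'over', 'a', 'an', 'and', 'of']);

// Hypothetical stand-in for a real stemmer: strips a trailing "ly" or "s".
function naiveStem(word: string): string {
  if (word.length > 4 && word.endsWith('ly')) return word.slice(0, -2);
  if (word.length > 3 && word.endsWith('s')) return word.slice(0, -1);
  return word;
}

// Maps each lexeme to the 1-based positions at which it occurs in the text.
function toyTsvector(text: string): Map<string, number[]> {
  const lexemes = new Map<string, number[]>();
  // ASCII-only tokenization, good enough for this illustration.
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  words.forEach((word, index) => {
    // Stop words are skipped, but they still consume a position.
    if (STOP_WORDS.has(word)) return;
    const lexeme = naiveStem(word);
    const positions = lexemes.get(lexeme) ?? [];
    positions.push(index + 1);
    lexemes.set(lexeme, positions);
  });
  return lexemes;
}

const toyVector = toyTsvector(
  'The quick brown fox quickly jumps over the lazy dog',
);
console.log(toyVector.get('quick')); // [ 2, 5 ]
console.log(toyVector.has('the')); // false
```

Both "quick" and "quickly" end up under a single entry with two positions, and the stop words leave no entry at all, which mirrors the grouping and filtering described above.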

tsquery

The tsquery data type stores the text we want to search for. To transform a string into the tsquery format, we can use the to_tsquery function.

SELECT to_tsquery('fox');

To check if a certain tsvector matches the tsquery, we need to use the @@ operator.

SELECT to_tsvector('english', 'The quick brown fox quickly jumps over the lazy dog') @@ to_tsquery('fox');

When doing the above, we can play with the &, |, and ! boolean operators. For example, we can use the ! operator to make sure a given text does not contain a particular word.

SELECT to_tsvector('english', 'The quick brown fox quickly jumps over the lazy dog') @@ to_tsquery('!cat');

Check out the official documentation for a good explanation of all available operators.

Another handy function is plainto_tsquery. It takes an unformatted phrase and inserts the & operator between words. Because of that, it is an excellent choice to handle the input from the user.

SELECT to_tsvector('english', 'The quick brown fox quickly jumps over the lazy dog') @@ plainto_tsquery('brown fox');
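As a rough illustration of why this suits user input, the sketch below approximates only the step of joining words with the & operator. The toAndQuery helper is hypothetical and ASCII-only. The real plainto_tsquery function also normalizes the words using the text search configuration, so in practice we should pass the raw string to PostgreSQL rather than build the query ourselves.

```typescript
// Hypothetical helper: mimics only the AND-joining step of plainto_tsquery.
// It does not stem words or remove stop words the way PostgreSQL does.
function toAndQuery(userInput: string): string {
  // Keep runs of ASCII letters and digits; punctuation and operators are dropped.
  const words = userInput.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.join(' & ');
}

console.log(toAndQuery('brown fox')); // brown & fox
console.log(toAndQuery('fox | !cat')); // fox & cat
```

Characters such as | and ! are treated as plain separators here, so a user cannot accidentally produce a malformed query.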

Transforming the existing data

Let’s take a look at our posts table.

CREATE TABLE posts (
  id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  title text NOT NULL,
  post_content text NOT NULL,
  author_id int REFERENCES users(id) NOT NULL
);

Unfortunately, it does not contain a tsvector column. The most straightforward solution to the above problem is to convert our data to tsvector on the fly.

SELECT * FROM posts
WHERE to_tsvector('english', post_content) @@ plainto_tsquery('fox');

We can take the above even further and combine the contents of the title and post_content columns to search through both.

SELECT * FROM posts
WHERE to_tsvector('english', post_content || ' ' || title) @@ plainto_tsquery('fox');

The crucial issue with the above approach is that it causes PostgreSQL to transform the text from every record of the posts table each time we run the query, which can take a substantial amount of time.

Instead, I suggest defining a generated column that contains the data transformed into the tsvector format.

ALTER TABLE posts
ADD COLUMN text_tsvector tsvector GENERATED ALWAYS AS (
  to_tsvector('english', post_content || ' ' || title)
) STORED

If you want to know more about generated columns, check out Defining generated columns with PostgreSQL and TypeORM

Since we use the STORED keyword, we define a stored generated column that is saved in our database. PostgreSQL updates it automatically every time we modify the post_content or title column.

We can now use our generated column when making a SELECT query to improve its performance drastically.

SELECT * FROM posts
WHERE text_tsvector @@ plainto_tsquery('fox');

Ordering the results

So far, we haven’t paid attention to the order of the results of our SELECT query. Sorting the search results based on relevance could help the users quite a bit.

For example, we can indicate that the text from the title column is more important than the post_content column. To do that, let’s change how we create our text_tsvector column and use the setweight function.

ALTER TABLE posts
ADD COLUMN text_tsvector tsvector GENERATED ALWAYS AS (
  setweight(to_tsvector('english', title), 'A') ||
  setweight(to_tsvector('english', post_content), 'B')
) STORED

Let’s compare the two following posts after modifying the text_tsvector column:

[Image: post_vector_1.png]

[Image: post_vector_2.png]

The combined value of the title and post_content is the same in both posts. However, the text_tsvector takes into account that the title column is more important.

Thanks to the above, we can now use the ts_rank function to order our results based on the weight of each column.

SELECT * FROM posts
WHERE text_tsvector @@ plainto_tsquery('brown fox')
ORDER BY ts_rank(text_tsvector, plainto_tsquery('brown fox')) DESC

Implementing full-text search with NestJS

Let’s create a migration first to implement the above functionalities in our NestJS project.

npx knex migrate:make add_post_tsvector
20221113211441_add_post_tsvector.ts
import { Knex } from 'knex';
export async function up(knex: Knex): Promise<void> {
  // Add the weighted tsvector column generated from the title and the content
  await knex.raw(`
    ALTER TABLE posts
    ADD COLUMN text_tsvector tsvector GENERATED ALWAYS AS (
      setweight(to_tsvector('english', title), 'A') ||
      setweight(to_tsvector('english', post_content), 'B')
    ) STORED
  `);
  // Index the generated column to speed up the @@ lookups
  return knex.raw(`
    CREATE INDEX post_text_tsvector_index ON posts USING GIN (text_tsvector)
  `);
}
export async function down(knex: Knex): Promise<void> {
  // Dropping the column also drops the index that depends on it
  return knex.raw(`
    ALTER TABLE posts
    DROP COLUMN text_tsvector
  `);
}

The crucial thing to notice above is that we are creating a Generalized Inverted Index (GIN). It is designed for values that consist of multiple components, such as the lexemes in a tsvector, which makes it a good fit for text searching. Doing that can speed up our SELECT queries significantly.
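To get an intuition for why an inverted index helps here, consider the toy in-memory sketch below. It only illustrates the general idea of mapping each lexeme to the rows that contain it and says nothing about how PostgreSQL stores a GIN index internally. The IndexedRow type and both functions are hypothetical names made up for this example.

```typescript
// Hypothetical shape of a row whose text was already split into lexemes.
type IndexedRow = { id: number; lexemes: string[] };

// Builds a map from each lexeme to the ids of the rows that contain it.
function buildInvertedIndex(rows: IndexedRow[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  for (const row of rows) {
    for (const lexeme of row.lexemes) {
      const ids = index.get(lexeme) ?? new Set<number>();
      ids.add(row.id);
      index.set(lexeme, ids);
    }
  }
  return index;
}

// Returns the ids of the rows that contain every given lexeme,
// similar in spirit to joining the words with the & operator.
function lookupAll(
  index: Map<string, Set<number>>,
  lexemes: string[],
): number[] {
  let result: number[] | null = null;
  for (const lexeme of lexemes) {
    const ids = index.get(lexeme) ?? new Set<number>();
    result =
      result === null
        ? Array.from(ids)
        : result.filter((id) => ids.has(id));
  }
  return (result ?? []).sort((a, b) => a - b);
}

const toyIndex = buildInvertedIndex([
  { id: 1, lexemes: ['quick', 'brown', 'fox'] },
  { id: 2, lexemes: ['lazi', 'dog'] },
  { id: 3, lexemes: ['brown', 'dog'] },
]);
console.log(lookupAll(toyIndex, ['brown', 'fox'])); // [ 1 ]
console.log(lookupAll(toyIndex, ['brown'])); // [ 1, 3 ]
```

The lookup touches only the entries for the searched lexemes instead of scanning every row, which is the property that makes this kind of index attractive for full-text search.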

If you want to know more about indexes, check out API with NestJS #82. Introduction to indexes with raw SQL queries

In one of the previous parts of this series, we implemented the support for the search query parameter.

posts.controller.ts
import {
  ClassSerializerInterceptor,
  Controller,
  Get,
  Query,
  UseInterceptors,
} from '@nestjs/common';
import { PostsService } from './posts.service';
import GetPostsByAuthorQuery from './getPostsByAuthorQuery';
import PaginationParams from '../utils/paginationParams';
import SearchPostsQuery from './searchPostsQuery';
@Controller('posts')
@UseInterceptors(ClassSerializerInterceptor)
export default class PostsController {
  constructor(private readonly postsService: PostsService) {}
  @Get()
  getPosts(
    @Query() { authorId }: GetPostsByAuthorQuery,
    @Query() { search }: SearchPostsQuery,
    @Query() { offset, limit, idsToSkip }: PaginationParams,
  ) {
    return this.postsService.getPosts(
      authorId,
      offset,
      limit,
      idsToSkip,
      search,
    );
  }
  // ...
}

Finally, we need to modify the SQL queries that we make in our repository.

posts.repository.ts
import { Injectable } from '@nestjs/common';
import DatabaseService from '../database/database.service';
import PostModel from './post.model';
@Injectable()
class PostsSearchRepository {
  constructor(private readonly databaseService: DatabaseService) {}
  async search(
    offset = 0,
    limit: number | null = null,
    idsToSkip = 0,
    searchQuery: string,
  ) {
    const databaseResponse = await this.databaseService.runQuery(
      `
      WITH selected_posts AS (
        SELECT * FROM posts
        WHERE id > $3 AND text_tsvector @@ plainto_tsquery($4)
        ORDER BY id ASC
        OFFSET $1
        LIMIT $2
      ),
      total_posts_count_response AS (
        SELECT COUNT(*)::int AS total_posts_count FROM posts
        WHERE text_tsvector @@ plainto_tsquery($4)
      )
      SELECT * FROM selected_posts, total_posts_count_response
    `,
      [offset, limit, idsToSkip, searchQuery],
    );
    const items = databaseResponse.rows.map(
      (databaseRow) => new PostModel(databaseRow),
    );
    // Every row carries the same total, so reading it from the first row is enough
    const count = databaseResponse.rows[0]?.total_posts_count || 0;
    return {
      items,
      count,
    };
  }
  // ...
}
export default PostsSearchRepository;

Above, we use keyset pagination, which relies on ordering the rows by id. Because of that, we can't sort the results by relevance in a straightforward way. If you want to know more, check out API with NestJS #77. Offset and keyset pagination with raw SQL queries
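If ordering by relevance matters more than keyset pagination for a given endpoint, one possible alternative is to use plain offset pagination and sort with ts_rank. The sketch below only builds the query text and its parameters. The buildRankedSearchQuery helper and the RankedSearchQuery type are hypothetical and not part of the repository shown above. It assumes the text_tsvector column from our migration exists.

```typescript
// Hypothetical container for a parameterized query.
type RankedSearchQuery = {
  text: string;
  values: Array<string | number | null>;
};

// Hypothetical helper: relevance-ordered search with offset pagination.
function buildRankedSearchQuery(
  searchQuery: string,
  offset = 0,
  limit: number | null = null,
): RankedSearchQuery {
  // The id acts as a tie-breaker so that rows with an equal rank
  // keep a stable order between pages.
  const text = `
    SELECT * FROM posts
    WHERE text_tsvector @@ plainto_tsquery($1)
    ORDER BY ts_rank(text_tsvector, plainto_tsquery($1)) DESC, id ASC
    OFFSET $2
    LIMIT $3
  `;
  return { text, values: [searchQuery, offset, limit] };
}

const rankedQuery = buildRankedSearchQuery('brown fox', 20, 10);
console.log(rankedQuery.values); // [ 'brown fox', 20, 10 ]
```

The resulting text and values could then be passed to a method such as runQuery. The trade-off is that offset pagination tends to get slower for pages far from the beginning, which is one of the reasons to prefer keyset pagination in the first place.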

Summary

In this article, we’ve gone through implementing full-text search in PostgreSQL. To do that, we had to learn about the tsvector and tsquery data types. In addition, we’ve created a stored generated column and a Generalized Inverted Index to improve the performance. By doing all of the above, we’ve created a fast search mechanism that is a good fit for many applications.

