4

Quick Pattern-Matching Queries in PostgreSQL and YugabyteDB

 1 year ago
source link: https://dzone.com/articles/quick-pattern-matching-queries-in-postgresql-and-y
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

This tutorial shows how to optimize pattern-matching requests in PostgreSQL and YugabyteDB by applying several indexing strategies. Since YugabyteDB is built on PostgreSQL source code, the strategies work for both single-server PostgreSQL deployments and multi-node YugabyteDB clusters.

Loading Sample Data

Let’s begin by building out an example.  Assume you have a table that stores user profiles. One of the columns is the user’s first name. Then, let’s find all the users whose name starts with An and is an arbitrary length. For that, we can use a pattern-matching query like  WHERE firstname LIKE ‘An%’.

First, let’s create a table with sample data.

1. Start by connecting to your PostgreSQL or YugabyteDB instance using your preferred method. As an option, you can do this in Docker within a minute.

2. Create a table that will store the names, and how frequently a name was used for newborns:

create table firstname (
  name text, 
  frequency int
);

3. Download sample data from here with 32,000+ first names.

4. Load the sample data to the table:

\copy firstname FROM '{path_to_file}/us_first_names.csv' csv header;

If you search for all the names that start with An, the databases will do a full table scan of 32,000+ records and return 665 records that satisfy the search criteria:

explain analyze select name, frequency from firstname where name like 'An%';

 Seq Scan on firstname  (rows=665 loops=1)
    Filter: (name ~~ 'An%'::text)
   Rows Removed by Filter: 32287

Note: all execution plans are shortened for clarity and provided for PostgreSQL only. YugabyteDB’s execution plans can look, or be formatted, differently because its storage engine is distributed and uses the LSM tree instead of B-tree (used in PostgreSQL).

The full table scan doesn’t look like a problem now. But imagine that new users keep signing up for your service, making the table bigger and bigger.

So, now you have thousands of users, but in three months you might have tens of thousands of active customers. In a year or two, the number can grow exponentially. To keep the latency of the earlier pattern-matching query predictable and low, regardless of the data volume, you need to use a proper database index.

Search With Default Index

Let’s try to improve the search by creating an index with default settings for the firstname column:

1. Create the index and update statistics for PostgreSQL. YugabyteDB doesn’t have (and need) the vacuum analyze command as long as its distributed storage works differently:

create index name_idx on firstname(name);
vacuum analyze; /* PostgreSQL only. */ 

2. Confirm the index is selected for a simple query using the equality operator:

explain analyze select name, frequency from firstname where name like 'Andrew';

 Index Scan using name_idx on firstname  (rows=2 loops=1)
    Index Cond: (name = 'Andrew'::text)
   Filter: (name ~~ 'Andrew'::text)

3. Finally, search for names that start with An:

explain analyze select name, frequency from firstname where name like 'An%';

Seq Scan on firstname  (rows=665 loops=1)
   Filter: (name ~~ 'An%'::text)
   Rows Removed by Filter: 32287

As the final execution plan shows, the current solution failed to use the index on my macOS laptop. It’s highly likely that the created index won’t be used on your machine as well. Let’s figure out why.

Search With “C” Collation

PostgreSQL and YugabyteDB rely on collations when it comes to the indexing of character string data types. A collation is a set of rules that define how to compare and sort character strings (refer to this article or video for more details on the topic).

Let’s talk about PostgreSQL first. By default, PostgreSQL uses locale-specific collation rules of your operating system. Usually, those rules don’t work for queries involving the LIKE or similar operators. My default collation on PostgreSQL is en_US.utf8, and it’s not suitable for pattern-matching requests:

show lc_collate;

 lc_collate
------------
 en_US.utf8

A suggested solution is to use the “C” collation that sorts and compares character strings strictly byte by byte. You can specify this collation for the whole database, on a per-column level, or while creating an index. Let’s use it for the index in PostgreSQL:

1. Drop the current index with the default operator class:

drop index name_idx;

2. Create a new index using the “C” collation:

create index name_idx on firstname(name collate "C");
vacuum analyze;

3. Search for all the names that start with An:

explain analyze select name, frequency from firstname where name like 'An%';

 Bitmap Heap Scan on firstname  (rows=665 loops=1)
   Filter: (name ~~ 'An%'::text)
   Heap Blocks: exact=88
   ->  Bitmap Index Scan on name_idx (665 loops=1)
         Index Cond: ((name >= 'An'::text) AND (name < 'Ao'::text))

Success! PostgreSQL used the index and found the 665 names that satisfy the search pattern.

In YugabyteDB, the situation is different. YugabyteDB already uses the “C” collation by default, so there is nothing to change here:

show lc_collate;
 lc_collate
------------
 C
(1 row)

However, the index was created using the HASH algorithm which is not suitable for pattern-matching requests:

\d+ firstname;
                                   Table "public.firstname"
  Column   |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
-----------+---------+-----------+----------+---------+----------+--------------+-------------
 name      | text    |           |          |         | extended |              |
 frequency | integer |           |          |         | plain    |              |

Indexes:
    "name_idx" lsm (name HASH)

To get the index selected for pattern-matching requests in YugabyteDB, we need to build the range index:

1. Drop the current index:

drop index name_idx;

2. Create a new index by specifying the ASC operator (ascending order). The operator enables the range indexing of data:

create index name_idx on firstname(name ASC);

3. Repeat the previous pattern-matching search:

explain analyze select name, frequency from firstname where name like 'An%';

 Index Scan using name_idx on firstname  (rows=665 loops=1)
 	Index Cond: ((name >= 'An'::text) AND (name < 'Ao'::text))
 Filter: (name ~~ 'An%'::text)

Job done! Now our pattern-matching request uses the index in both PostgreSQL and YugabyteDB.

One thing to keep in mind that applies to both databases: the index access is used only for characters preceding the first wildcard. Characters that go after the wildcard will be used for further sequential filtering over a result set prepared during the index traversal.

For instance, let’s say that you want to find all the names that start with An and end with a:

explain analyze select name, frequency from firstname where name like 'An%a';
 
Bitmap Heap Scan on firstname  (rows=220 loops=1)
   Filter: (name ~~ 'An%a'::text)
   Rows Removed by Filter: 445
   Heap Blocks: exact=85
   ->  Bitmap Index Scan on name_idx  (rows=665 loops=1)
         Index Cond: ((name ~>=~ 'An'::text) AND (name ~<~ 'Ao'::text))
  • The index is used for the characters before the wildcard - see the line Index Cond: ((name ~>=~ 'An'::text) AND (name ~<~ 'Ao'::text))
  • Then the prepared result set will be filtered sequentially - see the line Filter: (name ~~ 'An%a'::text)

Note on pattern_ops Operator

A quick note on the text_pattern_ops operator class (or varchar_pattern_ops) that is still suggested by many Internet resources as a go-to solution for pattern-matching requests in PostgreSQL.

This operator class can also be used for fast lookups via the index:

1. Create a new index specifying the text_pattern_ops class:

/* In PostgreSQL */
create index name_idx on firstname(name text_pattern_ops);
vacuum analyze;


/*In YugabyteDB */
create index name_idx on firstname(name text_pattern_ops ASC);

2. Execute the query with the LIKE operator:

explain analyze select name, frequency from firstname where name like 'An%';

 Bitmap Heap Scan on firstname  (rows=665 loops=1)
   Filter: (name ~~ 'An%'::text)
   Heap Blocks: exact=85
   ->  Bitmap Index Scan on name_idx  (rows=665 loops=1)
         Index Cond: ((name ~>=~ 'An'::text) AND (name ~<~ 'Ao'::text))

As you see, the text_pattern_ops does work, but many consider it a legacy solution. One limitation is that this operator class doesn’t support queries involving ordinary <, <=, >, or >= comparisons. You can learn more about the topic here.

Case Insensitive Search

What if you need to do a case-insensitive search using the ILIKE operator, or another method? Then the previously discussed solution using the “C” collation (as well as the text_pattern_ops class) won’t be sufficient.

As before, let’s find all the names that start with An (or an, AN, aN) using the case-insensitive ILIKE operator:

explain analyze select name, frequency from firstname where name ilike 'an%';

Seq Scan on firstname (rows=665 loops=1)
   Filter: (name ~~* 'an%'::text)
   Rows Removed by Filter: 32287

The query returns the same 665 first name but didn’t use the existing index. There are several solutions that support case-insensitive index lookups. One of them is trigrams that are backed by GiST or GIN indexes.

PostgreSQL (and inherently YugabyteDB) support the trigrams-based lookups via the pg_trgm extension. Let’s install it and use the extension for case-insensitive pattern-matching queries:

1. Drop the current index:

drop index name_idx;

2. Enable the pg_trgm extension:

create extension pg_trgm;

3. Create a new GIN index with the trigrams operator class:

create index name_idx on firstname using gin(name gin_trgm_ops);
vacuum analyze; /* PostgreSQL only. */ 

4. Perform a case-insensitive search for all the names that start with An/an/aN/AN:

explain analyze select name, frequency from firstname where name ilike 'an%';

 Bitmap Heap Scan on firstname (rows=665 loops=1)
   Recheck Cond: (name ~~* 'an%'::text)
   Heap Blocks: exact=85
   ->  Bitmap Index Scan on name_idx  (rows=665 loops=1)
         Index Cond: (name ~~* 'an%'::text)

This time both PostgreSQL and YugabyteDB used the index lookup.

Finally, as a bonus, with the trigrams-based index, you can run pattern-matching queries that start with a wildcard. For instance, if you need to find all the names that end with an, then just use the %an expression, and the databases will still pick the index:

explain analyze select name, frequency from firstname where name ilike '%an';

 Bitmap Heap Scan on firstname  (rows=1530 loops=1)
   Recheck Cond: (name ~~* '%an'::text)
   Heap Blocks: exact=175
   ->  Bitmap Index Scan on name_idx (rows=1530 loops=1)
         Index Cond: (name ~~* '%an'::text)

Summary 

As you see from the steps in this example, database indexing capabilities are very rich. You can create an optimal index for every business operation: you just need to study what types of indexes a database supports, and match those capabilities to your use cases.

If you’d like to complete this tutorial from start to finish and see how the databases behave, use these Docker commands to get PostgreSQL and YugabyteDB ready on your personal machine in under a minute.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK