Scaling INET queries in PostgreSQL


A very common requirement in most SOCs is to compare a given IP address against a table to extract contextual information such as originating country, owning organization, and so on. At CyberSift we had a similar challenge. Most geo providers will let you download data that loads into a table in this format:

CREATE TABLE "ip_loc" (
"start_ip" inet,
"end_ip" inet,
"iso2" varchar(4),
"state" varchar(100),
"city" varchar(100)
);
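As a sketch, loading a provider’s CSV dump into this table from psql might look like the following (the file name and column order are assumptions):

-- hypothetical CSV load; adjust the file name and column order to your provider
\copy ip_loc (start_ip, end_ip, iso2, state, city) FROM 'geoip.csv' WITH (FORMAT csv)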

You then need to query the table for the row where your given IP address falls between start_ip and end_ip. The problem is that, at last count, there are 11,481,758 rows in that table – and that is just for IPv4. I can’t imagine what will happen with IPv6.

Your first (correct) thought would be to create an index on start_ip and end_ip, then use something like:

SELECT * FROM ip_loc WHERE inet(host('8.8.8.8')) between start_ip and end_ip LIMIT 1;

ASIDE: the above is one way of doing it… you could use other operators to the same effect.
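For completeness, the naive index might look like this (the index name is my own):

-- hypothetical b-tree index over the range bounds
CREATE INDEX ip_loc_range_idx ON ip_loc (start_ip, end_ip);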

Close… but the query is still fantastically slow, taking on the order of a couple of seconds to complete. At CyberSift’s scale, this is a problem. Running EXPLAIN on the query shows us why:

                              QUERY PLAN
------------------------------------------------------------------------
 Limit  (cost=0.00..0.68 rows=1 width=52)
   ->  Seq Scan on ip_loc  (cost=0.00..867875.74 rows=1275751 width=52)
[truncated output]

A sequential scan… even though we have an index covering those columns?

Why? Because even if you create an index on those two columns, there is no guarantee that Postgres will actually use it. If the query planner estimates that too many rows will be returned (NB: estimated, not actually returned), it falls back to a sequential scan of the table, which we want to avoid at all costs. You can see this in the plan above: the planner estimates 1,275,751 matching rows, roughly 11% of the table, so it judges the sequential scan to be cheaper than the index.
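One way to confirm that this is the planner’s choice rather than a missing index is to disable sequential scans for the session and re-run EXPLAIN (a diagnostic sketch only; not something to leave on in production):

SET enable_seqscan = off;
EXPLAIN SELECT * FROM ip_loc WHERE '8.8.8.8'::inet BETWEEN start_ip AND end_ip LIMIT 1;
RESET enable_seqscan;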

To solve this, we used a pre-processing step. When loading our data, we merge the two columns into a single subnet and store it in an additional inet column. This is easy to do even after the initial table is created:

ALTER TABLE ip_loc ADD COLUMN subnet inet;
-- populate first, then index, so the bulk UPDATE avoids index-maintenance overhead
UPDATE ip_loc SET subnet = inet_merge(start_ip, end_ip);
CREATE INDEX ON ip_loc USING gist (subnet inet_ops);

Note the GiST index on the new column, and how it is populated using the magical inet_merge function 🙂 One caveat: inet_merge returns the smallest network containing both addresses, so if a provider’s range does not fall exactly on a CIDR boundary, the stored subnet will be slightly wider than the original start/end range.
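To see what inet_merge actually does, here is a quick example (the addresses are my own illustration):

-- inet_merge returns the smallest CIDR network containing both inputs
SELECT inet_merge('192.0.2.0'::inet, '192.0.2.255'::inet);
-- => 192.0.2.0/24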

Now your queries can be changed from “between two inets” to “contained in the subnet”. From the query planner’s point of view, this is expected to return far fewer rows, so it will use the GiST index we created above, leading to much faster queries. For example:

select * from ip_loc where '92.56.6.3'::inet << subnet;

ASIDE: << is strict containment, so a /32 subnet does not contain itself. If your table has single-address ranges, use <<= instead.

EXPLAIN shows the change:

[Screenshot: EXPLAIN output showing an index scan using the GiST index on subnet, instead of the sequential scan]

We’re back down to a couple of milliseconds 🎉🎉
