pg_analytics: Transforming Postgres into a Very Fast Analytical Database
source link: https://docs.paradedb.com/blog/introducing_analytics
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
ParadeDB’s latest extension, pg_analytics
, makes ParadeDB the world’s fastest Postgres-based analytical
database. With pg_analytics
installed, ParadeDB is 94x faster than regular Postgres, 8x faster than
Elasticsearch, and nearly ties ClickHouse on analytical benchmarks1.
Today, developers who store billions of data points in Postgres struggle with slow query times and poor data compression. Even with database tuning, complex analytical queries (e.g. counts, window functions, string aggregations) can take anywhere from minutes to hours. Many organizations turn to an external analytical data store like Elasticsearch as a result. This increases operational complexity as data becomes siloed and engineers must learn to use a new database.
By accelerating analytical queries directly inside Postgres, pg_analytics
is a drop-in solution for analytics in Postgres without the need to extract, transform, and
load (ETL) data into another system. The goal of this blog post is to share how pg_analytics
was
built and why now is an unprecedented time for building a Postgres-based analytical database.
How It Works
Regular Postgres tables, known as heap tables, organize data by row. While this makes sense for operational
data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the
columns in a table. pg_analytics
solves this by integrating Apache Arrow, a column-oriented data format,
and Apache DataFusion, an embeddable query engine for Arrow, within Postgres.
The extension uses two features of the Postgres API — executor hooks and the table access method. The table access method is responsible for creating a new kind of table that can emit Arrow data batches and receive analytical queries. Executor hooks intercept and reroute queries to DataFusion, which creates an optimal query plan, executes it, and returns the results to Postgres.
Data is persisted to disk with Parquet, a highly-compressed file format for column-oriented data. Thanks to Parquet, ParadeDB compacts data 5x more than both regular Postgres and Elasticsearch.
The final dependency is delta-rs
, a Rust-based implementation of Delta Lake. This library adds ACID
transactions, updates and deletes, and file compaction to Parquet storage. It also supports querying over data
lakes like S3, which introduces the future possibility connecting Postgres tables to cloud data lakes.
Why DataFusion
The history of analytical databases in Postgres extends back to 2005 with a project called Greenplum. Since then, several companies like Citus and Timescale have released similar products.
Twenty years later, the performance gap between these databases and their non-Postgres, OLAP counterparts is wide. This is one reason that systems like Elasticsearch are popular even among companies that prefer Postgres.
Recently, embeddable query engines like DataFusion have changed the game by surpassing the query speed of many OLAP databases. DataFusion teases the idea of excellent analytical performance from any database — including Postgres.
Andy Pavlo, professor of databases at Carnegie Mellon, was right. Today, we’ve reached a point where it does not make sense to build a query engine from scratch within a database. Instead, the next generation of analytical databases should integrate existing, embeddable query engines2 like DataFusion that can continuously improve the database as the engine itself improves.
Getting Started
At the time of writing, pg_analytics
is open source and in an MVP state. Almost all Postgres queries and
basic operations like inserts and vacuums are supported. Our roadmap can be found in the project
README.
The easiest way to try pg_analytics
is by running the ParadeDB Docker image. Once connected, you can follow
this toy example.
CREATE EXTENSION pg_analytics;
-- Create a deltalake table
CREATE TABLE t (a int) USING deltalake;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a deltalake table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;
The core ParadeDB team is focused on making pg_analytics
production-ready.
We welcome community contributions and are active on Slack.
Finally, please don’t hesitate to show your support by giving us a star!
Footnotes
-
According to Clickbench, a benchmarking tool for analytical databases.
-
We also evaluated DuckDB, Polars, and Velox as candidates for an embedded query engine. DuckDB is a popular in-process OLAP database, Polars is a dataframe processing library built on Arrow, and Velox is a query execution library built by Meta. DataFusion was chosen for three reasons. First, it interoperates with a storage framework like Delta Lake, which provides essential properties like ACID transactions. Secondly, its API was intended to be embedded inside another database, unlike standalone databases like Polars and DuckDB. Finally, it’s written in Rust and comes with a query parser and optimizer, unlike Velox.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK