
pg_analytics: Transforming Postgres into a Very Fast Analytical Database


ParadeDB’s latest extension, pg_analytics, makes ParadeDB the world’s fastest Postgres-based analytical database. With pg_analytics installed, ParadeDB is 94x faster than regular Postgres, 8x faster than Elasticsearch, and nearly ties ClickHouse on analytical benchmarks [1].

Today, developers who store billions of data points in Postgres struggle with slow query times and poor data compression. Even with database tuning, complex analytical queries (e.g. counts, window functions, string aggregations) can take anywhere from minutes to hours. Many organizations turn to an external analytical data store like Elasticsearch as a result. This increases operational complexity as data becomes siloed and engineers must learn to use a new database.

By accelerating analytical queries directly inside Postgres, pg_analytics is a drop-in solution for analytics in Postgres, with no need to extract, transform, and load (ETL) data into another system. The goal of this blog post is to share how pg_analytics was built and why there has never been a better time to build a Postgres-based analytical database.

How It Works

Regular Postgres tables, known as heap tables, organize data by row. While this makes sense for operational data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the columns in a table. pg_analytics solves this by integrating Apache Arrow, a column-oriented data format, and Apache DataFusion, an embeddable query engine for Arrow, within Postgres.
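
To make the columnar model concrete, here is a minimal sketch of DataFusion running SQL over an in-memory Arrow batch. It assumes the datafusion and tokio crates; the table name and values are illustrative, and this is not code from pg_analytics itself.

use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Arrow stores each column as one contiguous array, which is what
    // makes scanning a few columns of a wide table cheap.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    // Expose the batch to DataFusion as a table and run SQL against it.
    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(MemTable::try_new(schema, vec![vec![batch]])?))?;
    ctx.sql("SELECT COUNT(*) FROM t").await?.show().await?;
    Ok(())
}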

The extension uses two features of the Postgres API — executor hooks and the table access method. The table access method is responsible for creating a new kind of table that can emit Arrow data batches and receive analytical queries. Executor hooks intercept and reroute queries to DataFusion, which creates an optimal query plan, executes it, and returns the results to Postgres.
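
As a rough sketch of that control flow, consider the following. Every type and helper here is a hypothetical stand-in rather than a real pg_analytics or pgrx API; it only illustrates the reroute decision the hook makes.

// All names below are hypothetical stand-ins, not real pg_analytics
// or Postgres APIs; they illustrate the shape of the reroute logic.
struct QueryDesc {
    touches_deltalake_table: bool,
}
struct ArrowBatches;

fn plan_and_execute_with_datafusion(_query: &QueryDesc) -> ArrowBatches {
    ArrowBatches // stand-in: DataFusion plans, optimizes, and executes
}
fn emit_as_postgres_tuples(_batches: ArrowBatches) {} // stand-in
fn standard_executor_run(_query: &QueryDesc) {} // stand-in for the built-in executor

// An ExecutorRun-style hook runs before the built-in executor.
fn executor_run_hook(query: &QueryDesc) {
    if query.touches_deltalake_table {
        // Reroute: let DataFusion plan and execute the query, then
        // hand the Arrow results back to Postgres.
        let batches = plan_and_execute_with_datafusion(query);
        emit_as_postgres_tuples(batches);
    } else {
        // Queries on ordinary heap tables take the normal path.
        standard_executor_run(query);
    }
}

fn main() {
    executor_run_hook(&QueryDesc { touches_deltalake_table: true });
}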

Data is persisted to disk with Parquet, a highly compressed file format for column-oriented data. Thanks to Parquet, ParadeDB achieves 5x better data compression than both regular Postgres and Elasticsearch.
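
As an illustration of that on-disk layout, here is a small sketch that writes an Arrow batch to a Parquet file using the arrow and parquet crates. The file path and the Snappy codec are illustrative choices, not necessarily what pg_analytics uses.

use std::fs::File;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single int32 column, held contiguously in memory by Arrow.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    // Parquet preserves the columnar layout on disk and compresses
    // each column independently, which is where the savings come from.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();
    let mut writer = ArrowWriter::try_new(File::create("t.parquet")?, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}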

The final dependency is delta-rs, a Rust-based implementation of Delta Lake. This library adds ACID transactions, updates and deletes, and file compaction to Parquet storage. It also supports querying over object stores like S3, which opens the future possibility of connecting Postgres tables to cloud data lakes.
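
For a feel of the delta-rs API, here is a minimal sketch that opens an existing Delta table and reads its version. It assumes the deltalake and tokio crates; the table path is a placeholder and could equally be an object-store URI.

#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    // The path is a placeholder; delta-rs can also open URIs that
    // point at object storage, e.g. an s3:// location.
    let table = deltalake::open_table("./data/events").await?;

    // Every committed transaction bumps the table version, which is
    // what layers ACID semantics on top of plain Parquet files.
    println!("current version: {}", table.version());
    Ok(())
}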

Why DataFusion

The history of analytical databases in Postgres extends back to 2005 with a project called Greenplum. Since then, several companies like Citus and Timescale have released similar products.

Nearly twenty years later, the performance gap between these databases and their non-Postgres OLAP counterparts remains wide. This is one reason that systems like Elasticsearch are popular even among companies that prefer Postgres.

Recently, embeddable query engines like DataFusion have changed the game by surpassing the query speed of many OLAP databases. DataFusion teases the idea of excellent analytical performance from any database — including Postgres.

Andy Pavlo, professor of databases at Carnegie Mellon, was right: we have reached a point where it no longer makes sense to build a query engine from scratch inside a database. Instead, the next generation of analytical databases should integrate existing, embeddable query engines [2] like DataFusion, so that the database keeps improving as the engine itself improves.

Getting Started

At the time of writing, pg_analytics is open source and in an MVP state. Almost all Postgres queries and basic operations like inserts and vacuums are supported. Our roadmap can be found in the project README.

The easiest way to try pg_analytics is by running the ParadeDB Docker image. Once connected, you can follow the toy example below.

CREATE EXTENSION pg_analytics;
-- Create a deltalake table
CREATE TABLE t (a int) USING deltalake;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a deltalake table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;

The core ParadeDB team is focused on making pg_analytics production-ready. We welcome community contributions and are active on Slack. Finally, please don’t hesitate to show your support by giving us a star!

Footnotes

  1. According to ClickBench, a benchmark for analytical databases.

  2. We also evaluated DuckDB, Polars, and Velox as candidates for an embedded query engine. DuckDB is a popular in-process OLAP database, Polars is a dataframe-processing library built on Arrow, and Velox is a query execution library built by Meta. DataFusion was chosen for three reasons. First, it interoperates with a storage framework like Delta Lake, which provides essential properties like ACID transactions. Second, its API is designed to be embedded inside another database, whereas Polars and DuckDB are standalone tools. Finally, it is written in Rust and ships with a query parser and optimizer, unlike Velox.

