Which embedded OLAP database to pick?

Source: https://www.zaynetro.com/post/2023-playing-with-olap/

Intro

I am building a project that handles visitor events from browsing a website, and I need an analytical dashboard for reporting. There has been a lot of talk lately that you might not need any cloud solutions and could get by with an embedded database on a single instance. Is that true?

In this post I will explore how SQLite compares to DuckDB, Polars and Apache DataFusion.

Note: This is not a benchmark! This is me exploring different solutions using their default interfaces.

If you wish to see the results you can jump directly to the Results section.

For my test I am going to generate random events that happen during a visitor session. Let's say a visitor goes to a website and interacts with it. Then they might open a few more pages and interact with them or not. For the purposes of this experiment there are only three distinct events that might happen:

  • Page load: { "path": "/", "user_agent": "firefox" }
  • Chat message: { "text": "Hello!" }
  • Form submission:
    • { "form_type": "feedback", "fields": [{ "name": "score", "value": "80" }] }
    • OR { "form_type": "contact-us", "fields": [{ "name": "email", "value": "a@b" }] }

Every event has a unique id, a session id and a page id. Page id and session id are used to group events from the same page and from the same session, respectively.

Let's define the event schema for SQLite:

CREATE TABLE events (
  id TEXT,
  session_id TEXT,
  page_id TEXT,
  timestamp TEXT,
  event_type TEXT,
  payload TEXT
);

Payload is a JSON string.
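
For illustration, here is a minimal sketch of inserting one event from Rust with the rusqlite crate (my example, not from the post; the values are made up):

use rusqlite::{params, Connection};

fn insert_event(conn: &Connection) -> rusqlite::Result<()> {
    // The payload is serialized to a JSON string before insertion.
    conn.execute(
        "INSERT INTO events (id, session_id, page_id, timestamp, event_type, payload)
         VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
        params![
            "id-1",
            "s-1",
            "p-1",
            "2023-04-18 12:00:00",
            "page_load",
            r#"{"path": "/", "user_agent": "firefox"}"#,
        ],
    )?;
    Ok(())
}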

SQLite, MySQL and co. store events in row-oriented storage:

id   | session_id | page_id | timestamp           | event_type   | payload
id-1 | s-1        | p-1     | 2023-04-18 12:00:00 | page_load    | { "path": "/", ...
id-2 | s-1        | p-1     | 2023-04-18 12:01:00 | chat_message | { "text": "Hello!" }
...  | ...        | ...     | ...                 | ...          | ...

This format works great for inserts and fetching individual rows.

Online analytical processing (OLAP) needs are different though. For OLAP you need to scan large amounts of data and return aggregated results over individual fields. Column-oriented storage works much better for this.

What is column-oriented storage? The example above would be stored roughly as a series of values per field:

id         | id-1                | id-2                 | ...
session_id | s-1                 | s-1                  | ...
page_id    | p-1                 | p-1                  | ...
timestamp  | 2023-04-18 12:00:00 | 2023-04-18 12:01:00  | ...
event_type | page_load           | chat_message         | ...
payload    | { "path": "/", ...  | { "text": "Hello!" } | ...

This format allows for storage savings and query execution optimizations like:

  1. Timestamp values can be stored as a delta from the previous value instead of a complete date (see the sketch after this list).
  2. Queries that select a single field need to read only a single series of values (one row in the layout above).
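
To make the delta idea concrete, here is a small illustrative Rust sketch (mine, not from the post) that encodes sorted Unix timestamps as deltas:

/// Delta-encode sorted Unix timestamps: store the first value
/// and then only the difference from the previous one.
fn delta_encode(timestamps: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(timestamps.len());
    let mut prev = 0;
    for &t in timestamps {
        out.push(t - prev);
        prev = t;
    }
    out
}

fn main() {
    // Events seconds apart compress to tiny deltas.
    let ts = [1_681_819_200, 1_681_819_201, 1_681_819_260];
    assert_eq!(delta_encode(&ts), vec![1_681_819_200, 1, 59]);
    println!("deltas: {:?}", delta_encode(&ts));
}

Small deltas fit into far fewer bits than complete timestamps, which columnar formats exploit with encodings like bit-packing.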


So far we have been treating the payload as a JSON string. In fact, OLAP databases support structs and lists, so we can define a typed schema for our payload. Let's use DuckDB for the schema definition in SQL:

CREATE TABLE events (
  id VARCHAR,
  session_id VARCHAR,
  page_id VARCHAR,
  timestamp TIMESTAMP,
  event_type VARCHAR,
  payload STRUCT(
    path VARCHAR,
    user_agent VARCHAR,
    text VARCHAR,
    form_type VARCHAR,
    fields STRUCT(name VARCHAR, value VARCHAR)[]
  )
);
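
Insertion works like plain SQL too. Here is a hedged sketch of inserting one typed event through the duckdb crate (my example; I assume the struct literal's fields must match the declared schema, with NULL for the unused ones):

use duckdb::Connection;

fn insert_typed_event(conn: &Connection) -> duckdb::Result<()> {
    // The payload uses DuckDB's {'key': value} struct literal syntax;
    // 'fields' is a list of structs, unused payload fields are NULL.
    conn.execute_batch(
        "INSERT INTO events VALUES (
           'id-3', 's-1', 'p-2', TIMESTAMP '2023-04-18 12:02:00', 'form_submit',
           {'path': NULL, 'user_agent': NULL, 'text': NULL,
            'form_type': 'feedback',
            'fields': [{'name': 'score', 'value': '80'}]}
         );",
    )
}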

Query comparison

Querying DuckDB is exactly the same as querying SQLite. Let's try to find the most visited pages:

SELECT payload->>'$.path' AS path, COUNT(*) AS count
  FROM events
 WHERE
     event_type = 'page_load'
 GROUP BY path
 ORDER BY count DESC
 LIMIT 5;

This query is the same for SQLite and for DuckDB with the payload stored as JSON. For the typed payload field we only need to modify the field extraction in the SELECT clause:

-SELECT payload->>'$.path' AS path, COUNT(*) AS count
+SELECT payload.path AS path, COUNT(*) AS count
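
The host code barely changes either: the duckdb crate deliberately mirrors rusqlite's API. A sketch of running the "top pages" query from Rust (my code, assuming the mirrored APIs behave as documented):

use duckdb::Connection; // with SQLite, this import changes to rusqlite

fn top_pages(conn: &Connection) -> duckdb::Result<Vec<(String, i64)>> {
    let mut stmt = conn.prepare(
        "SELECT payload->>'$.path' AS path, COUNT(*) AS count
           FROM events
          WHERE event_type = 'page_load'
          GROUP BY path
          ORDER BY count DESC
          LIMIT 5;",
    )?;
    // Map each result row into a (path, visit count) tuple.
    let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?)))?;
    rows.collect()
}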

Now let's take a look at the "Average feedback score" query. This query requires extracting a struct field, taking the first list element and casting a string to a number.

SQLite is by far the simplest. It also does the casting for us:

SELECT AVG(payload->>'$.fields[0].value') AS average
  FROM events
 WHERE
     event_type = 'form_submit'
     AND payload->>'$.form_type' = 'feedback';

Typed DuckDB (note that DuckDB list indexing is 1-based, hence fields[1]):

SELECT AVG(TRY_CAST(payload.fields[1].value AS INTEGER)) AS average
  FROM events
 WHERE
     event_type = 'form_submit'
     AND payload.form_type = 'feedback';

Polars:

let pres = df
    .filter(
        // event_type = 'form_submit' AND payload.form_type = 'feedback'
        col("event_type").eq(lit("form_submit"))
        .and(col("payload")
                .struct_()
                .field_by_name("form_type")
                .eq(lit("feedback")),
        ),
    )
    .select([
        // Equivalent of '$.fields[0].value': the first list element's "value" field
        col("payload")
            .struct_().field_by_name("fields")
            .arr().first()
            .struct_().field_by_name("value")
            .cast(DataType::Int32)
            .alias("score"),
    ])
    .select([avg("score")])
    .collect()?;
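
DataFusion is not shown above, but it also speaks SQL through its SessionContext. A rough sketch of the "count by event_type" query (my code; the Parquet file name is illustrative and the async runtime is tokio):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register the events table from a Parquet file.
    ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default())
        .await?;
    // Same SQL shape as the SQLite/DuckDB queries above.
    let df = ctx
        .sql("SELECT event_type, COUNT(*) AS count FROM events GROUP BY event_type")
        .await?;
    df.show().await?;
    Ok(())
}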

Results

Queries were run on a MacBook Air M2 from an application written in Rust. The source code for the tests is available online.

2M events

My data generator inserted 2'274'231 event rows. Let's see file sizes:

File                    | Size
SQLite                  | 593M
DuckDB (JSON payload)   | 190M
DuckDB (Typed payload)  | 177M
Parquet (Typed payload) | 171M
CSV                     | 555M
JSON                    | 701M

You can see how column-oriented storage solutions decrease the file size. Parquet is ~70% smaller than SQLite!

Now to the queries:

Query                          | SQLite | DuckDB¹ | Polars | DataFusion
Count by event_type            | 700ms  | 10ms    | 25ms   | 80ms
Average page loads per session | 255ms  | 10ms    | 45ms   | 115ms
Average feedback score         | 255ms  | 35ms    | 160ms  | n/a²
Top pages                      | 370ms  | 30ms    | 190ms  | 460ms
Page loads per day             | 235ms  | 5ms     | 20ms   | 55ms
Form submissions               | 485ms  | 40ms    | 225ms  | 550ms
Form submissions by page       | 535ms  | 60ms    | 380ms  | 535ms

22M events

My data generator inserted 22'754'423 event rows. Let's see file sizes:

File                    | Size
SQLite                  | 5.8G
DuckDB (JSON payload)   | 1.7G
DuckDB (Typed payload)  | 1.7G
Parquet (Typed payload) | 1.7G

Now to the queries:

Query                          | SQLite  | DuckDB¹ (Typed)³ | Polars | DataFusion
Count by event_type            | 10350ms | 70ms             | 200ms  | 775ms
Average page loads per session | 5180ms  | 95ms             | 470ms  | 1170ms
Average feedback score         | 4845ms  | 500ms            | 2185ms | n/a²
Top pages                      | 7820ms  | 455ms (350ms)    | 2070ms | 4655ms
Page loads per day             | 4890ms  | 70ms             | 165ms  | 520ms
Form submissions               | 8020ms  | 345ms (355ms)    | 2972ms | 5830ms
Form submissions by page       | 12340ms | 620ms (650ms)    | 6980ms | 9380ms
¹ DuckDB results with the payload as a JSON string.

² DataFusion doesn't fully support nested structs: https://github.com/apache/arrow-datafusion/issues/2179

³ In parentheses you can see DuckDB results with the typed payload.

Notes about the results

  • I used Polars and DataFusion as crates. At least for Polars there are additional crate features that could be enabled to improve performance, which I haven't tried.
  • I have included DuckDB results with the payload as a JSON string. Typed DuckDB is consistently 1-2ms faster on the smaller dataset. On the larger dataset typed DuckDB was slower on some queries.
  • Polars supports SQL queries as well; I just wanted to play around with the DataFrame API in Rust.
  • Looking at the H2O.ai benchmark, Polars should be about as fast as DuckDB. Maybe I am doing something wrong.
  • SQL is a very portable interface. SQLite, DuckDB and DataFusion queries are almost the same.

Feelings

Overall, it was a nice experiment. I really liked using DuckDB and Polars. Polars feels targeted more towards Python users, as the Rust documentation is a bit lacking. DuckDB uses 1-based indexing on arrays; it took me some time to realize that :/

Honestly, I am impressed with how well DuckDB performed, especially with the payload as a JSON string. I really like the flexibility of just using JSON and not worrying about the schema.

All projects that I played with are built around Apache Parquet and Apache Arrow. Andy Pavlo described the future of OLAP databases really well:

The long-term trend to watch is the proliferation of frameworks like Velox, DataFusion, and Polars. Along with projects like Substrait, the commoditization of these query execution components means that all OLAP DBMSs will be roughly equivalent in the next five years. Instead of building a new DBMS entirely from scratch or hard forking an existing system (e.g., how Firebolt forked Clickhouse), people are better off using an extensible framework like Velox. This means that every DBMS will have the same vectorized execution capabilities that were unique to Snowflake ten years ago. And since in the cloud, the storage layer is the same for everyone (e.g., Amazon controls EBS/S3), the critical differentiator between DBMS offerings will be things that are difficult to quantify, like UI/UX stuff and query optimization.
Ref: https://ottertune.com/blog/2022-databases-retrospective/

If you are not sold on the idea of storing Apache Parquet files on S3 and using an embedded DB engine to query them, then you might try ClickHouse.

Open questions

  • I have yet to find a good way to write Parquet files. Using the parquet crate directly feels a bit complicated (I just can't comprehend the docs). Both Polars and DuckDB can write Parquet files. Should I stick to the approach I used for this experiment, writing to DuckDB and exporting from it (see the sketch after this list), or is there a better solution?
  • DuckDB and Polars can read Apache Parquet files directly from S3. DuckDB also claims to download only the file parts that are necessary for the query. I haven't tried the S3 integration but that is definitely something I would like to test.
  • Can Polars run faster?
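
For reference, the "export from DuckDB" approach boils down to a single COPY statement. A minimal sketch through the duckdb crate (my code; the output path is illustrative):

use duckdb::Connection;

fn export_parquet(conn: &Connection) -> duckdb::Result<()> {
    // COPY ... TO writes the whole table out as a Parquet file.
    conn.execute_batch("COPY events TO 'events.parquet' (FORMAT PARQUET);")
    // Reading it back (locally, or from S3 with the httpfs extension) goes
    // through read_parquet: SELECT * FROM read_parquet('events.parquet');
}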

If you have any tips or answers, please, do let me know!

