Source: https://www.infoq.com/news/2023/01/druid-analytics-database/

Apache Druid 25.0 Delivers Multi-Stage Query Engine and Kubernetes Task Management

Jan 19, 2023 2 min read

Apache Druid is a high-performance real-time datastore, and its latest release, version 25.0, provides many improvements and enhancements. The main new features are: the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production-ready; Kubernetes can be used to launch and manage tasks, eliminating the need for MiddleManagers; simplified deployment; and a new dedicated binary for Hadoop 3.x users.
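As a sketch of what SQL-based ingestion with the MSQ task engine looks like, the snippet below builds the JSON payload for an `INSERT INTO ... SELECT ... FROM TABLE(EXTERN(...))` statement, the form Druid's SQL-based ingestion uses. The datasource name, source URI, and column names are illustrative assumptions, not taken from the article.

```python
import json

def build_msq_ingest_query(datasource: str, source_uri: str) -> dict:
    """Build the JSON body for a SQL-based (MSQ) ingestion task.

    EXTERN takes the input source, input format, and row signature as
    JSON string arguments; the columns here are placeholders.
    """
    sql = (
        f'INSERT INTO "{datasource}"\n'
        'SELECT TIME_PARSE("timestamp") AS __time, "page", "user"\n'
        "FROM TABLE(EXTERN(\n"
        f"  '{json.dumps({'type': 'http', 'uris': [source_uri]})}',\n"
        "  '{\"type\": \"json\"}',\n"
        "  '[{\"name\": \"timestamp\", \"type\": \"string\"},"
        " {\"name\": \"page\", \"type\": \"string\"},"
        " {\"name\": \"user\", \"type\": \"string\"}]'\n"
        "))\n"
        "PARTITIONED BY DAY"
    )
    return {"query": sql}

# Example: ingest a (hypothetical) remote JSON file into a "wikipedia" datasource.
payload = build_msq_ingest_query("wikipedia", "https://example.com/wiki.json.gz")
print(payload["query"])
```

In a real cluster this payload would be POSTed to the router or broker's SQL task endpoint; here only the statement construction is shown.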

In order to produce real-time analytics and reduce time to insight for a variety of use cases, Druid's design incorporates concepts from data warehouses, time-series databases, and search systems.

It has a microservice-based distributed architecture that is designed to be cloud-ready and comprises several types of services: a Coordinator service that manages data availability on the cluster, an Overlord service that controls the assignment of data-ingestion workloads, a Broker service that handles queries from external clients, and MiddleManager services that ingest data.

The image below shows the architecture of Apache Druid:

[Image: Apache Druid architecture diagram]

During the ingestion phase, Druid reads the data from the source system and stores it in data files called segments. In general, segment files contain a few million rows each. Segments are partitioned by time and organized in a columnar layout, with each column stored separately, so a query scans only the columns it actually needs, decreasing query latency.

Druid supports both streaming and batch ingestion. It connects to a source of raw data, typically a message bus such as Apache Kafka (for streaming data loads), or a distributed file system such as HDFS or cloud-based storage like Amazon S3 and Azure Blob Storage (for batch data loads), and can convert raw data to a more read-optimized format (segments) in a process called "indexing". Apache Druid can ingest denormalized data in JSON, CSV, Parquet, Avro, and other custom formats.
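For the streaming case, Kafka ingestion is configured through a supervisor spec submitted to the Overlord. The sketch below builds a minimal such spec; the datasource, topic, broker address, and column names are all placeholder assumptions for illustration.

```python
import json

def build_kafka_supervisor_spec(datasource: str, topic: str,
                                bootstrap_servers: str) -> dict:
    """Build a minimal Kafka ingestion supervisor spec.

    In a real deployment this JSON would be POSTed to the Overlord's
    supervisor endpoint; columns and granularities are placeholders.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["page", "user"]},
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            },
        },
    }

# Example: continuously ingest a (hypothetical) "clickstream" topic.
spec = build_kafka_supervisor_spec("events", "clickstream", "kafka:9092")
print(json.dumps(spec, indent=2))
```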

It is possible to query data in Druid data sources using Druid SQL. Druid translates SQL queries into its native query language.
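Druid SQL queries are submitted over HTTP as a small JSON body. The sketch below, using only the standard library, shows one way this could look; the broker address and the example datasource are assumptions, and actually running the query requires a live cluster.

```python
import json
from urllib import request

BROKER_URL = "http://localhost:8082/druid/v2/sql"  # assumed broker address

def sql_query_payload(sql: str) -> bytes:
    """Encode a Druid SQL statement as the JSON body the SQL API expects."""
    return json.dumps({"query": sql}).encode("utf-8")

def run_query(sql: str) -> list:
    """POST the SQL statement to the broker; Druid answers with JSON rows."""
    req = request.Request(
        BROKER_URL,
        data=sql_query_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example usage (requires a running Druid cluster, so not executed here):
#   rows = run_query(
#       'SELECT page, COUNT(*) AS views FROM wikipedia GROUP BY page LIMIT 5'
#   )
```

Behind this endpoint, Druid translates the SQL statement into its native query language before execution.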

Druid comes with a web console that can be used to load data, manage data sources and tasks, and monitor server status and segment information. Additionally, you can execute SQL and native Druid queries in the console.

The image below shows the web console of Druid:

[Image: Apache Druid web console]

Apache Druid is frequently employed in situations where real-time ingestion, fast query performance, and high uptime are crucial.

As a result, Druid is commonly used as a backend for highly concurrent APIs that require quick aggregations or to power the GUIs of analytical apps. Druid works best with event-oriented data.

Typical application areas are clickstream analytics (web and mobile analytics), risk/fraud analysis, network telemetry analytics (network performance monitoring), application performance metrics, and business intelligence/OLAP.

It is used by many big players such as Airbnb, British Telecom, Cisco, eBay, Expedia, Netflix, and PayPal, and has more than 12k stars on GitHub.

About the Author

Andrea Messetti

Andrea is a software architect at DXC Technology. Previously he worked at HP. Andrea is currently focusing on Java, cloud-native applications, and microservices. He is passionate about every aspect of computer science, including ML, blockchain, and edge computing.
