Source: https://www.infoq.com/news/2023/01/druid-analytics-database/

Apache Druid 25.0 Delivers Multi-Stage Query Engine and Kubernetes Task Management

Jan 19, 2023 2 min read

Apache Druid is a high-performance real-time datastore, and its latest release, version 25.0, provides many improvements and enhancements. The main new features are: the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production-ready; Kubernetes can be used to launch and manage tasks, eliminating the need for MiddleManagers; simplified deployment; and a new dedicated binary for Hadoop 3.x users.
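As a sketch of what SQL-based ingestion with the MSQ task engine looks like, the snippet below builds the JSON payload for an `INSERT INTO ... SELECT ... FROM TABLE(EXTERN(...))` statement, the form Druid's SQL-based ingestion uses. The datasource name, source URI, and column names are illustrative assumptions, not taken from the article.

```python
import json

def build_msq_ingest_query(datasource: str, source_uri: str) -> dict:
    """Build the JSON body for a SQL-based (MSQ) ingestion task.

    EXTERN takes the input source, input format, and row signature as
    JSON string arguments; the columns here are placeholders.
    """
    sql = (
        f'INSERT INTO "{datasource}"\n'
        'SELECT TIME_PARSE("timestamp") AS __time, "page", "user"\n'
        "FROM TABLE(EXTERN(\n"
        f"  '{json.dumps({'type': 'http', 'uris': [source_uri]})}',\n"
        "  '{\"type\": \"json\"}',\n"
        "  '[{\"name\": \"timestamp\", \"type\": \"string\"},"
        " {\"name\": \"page\", \"type\": \"string\"},"
        " {\"name\": \"user\", \"type\": \"string\"}]'\n"
        "))\n"
        "PARTITIONED BY DAY"
    )
    return {"query": sql}

# Example: ingest a (hypothetical) remote JSON file into a "wikipedia" datasource.
payload = build_msq_ingest_query("wikipedia", "https://example.com/wiki.json.gz")
print(payload["query"])
```

In a real cluster this payload would be POSTed to the router or broker's SQL task endpoint; here only the statement construction is shown.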

In order to produce real-time analytics and reduce time to insight for a variety of use cases, Druid's design incorporates concepts from data warehouses, time-series databases, and search systems.

It has a microservice-based distributed architecture that is designed to be cloud-ready and comprises several types of services: a Coordinator service that manages data availability on the cluster, an Overlord service that controls the assignment of data-ingestion workloads, a Broker service that handles queries from external clients, and MiddleManager services that ingest data.

The image below shows the architecture of Apache Druid:

[Image: Apache Druid architecture diagram]

During the ingestion phase, Druid reads the data from the source system and stores it in data files called segments. In general, segment files contain a few million rows each. Segments are partitioned by time and organized in a columnar layout, with each column stored separately, so a query scans only the columns it actually needs, decreasing query latency.

Druid supports both streaming and batch ingestion. It connects to a source of raw data, typically a message bus such as Apache Kafka (for streaming data loads), or a distributed file system such as HDFS or cloud-based storage like Amazon S3 and Azure Blob Storage (for batch data loads), and can convert raw data to a more read-optimized format (segments) in a process called "indexing". Apache Druid can ingest denormalized data in JSON, CSV, Parquet, Avro, and other custom formats.
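For the streaming case, Kafka ingestion is configured through a supervisor spec submitted to the Overlord. The sketch below builds a minimal such spec; the datasource, topic, broker address, and column names are all placeholder assumptions for illustration.

```python
import json

def build_kafka_supervisor_spec(datasource: str, topic: str,
                                bootstrap_servers: str) -> dict:
    """Build a minimal Kafka ingestion supervisor spec.

    In a real deployment this JSON would be POSTed to the Overlord's
    supervisor endpoint; columns and granularities are placeholders.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["page", "user"]},
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            },
        },
    }

# Example: continuously ingest a (hypothetical) "clickstream" topic.
spec = build_kafka_supervisor_spec("events", "clickstream", "kafka:9092")
print(json.dumps(spec, indent=2))
```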

It is possible to query data in Druid data sources using Druid SQL. Druid translates SQL queries into its native query language.
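Druid SQL queries are submitted over HTTP as a small JSON body. The sketch below, using only the standard library, shows one way this could look; the broker address and the example datasource are assumptions, and actually running the query requires a live cluster.

```python
import json
from urllib import request

BROKER_URL = "http://localhost:8082/druid/v2/sql"  # assumed broker address

def sql_query_payload(sql: str) -> bytes:
    """Encode a Druid SQL statement as the JSON body the SQL API expects."""
    return json.dumps({"query": sql}).encode("utf-8")

def run_query(sql: str) -> list:
    """POST the SQL statement to the broker; Druid answers with JSON rows."""
    req = request.Request(
        BROKER_URL,
        data=sql_query_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example usage (requires a running Druid cluster, so not executed here):
#   rows = run_query(
#       'SELECT page, COUNT(*) AS views FROM wikipedia GROUP BY page LIMIT 5'
#   )
```

Behind this endpoint, Druid translates the SQL statement into its native query language before execution.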

Druid comes with a web console that can be used to load data, manage data sources and tasks, and monitor server status and segment information. Additionally, you can execute SQL and native Druid queries in the console.

The image below shows the web console of Druid:

[Image: Apache Druid web console]

Apache Druid is frequently employed in situations where real-time ingestion, fast query performance, and high uptime are crucial.

As a result, Druid is commonly used as a backend for highly concurrent APIs that require quick aggregations or to power the GUIs of analytical apps. Druid works best with event-oriented data.

Typical application areas are clickstream analytics (web and mobile analytics), risk/fraud analysis, network telemetry analytics (network performance monitoring), application performance metrics, and business intelligence/OLAP.

It is used by many big players such as Airbnb, British Telecom, Cisco, eBay, Expedia, Netflix, and PayPal, and has more than 12k stars on GitHub.

About the Author

Andrea Messetti

Andrea is a software architect at DXC Technology. Previously he worked at HP. Andrea is currently focusing on Java, cloud-native applications, and microservices. He is passionate about every aspect of computer science, including ML, blockchain, and edge computing.
