source link: https://www.infoq.com/news/2023/01/druid-analytics-database/?itm_source=infoq&itm_medium=popular_widget&itm_campaign=popular_content_list&itm_content=
Apache Druid 25.0 Delivers Multi-Stage Query Engine and Kubernetes Task Management
Jan 19, 2023 2 min read
Apache Druid is a high-performance real-time datastore, and its latest release, version 25.0, brings several enhancements. The main new features are: the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready; Kubernetes can be used to launch and manage tasks, eliminating the need for middle managers; deployment is simplified; and there is a new dedicated binary for Hadoop 3.x users.
In order to produce real-time analytics and reduce time to insight for a variety of use cases, Druid's design incorporates concepts from data warehouses, time-series databases, and search systems.
It has a microservice-based distributed architecture that is designed to be cloud-ready and comprises several types of services, including the Coordinator service, which manages data availability on the cluster; the Overlord service, which controls the assignment of data ingestion workloads; the Broker service, which handles queries from external clients; and the MiddleManager services, which ingest data.
The image below shows the architecture of Apache Druid:
![1druid-architecture-1674040443713.png](https://www.infoq.com/news/2023/01/druid-analytics-database/news/2023/01/druid-analytics-database/en/resources/1druid-architecture-1674040443713.png)
During the ingestion phase, Druid reads the data from the source system and stores it in data files called segments. In general, segment files contain a few million rows each. Every segment file is partitioned by time and organized in a columnar format: each column is stored separately, so a query scans only the columns it actually needs, which decreases query latency.
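The idea behind time-partitioned, columnar segments can be illustrated with a toy model. This is only a sketch of the concept, not Druid's actual segment format; the function names, the per-day granularity, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def build_segments(rows, time_column="timestamp"):
    """Group rows into per-day 'segments', storing each column as its own list.

    Toy model only: real Druid segments add compression, dictionaries and indexes.
    """
    segments = defaultdict(lambda: defaultdict(list))
    for row in rows:
        day = datetime.fromisoformat(row[time_column]).date().isoformat()
        for name, value in row.items():
            segments[day][name].append(value)
    return segments


def sum_column(segments, column, start_day, end_day):
    """Sum one column, reading only segments whose day is in [start_day, end_day]."""
    total = 0
    for day, columns in segments.items():
        if start_day <= day <= end_day:  # ISO dates compare correctly as strings
            total += sum(columns[column])  # no other column is touched
    return total


# Hypothetical sample events.
rows = [
    {"timestamp": "2023-01-18T09:00:00", "page": "home", "clicks": 3},
    {"timestamp": "2023-01-18T17:30:00", "page": "docs", "clicks": 5},
    {"timestamp": "2023-01-19T08:15:00", "page": "home", "clicks": 2},
]
segments = build_segments(rows)
total = sum_column(segments, "clicks", "2023-01-19", "2023-01-19")
```

A query restricted to one day and one column skips the other day's segment entirely and never reads the `page` values, which is the effect the columnar, time-partitioned layout is meant to achieve.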
Druid supports both streaming and batch ingestion. It connects to a source of raw data, typically a message bus such as Apache Kafka (for streaming data loads) or a distributed file system such as HDFS or cloud-based storage like Amazon S3 and Azure Blob Storage (for batch data loads), and converts the raw data to a more read-optimized format (segments) in a process called "indexing". Apache Druid can ingest denormalized data in JSON, CSV, Parquet, Avro and other custom formats.
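As a rough illustration of SQL-based batch ingestion with the MSQ task engine, the sketch below assembles an `INSERT` statement that reads external JSON data through Druid's `EXTERN` table function. The datasource name, the URI and the column list are hypothetical, and the helper `build_ingest_statement` is not part of any Druid client; treat the statement as an assumption to check against the Druid documentation for your version.

```python
import json


def build_ingest_statement(datasource, uri):
    """Build an MSQ-style INSERT statement that reads JSON data over HTTP.

    The columns (timestamp, page, clicks) are hypothetical example fields.
    """
    input_source = json.dumps({"type": "http", "uris": [uri]})
    input_format = json.dumps({"type": "json"})
    signature = json.dumps([
        {"name": "timestamp", "type": "string"},
        {"name": "page", "type": "string"},
        {"name": "clicks", "type": "long"},
    ])
    return (
        f'INSERT INTO "{datasource}"\n'
        'SELECT TIME_PARSE("timestamp") AS __time, page, clicks\n'
        "FROM TABLE(EXTERN(\n"
        f"  '{input_source}',\n"
        f"  '{input_format}',\n"
        f"  '{signature}'\n"
        "))\n"
        "PARTITIONED BY DAY"
    )


# Hypothetical datasource name and source URI.
statement = build_ingest_statement("pageviews", "https://example.com/events.json")
payload = {"query": statement}  # JSON body for the SQL task API
```

In a running cluster, a payload like this would be submitted to the SQL task endpoint (`/druid/v2/sql/task`) or pasted into the query view of the web console. `PARTITIONED BY DAY` sets the time granularity of the resulting segments.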
It is possible to query data in Druid data sources using Druid SQL. Druid translates SQL queries into its native query language.
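Besides the console, Druid SQL is also available over HTTP by posting a JSON body to the `/druid/v2/sql` endpoint. The following sketch only builds such a request; the base URL (a local Router on port 8888), the `pageviews` datasource and the helper name `build_sql_request` are assumptions for illustration.

```python
import json
import urllib.request


def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build (but do not send) a POST request for Druid's SQL HTTP API."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource and columns.
request = build_sql_request(
    'SELECT page, SUM(clicks) AS clicks FROM "pageviews" GROUP BY page'
)

# To run the query against a live cluster (not executed here):
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Keeping request construction separate from sending makes the example checkable without a running cluster. Druid plans the SQL into a native query on the server side, so the client never needs to write native JSON queries itself.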
Druid comes with a web console that may be used to load data, manage data sources and tasks, and control server status and segment information. Additionally, you can execute SQL and native Druid queries in the console.
The image below shows the web console of Druid:
![1ui-annotated-1674040443713.png](https://www.infoq.com/news/2023/01/druid-analytics-database/news/2023/01/druid-analytics-database/en/resources/1ui-annotated-1674040443713.png)
Apache Druid is frequently employed in situations where real-time ingestion, fast query performance and high uptime are crucial.
It is commonly used as a backend for highly concurrent APIs that require quick aggregations, or to power the GUIs of analytical applications. Druid works best with event-oriented data.
Typical application areas are clickstream analytics (web and mobile analytics), risk and fraud analysis, network telemetry analytics (network performance monitoring), application performance metrics and business intelligence / OLAP.
It is used by many large companies, including Airbnb, British Telecom, Cisco, eBay, Expedia, Netflix and PayPal, and has more than 12k stars on GitHub.
About the Author
Andrea Messetti
Andrea is a software architect at DXC Technology. Previously he worked at HP. Andrea is currently focusing on Java, cloud-native applications and microservices. He is passionate about every aspect related to Computer Science (ML, Blockchain, edge computing).