
Datadog Creates Scalable Data Ingestion Architecture

Source: https://www.infoq.com/news/2023/06/datadog-husky-data-ingestion/


Jun 16, 2023 2 min read

Datadog created a dedicated data ingestion architecture offering exactly-once semantics for their third-generation event store, Husky. The event-driven architecture (EDA) can accommodate bursts in traffic in the multi-tenant platform with reasonable ingestion latency and acceptable operational costs.

Datadog launched Husky in 2022, drawing on its experience running two previous architectures, which the company had outgrown as more clients joined the platform and new products introduced specific data storage and query requirements.

Husky's architecture separates the data ingestion, data compaction, and data reading workloads, allowing them to be scaled independently. All three workloads leverage a shared metadata store built on FoundationDB and a blob storage service that uses AWS S3. The data ingestion workload uses Apache Kafka to deliver events into the storage platform and route them internally to data writers.

[Figure: Husky's high-level architecture. Source: https://www.datadoghq.com/blog/engineering/introducing-husky/]
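The separation can be pictured as three components that communicate only through the shared stores. Below is a minimal, illustrative sketch; all class names are hypothetical, and plain dictionaries stand in for FoundationDB and S3.

```python
# Illustrative sketch only: hypothetical names, with in-memory dicts
# standing in for the FoundationDB metadata store and S3 blob storage.
from dataclasses import dataclass, field


@dataclass
class SharedStores:
    metadata: dict = field(default_factory=dict)  # stand-in for FoundationDB
    blobs: dict = field(default_factory=dict)     # stand-in for AWS S3


class IngestionWorker:
    """Writes incoming events as blobs and records their metadata."""

    def __init__(self, stores: SharedStores):
        self.stores = stores

    def write(self, fragment_id: str, events: list[dict]) -> None:
        self.stores.blobs[fragment_id] = events
        self.stores.metadata[fragment_id] = {"event_count": len(events)}


class Compactor:
    """Merges small fragments into larger ones; scaled independently."""


class Reader:
    """Serves queries by resolving metadata, then fetching blobs."""

    def __init__(self, stores: SharedStores):
        self.stores = stores

    def read(self, fragment_id: str) -> list[dict]:
        if fragment_id not in self.stores.metadata:
            raise KeyError(fragment_id)
        return self.stores.blobs[fragment_id]
```

Because the components share no local state, each workload can be scaled up or down without coordinating with the others.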

Daniel Intskirveli, a senior software engineer at Datadog, explains the unique challenges of building an efficient data ingestion solution:

While it [Husky] can perform point lookups and run needle-in-the-haystack search queries, it’s not designed to perform point lookups at high volume and with low latency. This design posed a challenge on the ingestion side: how can we guarantee that data is ingested into Husky exactly once, ensuring that there are never duplicate events?

Exactly-once ingestion semantics are crucial for Datadog, as duplicate events can cause false positives or false negatives in monitor evaluations and would skew the usage reporting that drives customer billing.

The solution is an internal routing mechanism that deterministically splits each tenant's incoming stream of events into multiple shards. Downstream write workers (or writers) then ingest events from the tenant shards into the storage engine and perform in-memory event deduplication. This approach simplifies deduplication because all of a tenant's data within a shard is co-located, which improves performance. Since tenant data is isolated at the storage level (stored in separate files), the routing mechanism also limits the number of tenants per shard to keep storage costs down.

[Figure: Husky's ingestion pipeline. Source: https://www.datadoghq.com/blog/engineering/husky-deep-dive/]
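Datadog has not published the routing function itself, but a deterministic tenant-aware split can be sketched as a stable hash over the tenant and event key, taken modulo a per-tenant shard count. The function name, hash choice, and shard limit below are illustrative assumptions.

```python
# Hypothetical sketch of deterministic shard routing; SHA-256 and the
# per-tenant shard count are assumptions, not Datadog's actual choices.
import hashlib

SHARDS_PER_TENANT = 4  # illustrative cap on shards (and files) per tenant


def route(tenant_id: str, event_key: str) -> str:
    """Map an event deterministically to one of the tenant's shards."""
    digest = hashlib.sha256(f"{tenant_id}:{event_key}".encode()).digest()
    shard = int.from_bytes(digest[:8], "big") % SHARDS_PER_TENANT
    return f"{tenant_id}-shard-{shard}"


# Duplicates of the same event always land on the same shard, so a single
# downstream writer sees them all and can deduplicate locally.
assert route("tenant-42", "event-abc") == route("tenant-42", "event-abc")
```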

Write workers (or writers) consume events from their assigned shards and persist them to make them queryable. Based on previous experience, the team opted for a stateless writer design to enable auto-scaling and load balancing. To support event deduplication, stateless writers must save previously processed event IDs in a persistent datastore. Event IDs are inserted into dedicated tables in FoundationDB and committed in a single transaction together with event metadata, ensuring atomicity and consistency. Additionally, writers cache event IDs in memory using an LRU (least recently used) cache.
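A minimal sketch of this flow follows, assuming a plain set stands in for the FoundationDB event-ID table and the atomic transaction is reduced to sequential in-memory writes; all names are hypothetical.

```python
# Hypothetical sketch of writer-side deduplication; a set stands in for the
# FoundationDB event-ID table, and the single atomic transaction is
# simplified to two in-memory writes.
from collections import OrderedDict


class LruCache:
    """Bounded in-memory cache of recently seen event IDs."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[str, None] = OrderedDict()

    def __contains__(self, event_id: str) -> bool:
        if event_id in self.entries:
            self.entries.move_to_end(event_id)  # refresh recency on hit
            return True
        return False

    def add(self, event_id: str) -> None:
        self.entries[event_id] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class DedupWriter:
    def __init__(self):
        self.cache = LruCache(capacity=100_000)
        self.event_id_table: set[str] = set()  # stand-in for FoundationDB
        self.committed: list[dict] = []        # stand-in for event metadata

    def ingest(self, event_id: str, event: dict) -> bool:
        # Check the cache first, then the persistent table, before writing.
        if event_id in self.cache or event_id in self.event_id_table:
            return False  # duplicate, drop it
        # In the real system, both writes form one FoundationDB transaction.
        self.event_id_table.add(event_id)
        self.committed.append({"id": event_id, **event})
        self.cache.add(event_id)
        return True
```

Because the event-ID write and the metadata write commit together, a crashed writer can never leave an event marked as seen without its metadata, or vice versa.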

The design supports conflict detection and resolution when a shard is reassigned to another worker during a scale-up event, a redeployment, or a restart caused by an infrastructure issue. Using optimistic concurrency control, updates to the event-ID tables are versioned, and any out-of-order updates are rejected. When a worker detects a conflict, it refreshes its event-ID cache from the FoundationDB table and resets its offset in the Kafka topic.
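The version check can be sketched as follows; the table shape, names, and recovery steps are inferred from the description above rather than taken from Datadog's implementation.

```python
# Hypothetical sketch of optimistic concurrency control on a shard's
# event-ID table, plus the conflict recovery described above.
class ConflictError(Exception):
    pass


class ShardTable:
    """Versioned event-ID table for one shard (stand-in for FoundationDB)."""

    def __init__(self):
        self.version = 0
        self.event_ids: set[str] = set()

    def commit(self, expected_version: int, new_ids: set[str]) -> int:
        # Reject out-of-order updates: another writer advanced the version.
        if self.version != expected_version:
            raise ConflictError(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.event_ids |= new_ids
        self.version += 1
        return self.version


class WriterState:
    def __init__(self):
        self.cache: set[str] = set()
        self.committed_offset = 0  # last offset whose events were committed
        self.consumer_offset = 0   # current read position in the Kafka topic


def recover(writer: WriterState, table: ShardTable) -> None:
    """On conflict: reload the cache and rewind the Kafka consumer."""
    writer.cache = set(table.event_ids)               # refresh event-ID cache
    writer.consumer_offset = writer.committed_offset  # reset topic offset
```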

About the Author

Rafal Gancarz

Rafal is an experienced technology leader and expert. He's currently helping Starbucks make its Commerce Platform scalable, resilient and cost-effective. Previously, Rafal was involved in designing and building large-scale, distributed and cloud-based systems for Cisco, Accenture, Capita, ICE, Callsign and others. His interests span architecture & design, continuous delivery, observability and operability, as well as sociotechnical and organisational aspects of software delivery.
