
Datadog Creates Scalable Data Ingestion Architecture

Source: https://www.infoq.com/news/2023/06/datadog-husky-data-ingestion/


Jun 16, 2023 2 min read

Datadog created a dedicated data ingestion architecture offering exactly-once semantics for their third-generation event store, Husky. The event-driven architecture (EDA) can accommodate bursts in traffic in the multi-tenant platform with reasonable ingestion latency and acceptable operational costs.

Datadog launched Husky in 2022, drawing on its experience running two previous architectures, which the company had outgrown as more clients joined the platform and new products introduced specific data storage and query requirements.

Husky's architecture separates the data ingestion, data compaction, and data reading workloads, allowing them to be scaled independently. All three workloads leverage a shared metadata store built on FoundationDB and a blob storage service that uses AWS S3. The data ingestion workload uses Apache Kafka to deliver events into the storage platform and route them internally to data writers.

[Figure: Husky's high-level architecture. Source: https://www.datadoghq.com/blog/engineering/introducing-husky/]
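The separation can be pictured as three components that communicate only through the shared stores. Below is a minimal, illustrative sketch; all class names are hypothetical, and plain dictionaries stand in for FoundationDB and S3.

```python
# Illustrative sketch only: hypothetical names, with in-memory dicts
# standing in for the FoundationDB metadata store and S3 blob storage.
from dataclasses import dataclass, field


@dataclass
class SharedStores:
    metadata: dict = field(default_factory=dict)  # stand-in for FoundationDB
    blobs: dict = field(default_factory=dict)     # stand-in for AWS S3


class IngestionWorker:
    """Writes incoming events as blobs and records their metadata."""

    def __init__(self, stores: SharedStores):
        self.stores = stores

    def write(self, fragment_id: str, events: list[dict]) -> None:
        self.stores.blobs[fragment_id] = events
        self.stores.metadata[fragment_id] = {"event_count": len(events)}


class Compactor:
    """Merges small fragments into larger ones; scaled independently."""


class Reader:
    """Serves queries by resolving metadata, then fetching blobs."""

    def __init__(self, stores: SharedStores):
        self.stores = stores

    def read(self, fragment_id: str) -> list[dict]:
        if fragment_id not in self.stores.metadata:
            raise KeyError(fragment_id)
        return self.stores.blobs[fragment_id]
```

Because the components share no local state, each workload can be scaled up or down without coordinating with the others.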

Daniel Intskirveli, a senior software engineer at Datadog, explains the unique challenges of building an efficient data ingestion solution:

While it [Husky] can perform point lookups and run needle-in-the-haystack search queries, it’s not designed to perform point lookups at high volume and with low latency. This design posed a challenge on the ingestion side: how can we guarantee that data is ingested into Husky exactly once, ensuring that there are never duplicate events?

Exactly-once ingestion semantics are crucial for Datadog, as duplicate events can cause false positives or false negatives in monitor evaluations and would skew the usage reporting that drives customer billing.

The solution is an internal routing mechanism that deterministically splits each tenant's incoming stream of events into multiple shards. Downstream write workers (or writers) then ingest events from the tenant shards into the storage engine and perform in-memory event deduplication. This approach simplifies deduplication because all of a tenant's data within a shard is co-located, which improves performance. Since tenant data is isolated at the storage level (stored in separate files), the routing mechanism also limits the number of tenants per shard to keep storage costs down.

[Figure: Husky's ingestion pipeline. Source: https://www.datadoghq.com/blog/engineering/husky-deep-dive/]
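Datadog has not published the routing function itself, but a deterministic tenant-aware split can be sketched as a stable hash over the tenant and event key, taken modulo a per-tenant shard count. The function name, hash choice, and shard limit below are illustrative assumptions.

```python
# Hypothetical sketch of deterministic shard routing; SHA-256 and the
# per-tenant shard count are assumptions, not Datadog's actual choices.
import hashlib

SHARDS_PER_TENANT = 4  # illustrative cap on shards (and files) per tenant


def route(tenant_id: str, event_key: str) -> str:
    """Map an event deterministically to one of the tenant's shards."""
    digest = hashlib.sha256(f"{tenant_id}:{event_key}".encode()).digest()
    shard = int.from_bytes(digest[:8], "big") % SHARDS_PER_TENANT
    return f"{tenant_id}-shard-{shard}"


# Duplicates of the same event always land on the same shard, so a single
# downstream writer sees them all and can deduplicate locally.
assert route("tenant-42", "event-abc") == route("tenant-42", "event-abc")
```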

Write workers (or writers) consume events from their assigned shards and persist them to make them queryable. Based on previous experience, the team opted for a stateless writer design to enable auto-scaling and load balancing. To support event deduplication, stateless writers must save previously processed event IDs in a persistent datastore. Event IDs are inserted into dedicated tables in FoundationDB and committed in a single transaction together with event metadata, ensuring atomicity and consistency. Additionally, writers cache event IDs in memory using an LRU (least recently used) cache.
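A minimal sketch of this flow follows, assuming a plain set stands in for the FoundationDB event-ID table and the atomic transaction is reduced to sequential in-memory writes; all names are hypothetical.

```python
# Hypothetical sketch of writer-side deduplication; a set stands in for the
# FoundationDB event-ID table, and the single atomic transaction is
# simplified to two in-memory writes.
from collections import OrderedDict


class LruCache:
    """Bounded in-memory cache of recently seen event IDs."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[str, None] = OrderedDict()

    def __contains__(self, event_id: str) -> bool:
        if event_id in self.entries:
            self.entries.move_to_end(event_id)  # refresh recency on hit
            return True
        return False

    def add(self, event_id: str) -> None:
        self.entries[event_id] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class DedupWriter:
    def __init__(self):
        self.cache = LruCache(capacity=100_000)
        self.event_id_table: set[str] = set()  # stand-in for FoundationDB
        self.committed: list[dict] = []        # stand-in for event metadata

    def ingest(self, event_id: str, event: dict) -> bool:
        # Check the cache first, then the persistent table, before writing.
        if event_id in self.cache or event_id in self.event_id_table:
            return False  # duplicate, drop it
        # In the real system, both writes form one FoundationDB transaction.
        self.event_id_table.add(event_id)
        self.committed.append({"id": event_id, **event})
        self.cache.add(event_id)
        return True
```

Because the event-ID write and the metadata write commit together, a crashed writer can never leave an event marked as seen without its metadata, or vice versa.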

The design supports conflict detection and resolution when a shard is reassigned to another worker during a scale-up event, a redeployment, or a restart caused by an infrastructure issue. Using optimistic concurrency control, updates to the event-ID tables are versioned, and any out-of-order updates are rejected. When a worker detects a conflict, it refreshes its event-ID cache from the FoundationDB table and resets its offset in the Kafka topic.
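The version check can be sketched as follows; the table shape, names, and recovery steps are inferred from the description above rather than taken from Datadog's implementation.

```python
# Hypothetical sketch of optimistic concurrency control on a shard's
# event-ID table, plus the conflict recovery described above.
class ConflictError(Exception):
    pass


class ShardTable:
    """Versioned event-ID table for one shard (stand-in for FoundationDB)."""

    def __init__(self):
        self.version = 0
        self.event_ids: set[str] = set()

    def commit(self, expected_version: int, new_ids: set[str]) -> int:
        # Reject out-of-order updates: another writer advanced the version.
        if self.version != expected_version:
            raise ConflictError(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.event_ids |= new_ids
        self.version += 1
        return self.version


class WriterState:
    def __init__(self):
        self.cache: set[str] = set()
        self.committed_offset = 0  # last offset whose events were committed
        self.consumer_offset = 0   # current read position in the Kafka topic


def recover(writer: WriterState, table: ShardTable) -> None:
    """On conflict: reload the cache and rewind the Kafka consumer."""
    writer.cache = set(table.event_ids)               # refresh event-ID cache
    writer.consumer_offset = writer.committed_offset  # reset topic offset
```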

About the Author

Rafal Gancarz

Rafal is an experienced technology leader and expert. He's currently helping Starbucks make its Commerce Platform scalable, resilient and cost-effective. Previously, Rafal was involved in designing and building large-scale, distributed and cloud-based systems for Cisco, Accenture, Capita, ICE, Callsign and others. His interests span architecture & design, continuous delivery, observability and operability, as well as sociotechnical and organisational aspects of software delivery.
