
The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure

My name is Zhenzhong Xu. I joined Netflix in 2015 as a founding engineer on the Real-time Data Infrastructure team and later led the Stream Processing Engines team. I developed an interest in real-time data in the early 2010s and have believed ever since that there is much value yet to be uncovered.

Netflix was a fantastic place to be surrounded by many amazing colleagues. I couldn't be more proud of everyone involved in this journey to turn a shared belief into reality. I want to take a brief moment to share the team's major accomplishments:

  • We grew the streaming data use cases from 0 to over 2,000 across all organizations in Netflix.
  • We built and evolved successful products such as Keystone, the managed Flink platform, Mantis, and the managed Kafka platform. These products offer solutions across many aspects of the data ecosystem, including data ingestion, movement, analytical & operational processing, and machine learning use cases.
  • We were among the first in the industry to scale open-source Kafka & Flink deployments to handle one trillion events a day around 2017, and we later scaled that by another 20x by 2021!

A couple of months ago, I left Netflix to pursue a similar but expanded vision in the real-time ML space. I think it’s a perfect time to summarize my learnings building the real-time data infrastructure at Netflix. I hope this post will help platform engineers develop their cloud-native, self-serve streaming data platforms and scale use cases across many business functions (not necessarily from our success but maybe more from our failures). I also believe that having an understanding of how data platforms are built can help platform users (e.g., data scientists and ML engineers) make the most out of their platforms.

What’s In the Post

I will share the four phases of Real-time Data Infrastructure’s iterative journey in Netflix (2015–2021).

Phase 1: Rescue Netflix Logs From the Failing Batch Pipelines (2015)

During Netflix's global hyper-growth, business and operational decisions relied on fast logging data more than ever. In 2015, the Chukwa/Hadoop/Hive-based batch pipelines struggled to scale. In this phase, we built a streaming-first platform from the ground up to replace the failing pipelines.

Phase 2: Scale 100s Data Movement Use Cases (2016)

After the initial product release, the internal demand for data movement steadily rose. We had to focus on the common use cases. In this phase, we scaled to support 100s of use cases by building out a self-service, fully managed platform with a simple-yet-powerful building block design.

Phase 3: Support Custom Needs and Scale Beyond 1000s Use Cases (2017–2019)

As Stream Processing gained momentum in Netflix, many teams demanded lower latency and more processing flexibility in Data Engineering, Observability, and Machine Learning spaces. In this phase, we built a new Stream Processing development experience to enable custom use cases, and we also tackled the new engineering and operational challenges.

Phase 4: Expand Stream Processing Responsibilities — Challenges and Opportunities ahead (2020 — Present)

With the fast-paced evolution of Data Platform technology in the industry, many new challenges emerged: difficulties in coordination, steep learning curves, dividing stream-versus-batch boundaries, etc. In this phase, we explored Stream Processing playing a more prominent role in connecting technologies and in raising abstractions to make Data Platforms easier to use. There are many more opportunities ahead of us.

For each phase, I will go over the evolving business motivations, the team’s unique challenges, the strategy bets, and the use case patterns we discovered along the way.

Acknowledgment

Many people helped me review this post; without their feedback I would never have dug into many of the details objectively. Special thanks to Chip Huyen, Steven Wu, Prashanth Ramdas, Guanhua Jiang, Scott Shi, Goku Mohandas, David Sun, Astasia Myers, and Matt Willian!

The Four Phases of Real-time Data Infrastructure at Netflix

Phase 1: Rescue Netflix Logs From the Failing Batch Pipelines (2015)

Context

In 2015, Netflix already had about 60 million subscribers and was aggressively expanding its international presence. We all knew that quickly scaling the platform's leverage would be the key to sustaining skyrocketing subscriber growth.

Side note: For those unfamiliar, a platform team provides leverage by centrally managing foundational infrastructure so that product teams can focus on business logic.

Our team had to figure out how to help Netflix scale its logging practices. At that time, Netflix had ~500 microservices, generating more than 10 PB of data every day across the ecosystem.

Collecting this data serves two primary purposes for Netflix:

  1. Gain business analytics insights (e.g., user retention, average session length, what's trending, etc.).
  2. Gain operational insights (e.g., measuring streaming plays per second to quickly and easily understand the health of Netflix systems), so developers can set up alerts or perform mitigations.

You may ask why we needed to move the logs from the edge to the data warehouse in the first place. Due to the sheer volume, it was infeasible to do on-demand analytics on the online transactional databases at scale. The reason is that online transactional processing (OLTP) and online analytical processing (OLAP) systems are built with different assumptions: OLTP is built for row-oriented access patterns and OLAP for column-oriented access patterns. Under the hood, they use different data structures for optimization.

For example, let's say we want to know the average session length across hundreds of millions of Netflix users. Suppose we put this on-demand analytical query on a row-oriented OLTP system. It would result in full table scans at row-level granularity and could potentially lock up the database; applications might become unresponsive, resulting in unpleasant user experiences. This type of analytics/reporting workload is much better done on an OLAP system, hence the need to move the logs reliably and with low latency.
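To make the row-versus-column intuition concrete, here is a toy, self-contained Java sketch (the fields and data are invented for illustration, not Netflix's schema): computing the average session length over a row-oriented layout walks every full record, while a column-oriented layout only scans one contiguous array holding exactly the field the query needs.

```java
import java.util.List;
import java.util.stream.IntStream;

public class RowVsColumnDemo {
    // Row-oriented layout: every record stores all fields together (OLTP-style).
    record SessionRow(String userId, String deviceType, String country, long sessionLengthSec) {}

    public static void main(String[] args) {
        List<SessionRow> rows = List.of(
                new SessionRow("u1", "TV", "US", 3600),
                new SessionRow("u2", "Mobile", "BR", 1200),
                new SessionRow("u3", "Web", "JP", 2400));

        // Row-oriented aggregate: the scan must touch every full record just to read one field.
        double avgFromRows = rows.stream()
                .mapToLong(SessionRow::sessionLengthSec)
                .average()
                .orElse(0);

        // Column-oriented layout: the same field stored as one contiguous array (OLAP-style),
        // so the aggregate is a single sequential scan over exactly the bytes it needs.
        long[] sessionLengthColumn = {3600, 1200, 2400};
        double avgFromColumn = IntStream.range(0, sessionLengthColumn.length)
                .mapToLong(i -> sessionLengthColumn[i])
                .average()
                .orElse(0);

        System.out.printf("avg (row layout) = %.1f, avg (column layout) = %.1f%n",
                avgFromRows, avgFromColumn);
    }
}
```

At warehouse scale, the columnar layout also compresses far better and avoids touching unrelated columns entirely, which is why the analytical copy of the data lives in an OLAP store.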

(Figure: Move data from Edge to Data Warehouse)

By 2015, logging volume had increased to 500 billion events/day (about 1 PB of data ingested), up from 45 billion events/day in 2011. The existing logging infrastructure (a simple batch pipeline platform built with Chukwa, Hadoop, and Hive) was failing rapidly as subscriber numbers climbed alarmingly every week. We estimated we had about six months to evolve toward a streaming-first solution. The diagrams below show the evolution from the failing batch architecture to the new streaming-based architecture.

(Figure: the failing batch pipeline architecture, before migration)

We decided to replace this failing infrastructure with Keystone.

(Figure: the Keystone streaming architecture, after migration)

Another question you may have: why consider a streaming-first architecture at all? At the time, the value of getting the streaming architecture right outweighed the potential risks. Netflix is a data-driven company, and the immediate benefits of a streaming architecture were:

  • A shorter feedback loop for developers and operators. Developers rely heavily on logs to make operational decisions. Access to fresher, on-demand logs lets them detect issues earlier and therefore be more productive.
  • A better product experience. Many product features, such as personalized recommendations and search, benefit from fresher data to improve the user experience, leading to higher user retention and engagement.

Challenges

We had to think through many challenges in building a streaming platform.

Challenge 1: Large scale with limited resources. We had six months to build out Keystone to handle 500B events per day, and we had to pull it off with six team members.

Challenge 2: Immature streaming ecosystem. Developing and operating streaming-first infrastructure was difficult in 2015 when both transport (Apache Kafka) and processing technologies (Apache Samza, Apache Flink) were relatively nascent. Few technology companies had proven successful streaming-first deployments at the scale we needed, so we had to evaluate technology options and experiment. Given our limited resources, we couldn’t build everything on our own and had to decide what to build and what nascent tools to bet on.

Challenge 3: Analytical and operational concerns are different.

  • Analytical stream processing focuses on correctness and predictability. For example, moving all user clickstreams into the data warehouse requires data consistency (minimal duplicates or losses) and predictable latency, typically in the minutes range. (Keystone is good at this.)
  • Operational stream processing focuses more on cost-effectiveness, latency, and availability. For example, understanding the health of the entire Netflix device ecosystem benefits from sub-second to seconds-level latency, and the data can be sampled or profiled at the source to save cost. (Mantis is good at this.)

Challenge 4: Cloud-native resilience for a stateful data platform is hard. Netflix had already operated on the AWS cloud for a few years. However, we were the first to move a stateful data platform onto the containerized cloud infrastructure. Under the hood, hundreds of thousands of physical machines power the cloud in each data center. At this scale, hardware failures are inevitable, and when they arise unexpectedly it can be hard for systems to keep up with availability and consistency expectations. It's even more challenging in an unbounded, low-latency stream processing environment, where any failure recovery can result in backpressure buildup. Cloud-native resilience for a streaming-first architecture meant significant engineering challenges.

(Figure: How does stream processing help with operational and analytical data)

Stream Processing Patterns Summary in Phase 1

For each innovation phase, I include some observed use cases and their respective stream processing patterns so you can get a sense of the evolution over time.

| Pattern | Product | Example Use Cases |
|---|---|---|
| Data Routing | Keystone | Logging, Data Movement (MVP) |
| RT Alerts / Dashboard | Mantis | SPS Alert |

Strategy Bets

Bet 1: When building the MVP product, build it for the first few customers. When exploring the initial product-market fit, it's easy to get distracted. We decided to help only a few high-priority, high-volume internal customers and to worry about scaling the customer base later. This decision not only let us stay laser-focused on the core of the product; it also made us conscious of what not to invest in (e.g., during the MVP phase we used a spreadsheet, instead of a full-fledged control plane, to track customer names, email addresses, and per-pipeline metadata to facilitate customer deployments).

Bet 2: Co-evolve with technology partners, even if their technologies were not yet mature, instead of reinventing the wheel on our own. This way, we could grow the ecosystem together. We chose to partner early on.

  • Streaming partners: Externally, we worked with partners who were leading the stream processing efforts in the industry, e.g., LinkedIn (the Kafka and Samza teams), Confluent, and data Artisans (the builders behind Apache Flink, later renamed Ververica). Doing so allowed us to contribute to the open-source projects for our needs while leveraging the community's work.
  • Containerization partners: In 2015, it was still the early days of container virtualization technology, and we needed a strategy to deploy quickly. Internally, we partnered with the newly created Titus container infrastructure team. Titus was built on top of Apache Mesos and provided compute resource management, scheduling, and isolated deployment via an abstracted API. Titus later evolved to leverage Kubernetes in early 2020, and its team managed to migrate all workloads transparently. Thanks to this partnership, we didn't have to worry about the low-level compute infrastructure while building the data platform.

During the collaboration, we stayed in constant communication to share learnings and provide feedback. We had biweekly sync meetings with close partners to align on goals and discuss issues and solutions. When there were blocking issues, the right people would be involved right away.

Bet 3: Decouple concerns versus ignoring them.

  • Separated concerns between operational and analytics use cases. We evolved Mantis (operations-focused) and Keystone (analytics-focused) separately, but left room for the two systems to interface with each other.
(Figure: separation of concerns for different stream processing scenarios)
  • Separated concerns between producers and consumers. We introduced producer/consumer clients equipped with a standardized wire protocol and simple schema management to help decouple the development workflows of producers and consumers (see the sketch after this list). This later proved to be an essential aspect of data governance and data quality control.
  • Separated component responsibilities. We started with a microservice-oriented single responsibility principle, dividing the entire infrastructure into Messaging (Streaming Transport), Processing (Stream Processing), and Control Plane (the control plane was only a script in this phase, later evolved into a full-fledged system). The separation of component responsibilities enabled the team to align on interfaces early on while unlocking the team’s productivity by focusing on different parts concurrently.
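To illustrate the producer-side decoupling, here is a minimal sketch of a producer client that wraps every payload in a small, versioned envelope before writing it to Kafka. The envelope fields, topic name, and broker address are hypothetical; Keystone's actual clients and wire protocol were richer (e.g., schema registration, end-to-end metadata), but the core idea is the same: consumers only need to understand the envelope contract, not each producer's internal payload layout.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EnvelopeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Design for failure: acks=all and idempotence so transient broker issues don't lose or duplicate events.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A versioned envelope (illustrative JSON): consumers key off schemaId/schemaVersion
            // instead of needing to know how each producing service structures its payload.
            String envelope = "{\"schemaId\":\"playback-event\",\"schemaVersion\":2,"
                    + "\"producedAtMs\":" + System.currentTimeMillis() + ","
                    + "\"payload\":{\"profileId\":\"p-123\",\"titleId\":\"t-456\",\"event\":\"play_start\"}}";

            producer.send(new ProducerRecord<>("playback-events", "p-123", envelope),
                    (metadata, exception) -> {
                        if (exception != null) {
                            System.err.println("send failed: " + exception.getMessage());
                        }
                    });
            producer.flush();
        }
    }
}
```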

Bet 4: Invest in building a system that expects failures and monitors all operations versus delaying the investment. Because of the elastic, constantly changing, higher-failure-probability characteristics of the cloud, we needed to design the system to monitor, detect, and tolerate all sorts of failures: network blips, instance failures, zone failures, cluster failures, inter-service congestion & backpressure, regional disasters, and so on. We built the systems with the assumption that failures are constant. We embraced DevOps practices early on: designing for failure scenarios, rigorous automation, continuous deployment, shadow testing, automated monitoring, alerting, etc. This DevOps foundation gave us the engineering agility to ship multiple times a day.

(reference: What is Chaos Monkey in DevOps, on Quora)

Learnings

Having a psychologically safe environment to fail is essential for any team to lead changes.

We made many mistakes. We failed miserably on the product launch day and had a company-wide incident that caused massive data loss. After investigation, it turned out that despite correctly estimating the traffic growth, the monstrous Kafka cluster (with over 200 brokers) we built out was too big and eventually hit its scaling limit. When one broker died, the cluster couldn’t recover on its own due to the limitations in Kafka (at the time) regarding broker and partition numbers. It eventually degraded to the point of no recovery.

It was a terrifying experience to fail at this scale, but thanks to the psychologically safe team environment, we communicated transparently with all the stakeholders and turned the quick learnings into permanent solutions. For this particular case, we developed more granular cluster isolation capabilities (smaller Kafka clusters + isolated Zookeeper) to contain the blast radius.

There were many other failures. As we dug through the root causes, we realized that we would never be able to fully anticipate all edge scenarios, especially when building on a managed cloud where changes often happen outside of our control, such as instance terminations or tenant colocation. At the same time, our product was used by many teams and was too important to fail. It became an operational principle to always prepare our operations for the worst-case scenario.

After the incident, we adopted a weekly Kafka cluster failover drill. Every week the oncall person would simulate a Kafka cluster failure. The team would ensure the failover automation works to migrate all traffic to a healthy cluster with minimum user impact. If you are interested in learning more about this practice, this video has more details.

Phase 2: Scale 100s Data Movement Use Cases (2016)

Context

After shipping the initial Keystone MVP and migrating a few internal customers, we gradually gained trust and word spread to other engineering teams. Streaming gained momentum in Netflix because it was now easy to move logs for analytical processing and to gain on-demand operational insights.

It was time for us to scale for general customers.

(Figure: evolving Keystone architecture diagram, circa 2016. Keystone includes Kafka and Flink engines as its core components. For more technical design details, please refer to blog posts focused on Kafka and Flink)

Challenges

Challenge 1: Increased operational burden. We initially offered white-glove assistance to onboard new customers. That quickly became unsustainable given the growing demand. We needed to evolve the MVP to support more than just a dozen customers, which meant rebuilding some components (e.g., it was time to turn the spreadsheet into a proper database-backed control plane).

Challenge 2: Diverse needs arose. As we got more customer requests, we started seeing very diverse needs. There were two major categories:

  • One group preferred a fully managed service that’s simple to use.
  • The other group preferred flexibility and needed complex computation capabilities to solve more advanced business problems; they were OK with taking the pager and even managing some infrastructure.

We couldn't do both well at the same time.

Challenge 3: We broke everything we touched. No kidding, we broke pretty much all of our dependent services at some point due to our scale. We broke AWS S3 once. We found many bugs in Kafka and Flink. We broke Titus (the managed container infrastructure) multiple times and discovered weird CPU isolation and network issues. We broke Spinnaker (the continuous deployment platform) when we started to manage hundreds of customer deployments programmatically.

Luckily those teams were also among the best. They worked with us to solve these problems one by one. These efforts were critical to maturing the entire streaming ecosystem.

Stream Processing Patterns Summary in Phase 2

| Pattern | Product | Example Use Cases |
|---|---|---|
| Data Routing | Keystone | Logging, Data Movement (+ at scale) |
| RT Data Sampling / Discovery | Mantis | Cost-effective RT Insights |
| RT Alerts / Dashboard | Mantis, Kafka | SPS Alert, + Infrastructure Health Monitoring (Cassandra & Elasticsearch), + RT QoE monitoring |

Strategy Bets

Bet 1: Focus on simplicity first versus exposing infrastructure complexities to users. We decided to first focus on a highly abstracted, fully managed service for the general streaming use cases for two reasons.

  1. It’d allow us to address most data movement and simple streaming ETL (e.g., projection, filtering) use cases. Providing such simple, high-level abstraction for data routing would enable engineers across all Netflix organizations to use the data routing as a “lego” building block in conjunction with other Platform services.
  2. It’d enable our users to focus on business logic.

We'd tackle the more advanced use cases later.
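To give a flavor of what a "lego block" routing pipeline reduces to under the hood, here is a minimal Flink DataStream sketch that filters and projects log events. The event format and filtering rule are made up, and a real Keystone route would read from and write to managed Kafka topics and other sinks rather than in-memory elements and stdout; the point of the managed platform was that users could declare this kind of projection/filter without writing or operating such code themselves.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimpleRoutingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka source: raw log lines of the form "country|device|titleId|eventType".
        DataStream<String> rawLogs = env.fromElements(
                "US|TV|t-1|play_start",
                "BR|Mobile|t-2|ui_click",
                "JP|Web|t-3|play_start");

        rawLogs
                // Filtering: keep only playback events (an illustrative routing rule).
                .filter(line -> line.endsWith("play_start"))
                // Projection: drop the fields the downstream sink doesn't need.
                .map(line -> {
                    String[] parts = line.split("\\|");
                    return parts[0] + "," + parts[2]; // country, titleId
                })
                // Stand-in for a sink (e.g., a warehouse table or another Kafka topic).
                .print();

        env.execute("simple-routing-job");
    }
}
```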

Bet 2: Invest in a fully managed, multi-tenant, self-service platform versus continuing with manual white-glove support. We had to focus on automating the control plane and workload deployment. Customer workloads needed to be fully isolated: we decided that one customer's workload should not interfere with another customer's in any way.

(Figure: Keystone UI, showing a self-serve drag-and-drop experience powered by a fully managed, multi-tenant streaming infrastructure)
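To hint at what "turning the spreadsheet into a proper control plane" and automating deployments imply, here is a purely hypothetical sketch of a declarative pipeline specification that a self-service control plane might persist, validate, and reconcile into running, isolated jobs. Every field name here is invented for illustration and is not Keystone's actual data model.

```java
import java.util.List;
import java.util.Map;

public class PipelineSpecDemo {
    // Hypothetical declarative spec: the user states intent (source, sinks, filter),
    // and the control plane owns provisioning, deployment, isolation, and monitoring.
    record PipelineSpec(
            String pipelineId,
            String ownerEmail,          // who gets notified or paged
            String sourceTopic,         // e.g., a managed Kafka topic
            List<String> sinks,         // e.g., "hive:playback_events", "elasticsearch:qoe"
            Map<String, String> filter, // simple key/value predicates applied by the managed router
            int parallelism) {

        // The control plane validates specs before reconciling them into running jobs.
        void validate() {
            if (pipelineId == null || pipelineId.isBlank()) throw new IllegalArgumentException("pipelineId required");
            if (sinks == null || sinks.isEmpty()) throw new IllegalArgumentException("at least one sink required");
            if (parallelism <= 0) throw new IllegalArgumentException("parallelism must be positive");
        }
    }

    public static void main(String[] args) {
        PipelineSpec spec = new PipelineSpec(
                "playback-to-warehouse",
                "owner@example.com",
                "playback-events",
                List.of("hive:playback_events"),
                Map.of("eventType", "play_start"),
                8);
        spec.validate();
        System.out.println("spec accepted: " + spec.pipelineId());
    }
}
```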

Bet 3: Continue to invest in DevOps versus delaying it. We wanted to ship platform changes multiple times a day as needed. We also believed it was necessary to let our customers ship changes anytime they wanted. A deployment should be done automatically and rolled out to production safely within minutes of the customer initiating it.

Learnings

Learning 1: Deciding what NOT to work on is hard, but necessary. While it's important to address customer requests, some can be distracting. Prioritization is the first step, but consciously deciding and communicating what to cut is even more critical. Saying no is hard. But be aware that saying no is temporary, while saying yes is permanent.

Learning 2: Pay attention to scaling velocity. It's an exciting time after the initial product-market fit is validated. However, scaling too fast risks pulling the team in multiple directions, leaving behind loads of tech debt, and breaking customer trust. Scaling too slowly leaves the team unmotivated and customer needs unfulfilled for too long, which also breaks their trust. It's a delicate balance, but here are some signals you can watch for:

  • Software quality. Does the deployment rollback frequency change? How often do the partner teams block the team? Do tests fail more often now? How often do incidents happen due to any system bottleneck?
  • Customer sentiment. Is the growth in customer support requests sublinear to the growth in use cases? How are SLO violations trending? Do customers show excitement about new feature announcements? When there is an "urgent" ask from a customer, what alternatives have they considered?
  • Operational overhead. Does the team’s operation-to-development time ratio change? Does the RCA-to-incident ratio change? Is the team burned out from operation toil? Does the team’s innovation frequency change (e.g., blog posts, conference talks)?

Learning 3: Educate your users and be patient in correcting misconceptions. There were many stream processing misconceptions around data quality, such as event losses or duplications, or around processing semantics, such as correctness guarantees under failure scenarios. Many of these misconceptions came from an earlier time when stream processing was not mature, but a lot has evolved since. Be patient with your users and educate them with data and stories!

Phase 3: Support Custom Needs and Scale Beyond 1000s Use Cases (2017–2019)

Context

A year after the initial launch of the Keystone product, the number of use cases quickly rose from fewer than a dozen in 2015 to a few hundred in 2017, across all organizations. We had built a solid operational foundation at this point: customers were rarely paged during their on-call rotations, and all infrastructure issues were closely monitored and handled by the platform team. We had built a robust delivery platform that helped our customers introduce changes into production just minutes after expressing the intent.

The Keystone product was very good at what it was originally designed to do: a streaming data routing platform that is easy to use and almost infinitely scalable. However, it was becoming apparent that the full potential of stream processing was far from being realized. To satisfy custom use cases, we constantly stumbled upon new needs for more granular control over complex processing capabilities, e.g., streaming joins, windowing, etc.

Meanwhile, Netflix has a unique Freedom and Responsibility culture in which each team is empowered to make its own technical decisions, even to the point of investing in bespoke solutions. As the central platform team, we noticed this growth. If we didn't have a way to provide coverage, it would mean high long-term costs to the company.

It was time for the team to expand its scope. Again, we faced some new challenges.

Challenges

Challenge 1: Custom use cases require a different developer & operation experience.

Let me first give two examples of custom stream processing use cases.

  1. Compute streaming ground truth for recommendations. For the Netflix recommendation algorithm to provide the best possible experience, it's necessary to train the model with the latest data. One of the inputs to model training is the label dataset. Labels are the direct ground-truth indicator of whether the previous personalized predictions were accurate: if the user decides to watch a recommended title, we have a positive label. As you can guess, the faster we can get this label dataset, the faster the entire ML feedback loop turns. To compute the labels, we need to join the impressions stream with the user-clicks stream. However, user click activity can often arrive late: users sometimes take a few minutes to decide, or simply leave their devices on without watching for hours. The use case requires the streaming pipeline to emit labels as soon as all relevant information arrives while still tolerating late-arriving information.
  2. Compute take-fraction for recommendations. Netflix makes personalized recommendations to optimize the user experience. One example is an algorithm that selects the best personalized artwork (and the best location to show it) to optimize the user take-fraction. Under the hood, a stream processing pipeline computes this take-fraction metric in near real time by joining the playback data stream with the impression stream over a custom window. Because Netflix's user base numbers in the hundreds of millions, the streaming job needs to constantly checkpoint internal state ranging between 1 and 10 TB.
(Figure: A/B test to select the best artwork to personalize for the user; reference: Selecting the best artwork for videos through A/B testing, by Netflix)

These use cases involve more advanced stream processing capabilities, such as complex event-time/processing-time and window semantics, allowed lateness, and large-state checkpoint management. They also require more operational support around observability, troubleshooting, and recovery. A brand new developer experience was necessary, including more flexible programming interfaces and operational capabilities such as a custom observability stack, backfill capabilities, and proper infrastructure to manage tens of terabytes of local state. We didn't have these in Keystone, and we needed to build a new product entry point while minimizing redundant investment.
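Here is a minimal, self-contained sketch of the kind of event-time join the label use case describes, using Flink's DataStream interval join. The class names, fields, and the three-hour join bound are illustrative assumptions rather than Netflix's actual pipeline, and a production label job would also need to emit negative labels when no play ever arrives (e.g., via timers), which this sketch omits.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class LabelJoinJob {
    // Minimal POJOs standing in for the real impression and play events (fields are illustrative).
    public static class Impression {
        public String profileId; public String titleId; public long eventTimeMs;
        public Impression() {}
        public Impression(String profileId, String titleId, long eventTimeMs) {
            this.profileId = profileId; this.titleId = titleId; this.eventTimeMs = eventTimeMs;
        }
    }
    public static class Play {
        public String profileId; public String titleId; public long eventTimeMs;
        public Play() {}
        public Play(String profileId, String titleId, long eventTimeMs) {
            this.profileId = profileId; this.titleId = titleId; this.eventTimeMs = eventTimeMs;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Event-time semantics: tolerate out-of-order events up to 10 minutes.
        DataStream<Impression> impressions = env
                .fromElements(new Impression("p-1", "t-9", 1_000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<Impression>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                        .withTimestampAssigner((e, ts) -> e.eventTimeMs));
        DataStream<Play> plays = env
                .fromElements(new Play("p-1", "t-9", 600_000L)) // user pressed play ~10 minutes later
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<Play>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                        .withTimestampAssigner((e, ts) -> e.eventTimeMs));

        // Interval join: a play within 3 hours after an impression yields a positive label.
        impressions
                .keyBy(i -> i.profileId + "|" + i.titleId)
                .intervalJoin(plays.keyBy(p -> p.profileId + "|" + p.titleId))
                .between(Time.seconds(0), Time.hours(3))
                .process(new ProcessJoinFunction<Impression, Play, String>() {
                    @Override
                    public void processElement(Impression impression, Play play, Context ctx, Collector<String> out) {
                        out.collect("positive_label profile=" + impression.profileId + " title=" + impression.titleId);
                    }
                })
                .print();

        env.execute("label-join-job");
    }
}
```

The join keeps per-key state until the upper bound of the interval has passed, which is where the large, checkpointed state mentioned above comes from at Netflix's scale.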

Challenge 2: Balancing flexibility and simplicity. With all the new custom use cases, we had to figure out the proper level of control to expose. We could go all the way and expose the lowest-level APIs, at the cost of much more challenging operations (because we would never be able to fully anticipate how users would use the engine). Or we could go mid-way (e.g., expose limited functionality) at the risk of unsatisfied customers.

Challenge 3: Increased operation complexity. Supporting the custom use cases required us to increase the degree of freedom of the platform. As a result, we needed to improve operation observability in many complex scenarios. At the same time, as the platform integrated with many other data products, the touchpoints of our system increased, demanding operational coordination with other teams to serve our collective customers better.

Challenge 4: Central platform vs. local platforms. Our team's responsibility was to provide a centralized stream processing platform. But due to the previous strategy of focusing on simplicity, some teams had already invested in their own local stream processing platforms using unsupported technology, e.g., Spark Streaming. We would have to convince them to move back to the paved path, because otherwise they risked losing leverage from the platform and wasting resources on redundant investment. Expanding into custom use cases made this the right time.

Stream Processing Patterns Summary in Phase 3

| Pattern | Product | Example Use Cases |
|---|---|---|
| Stream-to-stream Joins (ETL) | Flink | Take-fraction computation, Recsys label computation |
| Stream-to-table Joins (ETL) | Flink | Side input: join streams with slow-moving Iceberg table |
| Streaming Sessionization (ETL) | Flink | Personalization sessionization, Metrics sessionization |
| RT Observability | Mantis | Distributed tracing, Chaos experiment monitoring, Application monitoring |
| RT Anomaly / Fraud Detection | Mantis, Flink | Contextual alert, PII detection, Fraudulent login prevention |
| RT DevOps Decision Tool | Mantis | Autoscaling, Streaming ACA & A/B tests, CDN placement optimization |
| Event Sourced Materialized View | Flink | Content Delivery Network snapshotting |

Strategy Bets

Bet 1: Build a new product entry point but refactor existing architecture versus building a new product in isolation. On the analytical processing side, we decided to spin out a new platform from the original architecture to expose the full power of Stream Processing with Apache Flink. We would create a new internal customer segment from scratch, but we also decided it was the right time to refactor the architecture to minimize redundant investment (between the Keystone and Flink platforms). The lower Flink platform supports both Keystone and custom use cases in this new architecture.

(Figure: Architecture splitting Flink Platform as a separate product entry point)

Bet 2: Start with streaming ETL and observability use cases versus tackling all custom use cases at once. There were many opportunities, but we decided to focus on streaming ETL use cases on the analytical side and real-time observability use cases on the operational side. These use cases were the most challenging due to their complexity and scale. With the intent to expose the full power of stream processing, it made sense to tackle and learn from the most difficult ones first.

Bet 3: Share operational responsibilities with customers initially, and gradually co-innovate to lower the burden over time. We were lucky that our early adopters were self-sufficient, and we also offered a white-glove support model whenever customers struggled. We gradually expanded operational investments such as autoscaling, managed deployments, intelligent alerting, backfill solutions, etc.

Learnings

Learning 1: A new product entry point to support new custom use cases is a necessary evolution step. It’s also an opportunity to re-architect/refactor and fit into the existing product ecosystem. Do not be lured into building a new system in isolation. Avoid the second-system effect.

Learning 2: Simplicity attracts 80% of the use cases; flexibility helps the bigger ones. In retrospect, these are our observations of the actual customer segments over the past few years. The point I want to convey is that whether to prioritize the majority of use cases or the higher-impact ones depends entirely on the situation. The argument can be made both ways, but you should articulate the reasoning that fits your business scenario.

Simplicity and flexibility are NOT two polarizing ends of a spectrum; they form a closed innovation feedback loop. The power of flexibility drives new co-innovation with a small subset of customers. In the beginning, these innovations may be more expensive, but once proven, their value eventually becomes a commodity and folds back into the simplified experience. As this new value helps grow the customer base, a small subset of the new users will ask for the power of flexibility again.

Learning 3: Treat your early adopters well. They are the most loyal customers and will do the marketing for you for free. Thanks to the endorsement from our early adopters, our use cases exploded to thousands during this phase.

Learning 4: When things break, don't panic. Trust all the people around you. Bonus points if you already have a community supporting the product. I remember a time when we experienced a slow degradation of the entire platform. Every day we were getting loads of pages, and we could not figure out the root cause for two weeks. It was terrifying; the team was miserable, and customers were suffering. But the team was able to work across team boundaries, involving the folks with the right expertise and using data to logically reason through the symptoms. Eventually, we found a bug in the Linux kernel that caused a slow memory leak specific to the streaming workload. We had to trust all the people involved, sometimes even when we didn't have all the expertise ourselves!

Phase 4: Expand Stream Processing Responsibilities — Challenges and Opportunities ahead (2020 — Present)

(Figure: how stream processing fits in Netflix — 2021)

Context

As stream processing use cases expanded to all organizations in Netflix, we discovered new patterns and enjoyed early success. But it was not the time for complacency.

Netflix, as a business, continued to explore new frontiers and made heavy investments in Content Production Studio, and more recently in Gaming. A series of new challenges emerged, and we jumped into solving these interesting problem spaces.

Challenges

Challenge 1: Diverse data technologies made coordination difficult. Since teams are empowered to make their own decisions, many teams in Netflix use a variety of data technologies. On the transactional side, for instance, there are Cassandra, MySQL, Postgres, CockroachDB, distributed caches, etc. On the analytical side, there are Hive, Iceberg, Presto, Spark, Pig, Druid, Elasticsearch, etc. Many copies of the same data are often stored across different data stores within the Netflix data ecosystem.

With many choices available, it’s human nature to put technologies in dividing buckets. Batch vs. streaming. Transactional store vs. analytical store. Online processing vs. offline processing. These are all commonly debated topics in the data world. The overlapping dividing boundaries often add more confusion to the end-users.

Today, coordinating and working with data across technology boundaries is incredibly challenging. It is hard to push frontiers when boundaries divide the landscape.

Challenge 2: Steeper learning curve. With an ever-increasing amount of available data tools and continued deepening specialization, it is challenging for users to learn and decide what technology fits into a specific use case.

Challenge 3: ML practices are not leveraging the full power of the Data Platform. All the previously mentioned challenges take a toll on ML practices. Data scientists' feedback loops are long, data engineers' productivity suffers, and product engineers have trouble sharing valuable data. Ultimately, many businesses lose opportunities to adapt to the fast-changing market.

Challenge 4: Scale limits on the Central Platform model. As the central data platform scales use cases at a superlinear rate, it’s unsustainable to have a single point of contact for support. It’s the right time to evaluate a model where the central platform supports local-central platforms for added leverage (this means we’ll prioritize supporting the local platforms that are built on top of our platform).

Opportunities

I'll be relatively brief in this section and expand the details in future blog posts.

Use streams to connect worlds. Beyond its strength in low-latency processing, stream processing is increasingly showing an even more critical strength in modern data platforms: connecting various technologies and enabling fluid data exchange. Technologies such as Change Data Capture (CDC), streaming materialized views, and Data Mesh concepts are gaining popularity. Finally, Martin Kleppmann's 2015 vision of "turning the database inside out" is starting to realize its value.
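As a toy illustration of the "turning the database inside out" idea (not any specific Netflix system), the sketch below applies a stream of change-data-capture events to maintain a continuously updated materialized view, which is conceptually what CDC-plus-stream-processing pipelines do across data stores at scale.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaterializedViewDemo {
    // A simplified CDC record: operation + key + value (field names are illustrative).
    record ChangeEvent(String op, String titleId, String titleName) {} // op is "upsert" or "delete"

    public static void main(String[] args) {
        // Stand-in for a changelog stream (e.g., CDC events flowing through Kafka).
        List<ChangeEvent> changelog = List.of(
                new ChangeEvent("upsert", "t-1", "Title A"),
                new ChangeEvent("upsert", "t-2", "Title B"),
                new ChangeEvent("delete", "t-2", null),
                new ChangeEvent("upsert", "t-1", "Title A (updated)"));

        // The materialized view: derived state continuously updated from the changelog,
        // rather than queried out of the source-of-truth database on demand.
        Map<String, String> titleView = new HashMap<>();
        for (ChangeEvent event : changelog) {
            if ("delete".equals(event.op())) {
                titleView.remove(event.titleId());
            } else {
                titleView.put(event.titleId(), event.titleName());
            }
        }
        System.out.println(titleView); // {t-1=Title A (updated)}
    }
}
```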

Raise abstraction by combining the best of simplicity and flexibility. There is much value in understanding the deep internals of various data technologies, but not everyone needs to do so. This is especially true as cloud-first data infrastructure becomes a commodity. Properly raising the data infrastructure abstraction is the immediate opportunity to make all the advanced capabilities easily accessible to a broader audience. Technologies such as streaming SQL will lower the barrier to entry (a small sketch follows below), but that's just the beginning. The Data Platform should also raise abstraction so that the dividing boundaries (e.g., streaming vs. batch) are invisible to end users.

(Figure: the sweet spot between simplicity and flexibility)
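For a flavor of how streaming SQL lowers the barrier to entry, here is a small Flink SQL sketch. It uses Flink's built-in datagen connector so it runs self-contained, and the column names and query are illustrative rather than any Netflix pipeline: a continuous per-window count expressed in a few lines of SQL instead of a hand-written DataStream job.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingSqlDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A synthetic, unbounded stream of playback events (the datagen connector ships with Flink).
        tEnv.executeSql(
                "CREATE TABLE playback_events (" +
                "  title_id INT," +
                "  event_type STRING," +
                "  ts AS PROCTIME()" +     // processing-time attribute used for windowing
                ") WITH (" +
                "  'connector' = 'datagen'," +
                "  'rows-per-second' = '10'" +
                ")");

        // A continuous query: events per title over 10-second tumbling windows, printed until cancelled.
        tEnv.executeSql(
                "SELECT title_id, " +
                "       TUMBLE_END(ts, INTERVAL '10' SECOND) AS window_end, " +
                "       COUNT(*) AS events " +
                "FROM playback_events " +
                "GROUP BY title_id, TUMBLE(ts, INTERVAL '10' SECOND)")
            .print();
    }
}
```

The same query shape could read from a Kafka-backed table instead of datagen, which is where the "boundaries become invisible" argument comes in: the user declares what they want, and the platform decides how to run it.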

Machine learning needs more love from modern data platforms. ML folks are arguably the most impactful to the business and the most under-served group among all the developer personas. All ML platforms depend on data storage and processing, so the Data Platform has many opportunities to extend a helping hand to the ML world: data quality, reliability & scalability, the dev-to-prod feedback loop, real-time observability, overall efficiency, etc.

Stream Processing Patterns Summary in Phase 4

| Pattern | Product | Example Use Cases |
|---|---|---|
| Streaming Backfill / Restatement | Flink | Pipeline failure mitigation, Avoiding cold starts |
| Data Quality Control | Keystone, Flink | Schema evolution management, Data quality SLA, Cost reduction via Avro compression |
| Source/Sink-Agnostic Data Synchronization | Keystone, Flink | Delta, Data Mesh, Operational reporting, Notification, Search indexing pipeline |
| Near-real-time (NRT) Inference | Flink | Customer service recommendation, Intent-based in-session adaptations |
| Streaming SQL | Flink | Dynamic feature engineering |
| Intelligent Operation | | Auto-diagnosis & remediation |

What’s the Next Frontier

Thank you for reading this far. This blog post describes the high-level iterative journey of building out the stream processing infrastructure at Netflix. I would love to hear your feedback on what's interesting so that I can follow up with future blog posts.

By design, I omitted many technical details in this post. But if you are interested in finding out more, please refer to the Appendix section for a complete timeline view of all the stream processing innovations in Netflix.

I am very excited about the future of Data Infrastructure, especially to support better Machine Learning experiences. I believe this is the next frontier we shall go boldly towards! If you are intrigued, I highly recommend an excellent read “Real-time machine learning: challenges and solutions” from my amazing friend & colleague Chip.

I am also excited to announce that I am starting a new journey with Chip Huyen to work on a streaming-first Machine Learning platform. We are still very early, and we are looking for a founding infrastructure engineer to shape the future together! If you are remotely intrigued, we would love to hear from you!

If this blog post resonates with you, please reach out. I’d like to connect!

Appendix

Stream Processing Patterns in Netflix

| Pattern | Phase | Example Use Cases |
|---|---|---|
| Data Routing | 1 | Logging, Data Movement |
| RT Alerts / Dashboard | 1, 2 | SPS Alert, Infrastructure Health Monitoring (Cassandra & Elasticsearch), RT QoE monitoring |
| RT Data Sampling / Discovery | 2 | Cost-effective RT Insights |
| Stream-to-stream Joins (ETL) | 3 | Take-fraction computation, Recsys label computation |
| Stream-to-table Joins (ETL) | 3 | Side input: join stream with slow-moving Iceberg table |
| Streaming Sessionization (ETL) | 3 | Personalization sessionization, Metrics sessionization |
| RT Observability | 3 | Distributed tracing, Chaos experiment monitoring, Application monitoring |
| RT Anomaly / Fraud Detection | 3 | Contextual alert, PII detection, Fraudulent login prevention |
| RT DevOps Decision Tool | 3 | Autoscaling, Streaming ACA & A/B tests, CDN placement optimization |
| Event Sourced Materialized View | 3 | Content Delivery Network snapshotting |
| Streaming Backfill / Restatement | 4 | Pipeline failure mitigation, Avoiding cold starts |
| Data Quality Control | 4 | Schema evolution management, Data quality SLA, Cost reduction via Avro compression |
| Source/Sink-Agnostic Data Synchronization | 4 | Delta, Data Mesh, Operational reporting, Notification, Search indexing pipeline |
| Near-real-time (NRT) Inference | 4 | Customer service recommendation, Intent-based in-session adaptations |
| Streaming SQL | 4 | Dynamic feature engineering |
| Intelligent Operation | 4 | Auto-diagnosis & remediation |

References

Indexed in chronological order …

