
Next Generation of Data Movement and Processing Platform at Netflix

source link: https://www.infoq.com/news/2022/08/netflix-data-mesh/?itm_source=infoq&itm_medium=popular_widget&itm_campaign=popular_content_list&itm_content=


Aug 29, 2022 2 min read

Netflix engineering recently published a tech blog post describing how they used data mesh architecture and principles to build the next generation of their data movement and processing platform, unlocking more business use cases and opportunities.

Data mesh is a paradigm shift in data management that enables users to easily access and use data without moving it to a centralized location such as a data lake. It focuses on decentralizing data ownership and distributing it across business domains. Each domain manages and governs its data as a product, making it discoverable and trusted according to the required business service-level objectives (SLOs).
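To make the "data as a product" idea concrete, the sketch below models a minimal catalog entry that a domain might publish: an owned product with a freshness SLO that determines whether consumers can trust it. All names and fields here are illustrative assumptions, not Netflix's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical data-mesh catalog entry: a domain-owned data product."""
    name: str                # e.g. "playback-events"
    owner_domain: str        # the business domain that owns and governs it
    freshness_slo_min: int   # max allowed staleness, in minutes (the SLO)
    tags: list = field(default_factory=list)  # discoverability metadata

    def meets_slo(self, observed_staleness_min: int) -> bool:
        # A product is "trusted" while observed staleness stays within its SLO.
        return observed_staleness_min <= self.freshness_slo_min

product = DataProduct("playback-events", "streaming", freshness_slo_min=15)
print(product.meets_slo(10))  # True: 10 min of staleness is within the 15 min SLO
```

The key design point is that the SLO travels with the product definition itself, so trust can be evaluated by any consuming domain without a central authority.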

The data mesh architecture at Netflix is composed of two main layers: the control plane (controllers) and the data plane (data pipelines). The blog post defines them as follows:

The controller receives user requests and deploys and orchestrates pipelines. Once deployed, the pipeline performs the actual heavy lifting data processing work. Provisioning a pipeline involves different resources. The controller delegates the responsibility to the corresponding microservices to manage their life cycle.
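The delegation pattern described above can be sketched as follows: the controller does not provision resources itself; it hands each resource kind to the microservice that owns its life cycle. The class and resource names are hypothetical, chosen only to illustrate the shape of the flow.

```python
class ResourceService:
    """Stand-in for a microservice that owns one resource kind's life cycle."""
    def __init__(self, kind: str):
        self.kind = kind
        self.provisioned = []  # pipeline ids this service has provisioned for

    def provision(self, pipeline_id: str) -> str:
        self.provisioned.append(pipeline_id)
        return f"{self.kind}:{pipeline_id}"  # opaque handle to the resource

class Controller:
    """Control plane: receives a deploy request and delegates provisioning."""
    def __init__(self, services: dict):
        self.services = services  # resource kind -> owning microservice

    def deploy(self, pipeline_id: str, required_resources: list) -> list:
        # Delegate each resource kind to its owning service.
        return [self.services[kind].provision(pipeline_id)
                for kind in required_resources]

services = {kind: ResourceService(kind)
            for kind in ("connector", "processor", "transport")}
controller = Controller(services)
handles = controller.deploy("pipeline-42", ["connector", "processor", "transport"])
print(handles)  # ['connector:pipeline-42', 'processor:pipeline-42', 'transport:pipeline-42']
```

Once deployed, the pipeline (data plane) does the heavy lifting; the controller only tracks which services own which resources.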

The following diagram shows the high-level architecture of the data mesh at Netflix:


Netflix high-level data mesh architecture

The main operations happen in pipelines. A pipeline reads data from various sources, applies processing steps, and transports the results to the destination. The controllers are responsible for determining the resources associated with each pipeline and calculating the correct configuration. The following diagram shows an example of the pipeline architecture:


Pipelines high-level architecture
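The read-process-write shape of a pipeline can be illustrated with a minimal sketch. In a real Netflix pipeline these stages would run as streaming operators (e.g. in Flink); this stdlib-only version, with made-up records, only shows the structure.

```python
def run_pipeline(source, processors, sink):
    """Minimal pipeline: read each record, apply processors in order, write out."""
    for record in source:
        for process in processors:
            record = process(record)  # each processor transforms the record
        sink.append(record)           # deliver the result to the destination

# Hypothetical domain records (viewing durations in minutes).
source = [{"title": "show-a", "minutes": 30},
          {"title": "show-b", "minutes": 95}]

# One processing step: derive hours from minutes.
to_hours = lambda r: {**r, "hours": round(r["minutes"] / 60, 2)}

sink = []
run_pipeline(source, [to_hours], sink)
print(sink[0]["hours"])  # 0.5
```

The controller's job, by contrast, is only to wire up `source`, `processors`, and `sink` with the correct configuration; the pipeline itself does the data work.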

Data sources are domain data owned by each business unit and processed by different processors in the system. Engineers mainly use Apache Flink for real-time data processing. Connectors are the main elements that initiate data transfer: they monitor data sources and produce change data capture (CDC) events into the data mesh. Apache Kafka acts as the main data transport component in the data mesh. Data schemas and a catalog are essential for making data searchable and visible across business domains; Netflix uses Apache Avro as the standard schema format among domains.
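As a hedged illustration of the connector side, the sketch below shows a CDC event as a connector might emit it onto Kafka, together with an Avro-style record schema used to check the payload's fields. A real deployment would use the Avro libraries and a schema registry; this stdlib-only check, with a hypothetical `MovieChange` schema, is only for illustration.

```python
# Avro-style record schema (structure follows the Avro spec; fields are made up).
MOVIE_SCHEMA = {
    "type": "record",
    "name": "MovieChange",
    "fields": [
        {"name": "op", "type": "string"},       # "insert" | "update" | "delete"
        {"name": "movie_id", "type": "long"},
        {"name": "title", "type": "string"},
    ],
}

def conforms(event: dict, schema: dict) -> bool:
    """Shallow check: every schema field must be present in the event.
    (Real Avro validation also checks types and handles unions/defaults.)"""
    return all(f["name"] in event for f in schema["fields"])

# A CDC event as a connector might produce it after observing a source change.
cdc_event = {"op": "update", "movie_id": 123, "title": "Stranger Things"}
print(conforms(cdc_event, MOVIE_SCHEMA))  # True
```

Sharing one schema format (Avro) across domains is what makes events produced by one domain's connector readable and discoverable by any other domain's consumers.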

Many businesses have already started shifting to this paradigm based on their needs. Chris Damour, one of the readers, commented on the blog post:

We have also implemented the features to track the data lineage so that our users can have a better picture of the overall data usage. I'd love to see a blog on that, data provenance has been a real struggle in our data meshes, we get the "same".

Enterprises such as Intuit and Zalando are now planning or shifting their data platforms based on this architecture. Many cloud providers, including Amazon, Google, and Microsoft, also provide data mesh solutions and services for their customers.

About the Author

Reza Rahimi

Reza is a senior engineering manager at Dropbox, leading personalization, consumer targeting, and revenue-related ML products. He has more than 15 years of industry experience in big data management, analytics, and applied machine learning across different business segments (software as a service, consumer web, cloud services, consumer electronics, and healthcare insurance). He received his Ph.D. in computer science from UC Irvine and has published about 20 books, journal articles, and conference papers in the areas of machine learning, cloud computing, and distributed systems. He is a member of the Harvard Business Review Advisory Council, a member of the MIT Technology Review Global Panel, and a senior member of the IEEE.

