Owl: Meta's New Hot-Content Distribution System

Aug 31, 2022 2 min read

Meta recently revealed Owl, their new hot-content distribution system that provides high-fanout distribution of large data objects to hosts in Meta's private cloud.

Owl consists of a decentralized data plane based on peer-to-peer distribution trees with a centralized control plane - tracker services that keep metadata about the peers, their cache state, and ongoing downloads. It also provides a configurable policy interface that customizes varied distribution use cases.

Before Owl, Meta had tried three different systems to solve this problem.

The first implementation used hierarchical caching to provide centralized content distribution. This system was easy to operate but had various issues. It required many machines for caching. Sudden load spikes led to the throttling of requests. Provisioning was challenging due to the system's high infrastructure demands.

In the next set of implementations, Meta engineers used two highly decentralized solutions: a location-aware BitTorrent implementation and a static, hash-based peer-to-peer distribution tree.

These highly decentralized systems scaled better than hierarchical caching. But they introduced a new set of challenges.

In these systems, the peers decided where to fetch and what to cache. Each peer made independent decisions, resulting in suboptimal results on where to fetch. To get the status and health of the system, engineers had to collect and aggregate the data from peers.

As a result, the highly decentralized systems were inefficient and difficult to maintain, while the highly centralized systems did not scale well.

To solve this problem, the engineers at Meta decided to use the split approach. In the new design, peers are simple and provide the mechanism for caching and transferring data chunks, while the central control plane is composed of trackers that identify the sources from which peers should get each chunk of content, when and how to cache fetched content, and how to retry failed downloads.

A request is sent to the tracker when a client asks for a chunk. The tracker asks the Superpeer to get the chunk from the external storage, cache it and then ask the client to download it.

Superpeers are tasks running the Owl peer library as a standalone process without any client. They have large and long-lived caches. They can also fetch data from external storage systems that a regular peer cannot fetch.

The team had anticipated that the capacity of a single tracker would not be enough as usage of Owl increased. Thus, they added the capability to shard peers across multiple trackers.

To ensure security, all communications between Owl components are encrypted, and all RPCs are checked against access control lists. When reading chunks from external sources, Owl generates an internal checksum, passes it with the chunk data, and validates it before returning it to clients, thus maintaining integrity across the process.

1Screenshot%202022-08-30%20at%2012.26.45%20PM-1661870464595.png

Image source: the original article on the Engineering at Meta blog

Another challenge was that some clients needed low latency, while others wanted to reduce the load on external storage to avoid throttling.

Since the distribution policies are at the tracker level, teams now have the flexibility to update the policies quickly and customize the content distribution per each client type.

As mentioned in the paper, Owl distributes over 700 PB of hot content daily to millions of peers at Meta with a cache hit rate of over 90%.

About the Author

Tanmay Deshpande

Tanmay has extensive experience architecting, scaling, and delivering large-scale, complex projects. As a technology expert, he stays on top of the latest trends in Cloud, DevOps, SRE, Distributed Systems, and Application Development. His passion is leveraging technology to transform and optimize business and operating models.

In partnership with Directors and other members of senior leadership, he works to define a long-term vision for his organization, taking into account both the business and market dynamics, as well as the technical capabilities and limitations of the company’s software. His time is dedicated to coaching and mentoring his teammates (especially those looking to become Senior/Enterprise Architects). He takes into account their skills, backgrounds, and working styles in order to provide thoughtful, constructive feedback.

He is an avid technology blogger (https://deshpandetanmay.medium.com/) and publishes a weekly newsletter of curated articles(https://tanmaydeshpande.substack.com/)

Owl: Meta's New Hot-Content Distribution System

Owl: Meta's New Hot-Content Distribution System

About the Author

Tanmay Deshpande

Recommend

Time Ontology in OWL

W3C Recommendation: Time Ontology in OWL

Understanding The Owl Document Management Permissioning Model

The Owl's Right Eye - Unintended Consequences

Full OWL Resources for Grades 7-12

lobotomized owl selector demo

mojida-based-night-owl.omp.json

Blizzard is hoping Overwatch 2 leads to a superb OWL season 5

Owl: Distributing content at Meta scale

Hot Keys, Scalability, and the Zipf Distribution

About Joyk