Tales of Kafka at Cloudflare

source link: https://www.infoq.com/news/2023/04/cloudflare-kafka-lessons-learned/?itm_source=infoq&itm_medium=popular_widget&itm_campaign=popular_content_list&itm_content=

Apr 03, 2023 1 min read

At QCon London, Andrea Medda, senior systems engineer at Cloudflare, and Matt Boyle, engineering manager at Cloudflare, shared the lessons their platform services team learned from enabling the use of Apache Kafka at the scale of 1 trillion messages.

Boyle began by outlining the problems Cloudflare needs its technology to solve: running its own private and public cloud, and managing the coupling between teams that emerged as business needs grew and evolved. He then explained how Apache Kafka was selected as their implementation of the message bus pattern.

While the message bus pattern enabled the decoupling of load between microservices, Boyle explained that services still ended up tightly coupled because of an unstructured approach to schema management. To solve this problem, the team opted to migrate from JSON messages to Protobuf and to build a client-side library that validates messages before publishing them.
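To make the idea of validating before publishing concrete, here is a minimal sketch. It is not Cloudflare's library: the `UserCreated` event, the `ValidatingPublisher` class, and the `send` hook are all hypothetical names, a dataclass stands in for a generated Protobuf message, and JSON stands in for Protobuf serialization so the example needs only the standard library.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Callable


@dataclass(frozen=True)
class UserCreated:
    """Hypothetical event; stands in for a generated Protobuf message class."""
    user_id: int
    email: str


class SchemaError(ValueError):
    """Raised when a message does not match its declared field types."""


def validate(message) -> None:
    """Reject the message if any field holds a value of the wrong type."""
    for field in fields(message):
        value = getattr(message, field.name)
        if not isinstance(value, field.type):
            raise SchemaError(
                f"{type(message).__name__}.{field.name}: expected "
                f"{field.type.__name__}, got {type(value).__name__}"
            )


class ValidatingPublisher:
    """Wraps a send function and validates every message before sending."""

    def __init__(self, send: Callable[[str, bytes], None]) -> None:
        self._send = send  # assumed hook into a real Kafka producer

    def publish(self, topic: str, message) -> None:
        validate(message)  # fail fast, before anything reaches the broker
        # JSON keeps the sketch dependency-free; a Protobuf message would be
        # serialized to its binary wire format here instead.
        payload = json.dumps(asdict(message)).encode("utf-8")
        self._send(topic, payload)


# In-memory stand-in for a broker: collect what would have been sent.
sent = []
publisher = ValidatingPublisher(lambda topic, payload: sent.append((topic, payload)))
publisher.publish("user.created", UserCreated(user_id=42, email="a@example.com"))

try:
    # user_id is a string here, so the message never reaches the send hook.
    publisher.publish("user.created", UserCreated(user_id="42", email="a@example.com"))
except SchemaError as err:
    print(f"rejected before publish: {err}")
```

The point of the design is that a malformed message fails in the producing service, where the bug is, rather than in a downstream consumer owned by another team.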

As the adoption of Apache Kafka grew across their teams, they developed a Connector Framework to make it easier for teams to stream data between Apache Kafka and other systems while transforming the messages in the process.
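The shape of such a connector can be sketched as a reader, a chain of transformations, and a writer. The article does not describe the framework's actual interfaces, so everything below (`run_connector`, the record layout, the two example transforms) is an assumption for illustration, with plain Python lists standing in for Kafka and the target system.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]
# A transform returns the (possibly modified) record, or None to drop it.
Transform = Callable[[Record], Optional[Record]]


def run_connector(
    source: Iterable[Record],
    transforms: List[Transform],
    sink: Callable[[Record], None],
) -> int:
    """Read each record, apply transforms in order, write survivors to the sink.

    Returns the number of records written.
    """
    written = 0
    for record in source:
        current: Optional[Record] = record
        for transform in transforms:
            current = transform(current)
            if current is None:  # a transform filtered this record out
                break
        if current is not None:
            sink(current)
            written += 1
    return written


def drop_without_zone(record: Record) -> Optional[Record]:
    """Hypothetical filter: skip records that carry no zone identifier."""
    return record if "zone" in record else None


def rename_ts(record: Record) -> Record:
    """Hypothetical mapping: rename the 'ts' field to 'timestamp'."""
    out = dict(record)  # copy so the source record is left untouched
    if "ts" in out:
        out["timestamp"] = out.pop("ts")
    return out


# Lists stand in for a Kafka topic (source) and an external system (sink).
topic_records = [
    {"zone": "example.com", "ts": 1},
    {"ts": 2},
    {"zone": "example.org", "ts": 3},
]
destination: List[Record] = []
count = run_connector(topic_records, [drop_without_zone, rename_ts], destination.append)
```

Keeping the reader, transforms, and writer as separate pluggable pieces is what would let a team assemble a new pipeline from configuration rather than writing a new service each time.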

During the pandemic, as load on Cloudflare’s systems grew, the team began to observe bottlenecks on a key consumer, which had begun to breach its Service Level Agreements. Medda explained how the team's initial struggle to identify the root cause prompted them to enrich their software development kits (SDKs) with tooling from the OpenTelemetry ecosystem to gain better visibility into interactions across their stack.
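One way an SDK can give that kind of visibility is by carrying trace context in message headers, so that work done by a consumer can be linked back to the producer that triggered it. The sketch below does not use the OpenTelemetry API; it hand-builds a header in the W3C Trace Context `traceparent` format using only the standard library. The `produce` and `consume` functions are hypothetical stand-ins for SDK hooks.

```python
import secrets
from typing import Dict, Optional, Tuple

Message = Tuple[bytes, Dict[str, str]]  # (payload, headers)


def new_traceparent() -> str:
    """Start a new trace: version 00, random trace id and span id, sampled."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Keep the parent's trace id and flags, but mint a new span id."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def produce(payload: bytes, headers: Optional[Dict[str, str]] = None) -> Message:
    """Attach a traceparent header unless the caller already set one."""
    out = dict(headers or {})
    out.setdefault("traceparent", new_traceparent())
    return payload, out


def consume(message: Message) -> str:
    """Return the trace context the consumer should use for its own work."""
    _payload, headers = message
    parent = headers.get("traceparent")
    # Continue the producer's trace when present; otherwise start a fresh one.
    return child_traceparent(parent) if parent else new_traceparent()


message = produce(b"order placed")
consumer_context = consume(message)
```

Because the producer and consumer share a trace id, a tracing backend can show the time a message spent between the two, which is the kind of gap that is hard to spot from per-service metrics alone.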

Medda went on to highlight how the success of their SDKs brought more internal users, which spurred a need for better support in the form of documentation and ChatOps.

Medda summarized the key lessons as:

  • Striking the balance between highly configurable and simple standardized approaches when providing developer tooling for Apache Kafka
  • Opting for a simple and strict 1:1 contract interface to ensure maximum visibility into the workings of topics and their usage
  • Investing in metrics on development tooling to allow problems to be easily surfaced
  • Prioritizing clear documentation on patterns for application developers to enable consistency in adoption and use of Apache Kafka
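Two of these lessons, the strict 1:1 contract and metrics on tooling, can be illustrated together. The article does not spell out what the 1:1 contract is between; the sketch below assumes it means one message type per topic. `TopicRegistry` and the event classes are hypothetical, and a `collections.Counter` stands in for a real metrics client.

```python
from collections import Counter
from typing import Dict


class UserCreated:
    """Hypothetical message type."""


class UserDeleted:
    """Hypothetical message type."""


class TopicRegistry:
    """Enforces one message type per topic and counts accept/reject decisions."""

    def __init__(self) -> None:
        self._types: Dict[str, type] = {}
        self.metrics: Counter = Counter()  # stand-in for a metrics client

    def register(self, topic: str, message_type: type) -> None:
        """Bind a topic to exactly one message type; rebinding is an error."""
        existing = self._types.get(topic)
        if existing is not None and existing is not message_type:
            raise ValueError(
                f"topic {topic!r} is already bound to {existing.__name__}"
            )
        self._types[topic] = message_type

    def allows(self, topic: str, message: object) -> bool:
        """Return True only if the message is exactly the topic's bound type."""
        expected = self._types.get(topic)
        ok = expected is not None and type(message) is expected
        # Record every decision so misuse shows up on a dashboard.
        self.metrics[(topic, "accepted" if ok else "rejected")] += 1
        return ok


registry = TopicRegistry()
registry.register("user.created", UserCreated)

accepted = registry.allows("user.created", UserCreated())
rejected = registry.allows("user.created", UserDeleted())
```

With one type per topic, anyone reading a topic name knows what its messages contain, and the rejection counter surfaces producers that drift from the contract.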

Finally, Boyle shared a new internal product, called Gaia, that the team was building to enable push-button creation of services according to Cloudflare’s best practices.

About the Author

Nsikan Essien

Nsikan works as an Engineering Manager at Field Energy. Cloud architectures, platform services and the development of effective teams are his main working interests. He is based in London.
