Cloud infrastructure at Grubhub

source link: https://bytes.grubhub.com/cloud-infrastructure-at-grubhub-94db998a898a

Moving to cloud infrastructure at Grubhub enabled a major technical evolution for our high-growth, ever-changing e-commerce juggernaut. I’ll share some history and an overview of where we are now.

DISCLAIMER: We’re not announcing or releasing anything open source… sorry. We are sharing information on our architecture, tech, frameworks, and some code in the hope that our experiences may inspire or help others.

“Grubhub is the nation’s leading online and mobile food ordering company dedicated to connecting hungry diners with local takeout restaurants.” [taken from the official investor info page on our site]

I’ll add that, on a technical level, we’re a service-oriented platform that primarily operates out of multiple AWS data centers (regions), and we are continually reducing our reliance on our own data centers and the monoliths they house (not much relies on them today). All the while, we continue to grow, acquire, integrate third parties, work with partners, and add functionality our customers want.

This is the Grubhub of today. A few years ago, however, the story was vastly different. Grubhub and Seamless were two separate companies in direct competition. Both had significant data center footprints: largely monolithic platforms that relied heavily on SQL variants. Seamless ran a .NET + MS SQL stack, while Grubhub ran Java + MySQL. Our MS SQL installation was particularly expensive, since licensing is per core and we had a lot of cores! More importantly, in both cases we could see the ceiling of what scaling up could achieve, and it was approaching all too quickly given the growth of the business.

We set out to build a new platform. Our goal was a highly scalable, highly available system that could grow with traffic, tolerate network and infrastructure outages, and keep up with our expanding business needs. This led us to a few key decisions:

  1. Service oriented architecture. Bye bye monoliths.
  2. Build on a public cloud. We chose AWS while making efforts to minimize lock-in.
  3. Multiple hot datacenters to withstand localized outages and network problems.
  4. Apache Cassandra as our primary persistent data store.
  5. Java as our primary language.
  6. Continuous delivery — no more code freeze/release windows. We wanted to be able to release any part of the system at any time without affecting availability.

No big deal: we’ll just go from data centers, application servers, and good old SQL to the cloud, services, and NoSQL in the form of Cassandra. We’ll share more in later articles on how this was done safely and largely without affecting our customers, along with the lessons we learned, but the short version is that it was done step by step. We set a goal architecture, figured out which pieces could be rebuilt the new way first, and proxied/replicated back to the old systems where necessary until we could safely shut them off and dance on their graves. Actually, it was more of a piñata-destroying party.

Moving from a monolith and its deployment model to a set of (micro)services can mean an explosion of languages, configurations, frameworks, and approaches to solving common problems (e.g., how services talk to each other). That is inefficient and hurts performance, stability, and operations. To avoid it, we decided to standardize where possible: we wanted most of our engineers to focus on building business functionality, not infrastructure.

Firstly, we built out base frameworks for building highly available, distributed services. They cover the following:

  1. service base (startup, config, discovery, RESTful APIs, RPC, threading, DI, simulation, logging, metrics, events); a sketch of the idea follows this list
  2. messaging
  3. leader election
  4. supervision (time bucketed, multi-data center reliability built on leader election for tasks that require it — one of our keys to routing around network and region-specific outages)
  5. clients for standard infrastructure (e.g. Cassandra, Elasticsearch)
  6. service providers (for third party interactions, e.g. SMS, email, fax — we try to have at least two and exercise all with some traffic to meet the hot-hot requirement)
  7. big testing base (gatling.io gives us code reuse for testing scenarios, ability to generate load, ramp up, and randomize. Base framework includes authorization and configuration)
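
To make the first item (the service base) more concrete, here is a minimal, hypothetical sketch of the pattern: a small base class that owns startup, config, and metrics, with stubs where discovery registration and the HTTP layer would sit, so a concrete service only declares routes and business logic. None of these class or method names are Grubhub’s actual framework API; they are illustrative assumptions.

```java
// Hypothetical sketch of the "service base" idea. ServiceBase, MenuService,
// and their methods are illustrative stand-ins, not Grubhub's real framework.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

abstract class ServiceBase {
    protected final Map<String, String> config;
    protected final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    protected ServiceBase(Map<String, String> config) {
        this.config = config; // a real framework would also handle DI, logging, etc.
    }

    // Subclasses declare their RESTful/RPC routes; a real base framework would
    // also advertise them (with traffic weight and security demands) to discovery.
    protected abstract void registerRoutes();

    // Standard metrics primitive provided by the base, not by each service.
    protected void count(String metric) {
        counters.computeIfAbsent(metric, k -> new LongAdder()).increment();
    }

    public void start() {
        registerRoutes();
        // A real framework would start the HTTP server, register with the
        // discovery service, and expose health checks and metrics here.
        System.out.println("started " + getClass().getSimpleName()
                + " v" + config.getOrDefault("service.version", "1"));
    }
}

// A concrete service only supplies its routes and handlers.
class MenuService extends ServiceBase {
    MenuService(Map<String, String> config) { super(config); }

    @Override
    protected void registerRoutes() {
        // e.g. GET /menus/{restaurantId} -> handler
        count("routes.registered");
    }

    public static void main(String[] args) {
        new MenuService(Map.of("service.version", "2")).start();
    }
}
```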

This gave us the primitives we needed to build the platform. As you can see in the diagram below, it’s composed of a number of services (Platform Router Proxy, Messaging Router, Config, Discovery, Security), all of which (with the exception of Discovery, where we use Netflix’s Eureka) were built on the frameworks above.

[Diagram: platform services]

Key features:

  1. Enables continuous delivery through traffic routing.
  2. Is highly available and performant.
  3. Operates in multiple data centers.
  4. Platform services are themselves built on the base frameworks mentioned above.

On our platform, services are versioned and can be started in “real” or “simulated” mode. They also produce a separate tester. Services advertise their routing percentage (traffic weight), API/RPC routes, and security demands via the discovery service. The platform router proxy respects traffic weights, applies security demands, and proxies traffic to appropriate service instances. The messaging router does the same job as the router proxy but for messaging.
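
As a rough illustration of what such an advertisement might carry, here is a hypothetical descriptor with a version, traffic weight, mode, routes, and security demands, flattened into the kind of free-form metadata map a discovery client typically accepts. The field names and shape are assumptions for the sketch, not Grubhub’s actual wire format.

```java
// Hypothetical sketch of what a service instance might advertise to discovery.
// Field names and structure are illustrative assumptions only.
import java.util.List;
import java.util.Map;

record ServiceAdvertisement(
        String serviceName,
        String version,          // e.g. "2.3.1"
        int trafficWeight,       // 0-100; the router proxy honors this
        boolean simulated,       // started in "real" or "simulated" mode
        List<String> routes,     // e.g. "GET /menus/{restaurantId}"
        Map<String, String> securityDemands) {  // e.g. route -> required role

    // Discovery clients often accept free-form metadata; a descriptor like
    // this could be flattened into such a map when the instance registers.
    Map<String, String> toMetadata() {
        return Map.of(
                "version", version,
                "trafficWeight", String.valueOf(trafficWeight),
                "simulated", String.valueOf(simulated),
                "routes", String.join(",", routes));
    }

    public static void main(String[] args) {
        ServiceAdvertisement ad = new ServiceAdvertisement(
                "menu-service", "2.3.1", 10, false,
                List.of("GET /menus/{restaurantId}"),
                Map.of("GET /menus/{restaurantId}", "diner"));
        System.out.println(ad.toMetadata());
    }
}
```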

This is very test friendly, and it makes the platform safe for continuous delivery, because traffic weights apply at every level of the graph of services that provide a piece of functionality. We can also specify versions to compose test scenarios that exercise backward and forward compatibility and introduce failures throughout.
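
To show the general technique of weight-respecting routing with an optional version pin (as a test scenario might request), here is a small hypothetical selection routine. It is a sketch of weighted random choice under those assumptions, not the Platform Router Proxy’s actual logic.

```java
// Hypothetical sketch: pick an instance in proportion to its advertised
// traffic weight, optionally restricted to a pinned version.
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

record Instance(String host, String version, int trafficWeight) {}

final class WeightedRouter {

    static Optional<Instance> choose(List<Instance> instances, Optional<String> pinnedVersion) {
        // Keep only instances that match the pin (if any) and take traffic.
        List<Instance> candidates = instances.stream()
                .filter(i -> pinnedVersion.map(v -> v.equals(i.version())).orElse(true))
                .filter(i -> i.trafficWeight() > 0)
                .toList();

        int total = candidates.stream().mapToInt(Instance::trafficWeight).sum();
        if (total == 0) return Optional.empty();

        // Weighted random choice over the remaining candidates.
        int pick = ThreadLocalRandom.current().nextInt(total);
        for (Instance i : candidates) {
            pick -= i.trafficWeight();
            if (pick < 0) return Optional.of(i);
        }
        return Optional.empty(); // unreachable when total > 0
    }

    public static void main(String[] args) {
        List<Instance> pool = List.of(
                new Instance("10.0.0.1", "2.3.0", 90),
                new Instance("10.0.0.2", "2.4.0", 10)); // canary gets ~10% of traffic
        System.out.println(choose(pool, Optional.empty()));
        System.out.println(choose(pool, Optional.of("2.4.0"))); // pin for a test scenario
    }
}
```

Applying the same selection at every hop is what lets a small canary weight, or a version pinned for a test scenario, flow through the whole graph of services.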

Here are a few standard pieces of infrastructure we’re using that are worth a mention:

  1. Docker — We containerize all services, testers and utilities. It’s our standard unit of deployment.
  2. Netflix Eureka — Provides service discovery within a platform instance (it’s the discovery service on the diagram above).
  3. Apache Cassandra — As mentioned above, this is our standard store for persistent data (a small multi-data-center client sketch follows this list).
  4. Elasticsearch — Used by a number of services that require search and more complex querying capabilities.
  5. Datadog — All services send custom and standard metrics (provided by our base frameworks). We have some automation that builds dashboards and alerts in Datadog, but we also build them ad hoc by hand.
  6. Splunk — All services send access and application logs here. We try to use this as a fallback and for investigations rather than building alerts and complex queries from logs (that’s what metrics are for).
  7. Selected AWS services (SNS, SQS, S3 and a few more are commonly used).
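
As one small example of how the multi-data-center requirement shows up in a standard client, here is a sketch using the DataStax Java driver 3.x: it prefers the local data center and uses LOCAL_QUORUM so reads and writes do not depend on a remote region. The contact points, data center name, and schema are placeholders, and this is not necessarily how Grubhub’s Cassandra client framework is configured.

```java
// Sketch of a multi-data-center-aware Cassandra client (DataStax driver 3.x).
// Contact points, DC name, keyspace, and table are placeholders.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class CassandraClientSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.1.10", "10.0.1.11")       // placeholder seed nodes
                .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("us-east-1")                  // prefer the local region
                        .build())
                .withQueryOptions(new QueryOptions()
                        // LOCAL_QUORUM keeps requests local, tolerating a remote DC outage.
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();

        try (Session session = cluster.connect("orders")) {        // placeholder keyspace
            Row row = session.execute(
                    "SELECT status FROM orders WHERE order_id = ?", "o-123").one();
            System.out.println(row == null ? "not found" : row.getString("status"));
        } finally {
            cluster.close();
        }
    }
}
```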

Our platform has been in production since early 2015. Since then, we have incurred zero downtime. The platform has been the foundation on which Grubhub systems have continued to grow and scale, and we continue to evolve it, particularly as open source solutions mature to meet our needs. The platform services were primarily built, and continue to be maintained, by a small group of engineers (two-pizza teams) and our world-class SREs, with peer review and contributions from other engineering teams across Grubhub.

We’ll share more specifics in future, more in-depth articles. For now, if you have any questions, leave us a comment here.

