
Evaluating Google Cloud Spanner and BigTable

source link: https://eng.fitbit.com/evaluating-google-cloud-spanner-bigtable/

Motivation

As one can imagine, the millions of active Fitbit users generate a lot of data. All that data has to be processed and stored so that users can look back on historical step counts, sleep, etc. One such storage service, which I focused on as an intern on the Data Storage team, is the Device Communication Log (DCL) service. The messages sent through the DCL service contain everything the tracker collects, including user activity data and tracker state. DCL data is currently stored in a sharded MySQL database. While the relational nature of MySQL makes queries straightforward, the Data Storage team dedicates a fair amount of time to maintaining our sharding infrastructure to scale MySQL for our needs.

Overview of DCL

Any storage solution for DCL has to support common read and write operations our apps need:

  1. Saving a sync entry to the database
  2. Reading a sync entry based on a unique log entry ID
  3. Reading sync entries for a given user/device in a time window
  4. Deleting all sync entries for the user/device

The DCL workflow is write-heavy: once data is written to the database, it isn’t often read. When it is read, it is often the most recent data.
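
As a purely illustrative sketch (not Fitbit's actual code), those four operations map onto a small storage interface like the one below; every type and field name here is a hypothetical placeholder.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Protocol


@dataclass
class SyncEntry:
    log_entry_id: str   # unique per sync message
    device_id: int
    user_id: int
    created_at: datetime
    payload: bytes      # everything the tracker collected for this sync


class DclStore(Protocol):
    def save(self, entry: SyncEntry) -> None: ...                            # operation 1
    def get_by_log_entry_id(self, log_entry_id: str) -> Optional[SyncEntry]: ...  # operation 2
    def list_for_device(self, device_id: int,
                        start: datetime, end: datetime) -> Iterable[SyncEntry]: ...  # operation 3
    def delete_for_device(self, device_id: int) -> None: ...                 # operation 4
```

Each candidate store below is evaluated on how naturally it supports these four calls.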

Spanner and BigTable

With Fitbit moving its infrastructure to the Google Cloud Platform (GCP), I evaluated two Google Cloud data stores, Spanner and BigTable, as alternatives to MySQL. Spanner and BigTable are fully managed services, with routing and sharding handled internally. Spanner is a scalable relational database service with transactional consistency, SQL query support, and secondary indexes. BigTable is a scalable NoSQL database. Other options within GCP, such as Cloud SQL and Cloud Storage, weren't evaluated.

Design

When designing our schema, we wanted to avoid write hotspots and large table scans for reads. User IDs and device IDs are generated in monotonically increasing order, and it is reasonable to think that newer users sync more often, contributing to more traffic.
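
The post doesn't spell out the exact reversal scheme, but a minimal sketch, assuming the IDs are reversed digit-by-digit after zero-padding, looks like this:

```python
def reverse_id(numeric_id: int, width: int = 19) -> str:
    """Reverse the zero-padded decimal digits of an ID.

    IDs are assigned in increasing order, so the low-order digits change
    fastest; putting them first spreads consecutive IDs (and therefore
    writes from the newest, most active devices) across the keyspace
    instead of concentrating them in one hot key range.
    """
    return str(numeric_id).zfill(width)[::-1]


# Consecutive device IDs end up far apart once reversed:
#   reverse_id(1000234) -> "4320001000000000000"
#   reverse_id(1000235) -> "5320001000000000000"
```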

Spanner

Our Spanner model has a single table with {reversed device ID, unique log entry ID} as the primary key. The device ID is reversed to avoid hotspots. The table has two secondary indexes: one on {reversed user ID, creation time} and the other on {reversed device ID, creation time}.

Figure 1: Spanner table
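
A hedged sketch of what this schema could look like with the Python Spanner client; the table, index, and column names are placeholders chosen to match Figure 1, not the actual production schema.

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("dcl-instance").database("dcl")  # illustrative names

database.update_ddl([
    """CREATE TABLE DeviceCommLog (
         ReversedDeviceId STRING(64) NOT NULL,
         LogEntryId       STRING(64) NOT NULL,
         ReversedUserId   STRING(64) NOT NULL,
         CreatedAt        TIMESTAMP  NOT NULL,
         Payload          BYTES(MAX)
       ) PRIMARY KEY (ReversedDeviceId, LogEntryId)""",
    # Secondary index for time-window reads by user.
    "CREATE INDEX ByUserCreated ON DeviceCommLog (ReversedUserId, CreatedAt)",
    # Secondary index for time-window reads by device.
    "CREATE INDEX ByDeviceCreated ON DeviceCommLog (ReversedDeviceId, CreatedAt)",
]).result()

# Reading one sync entry is a point lookup on the primary key.
with database.snapshot() as snapshot:
    rows = snapshot.read(
        table="DeviceCommLog",
        columns=("CreatedAt", "Payload"),
        keyset=spanner.KeySet(keys=[["4320001000000000000", "log-entry-123"]]),
    )
```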

BigTable

We modeled data in BigTable in two ways – what we call “BigTable Tall” and “BigTable Versions”.

BigTable Tall

BigTable Tall has a main table with the row key {reversed device ID # timestamp # log entry ID}.

Figure 2: BigTable Tall : Main table
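
A minimal sketch of this layout with the Python BigTable client, assuming an epoch-millisecond timestamp in the row key; the project, instance, table, and column-family names are illustrative.

```python
from datetime import datetime, timezone

from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # illustrative project
table = client.instance("dcl-instance").table("dcl_tall")


def tall_row_key(reversed_device_id: str, ts: datetime, log_entry_id: str) -> bytes:
    # Row key from Figure 2: reversed device ID # timestamp # log entry ID.
    # Zero-padded epoch millis keep a device's entries sorted by time.
    return f"{reversed_device_id}#{int(ts.timestamp() * 1000):013d}#{log_entry_id}".encode()


def save_sync_entry(reversed_device_id: str, log_entry_id: str, payload: bytes) -> None:
    row = table.direct_row(
        tall_row_key(reversed_device_id, datetime.now(timezone.utc), log_entry_id))
    row.set_cell("sync", "payload", payload)
    row.commit()


def read_window(reversed_device_id: str, start: datetime, end: datetime):
    # A time-window read is a contiguous range scan under the device's key prefix.
    start_key = f"{reversed_device_id}#{int(start.timestamp() * 1000):013d}".encode()
    end_key = f"{reversed_device_id}#{int(end.timestamp() * 1000):013d}".encode()
    return table.read_rows(start_key=start_key, end_key=end_key)
```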

We also created a second table for lookups by user ID. It holds only a reference to the device ID, so each query against it leads to a second query on the main table.


Figure 3: BigTable Tall : User ID lookup table
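
The post doesn't describe the lookup table's exact layout, so this continuation of the sketch above simply assumes one row per reversed user ID whose cells hold device IDs; the important part is the two-step read it forces.

```python
lookup = client.instance("dcl-instance").table("dcl_user_lookup")  # illustrative name


def read_window_by_user(reversed_user_id: str, start: datetime, end: datetime):
    # Step 1: resolve the user to their device ID(s) via the lookup table.
    user_row = lookup.read_row(reversed_user_id.encode())
    if user_row is None:
        return []
    device_ids = [cell.value.decode() for cell in user_row.cells["ref"][b"device_id"]]
    # Step 2: run the usual device-keyed range scan on the main table.
    return [read_window(device_id, start, end) for device_id in device_ids]
```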

BigTable Versions

This model treats each sync entry as a different “version” of data for a device. The row key in the main table is simply the reversed device ID. Similar to BigTable Tall, there is a second User ID-based lookup table.

Figure 4: BigTable Versions : Main table
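
A sketch of the same two operations under this model, again with placeholder names; it relies on BigTable's native cell versioning, so the column family's garbage-collection policy must be configured to retain enough versions.

```python
from google.cloud.bigtable import row_filters

versions = client.instance("dcl-instance").table("dcl_versions")  # illustrative name


def save_sync_entry_versioned(reversed_device_id: str, payload: bytes, ts: datetime) -> None:
    # Every sync entry is written as another timestamped version of the same cell.
    row = versions.direct_row(reversed_device_id.encode())
    row.set_cell("sync", "payload", payload, timestamp=ts)
    row.commit()


def read_window_versioned(reversed_device_id: str, start: datetime, end: datetime):
    # A time-window read filters cell versions by timestamp within the single row.
    ts_filter = row_filters.TimestampRangeFilter(
        row_filters.TimestampRange(start=start, end=end))
    row = versions.read_row(reversed_device_id.encode(), filter_=ts_filter)
    return [] if row is None else [cell.value for cell in row.cells["sync"][b"payload"]]
```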

Performance and Cost


Figure 5: 99th percentile latency for each operation

Performance testing was done by running JMeter scripts remotely on GCE instances in the same Google Cloud region as the databases (to minimize network latency). Testing was completed only against the Spanner and BigTable implementations; comparable tests against the current MySQL setup were still in progress.

As Figure 5 shows, BigTable Versions performed poorly on p99 latency for reads by user ID; BigTable doesn't seem to retrieve a specific version/timestamp efficiently here. Spanner significantly outperformed both BigTable implementations for getting an entry by device ID and log entry ID, because in Spanner this is a row lookup on the primary key, whereas both BigTable layouts have to scan a device's entries to find the matching log entry ID. A possible improvement (at the cost of more storage) is to make the row key of the main table similar to Spanner's primary key and create two secondary tables allowing for lookups based on device and user ID.

To estimate cost, I determined how much the “reported” size of the data would differ from MySQL. Both Spanner and BigTable store data in Colossus, which uses Reed-Solomon error correction to improve fault tolerance and adds to storage space. I found that for each GB of data in MySQL, the data would take 1.5 GB in Spanner and 1.2 GB in BigTable. Taking this into consideration, I found that for our storage needs, Spanner would be 25% less expensive, and either BigTable approach would be 30% less expensive than our current MySQL setup.
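
The storage-expansion part of that estimate is simple arithmetic; the sketch below only applies the measured 1.5x and 1.2x multipliers (per-GB prices and node costs, which turn these figures into the 25% and 30% savings, aren't reproduced here).

```python
def projected_storage_gb(mysql_gb: float) -> dict:
    # Measured expansion ratios: 1 GB in MySQL ~ 1.5 GB in Spanner, 1.2 GB in BigTable.
    return {"spanner_gb": mysql_gb * 1.5, "bigtable_gb": mysql_gb * 1.2}


# e.g. a 10 TB MySQL dataset would occupy roughly 15 TB in Spanner and 12 TB in BigTable.
print(projected_storage_gb(10_000))  # {'spanner_gb': 15000.0, 'bigtable_gb': 12000.0}
```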

Conclusions

Our general takeaway is that a BigTable implementation might be best for performance-sensitive services that have simple lookups or can tolerate data being potentially inconsistent across tables. Spanner may perform slightly worse and cost more than BigTable but will provide strong consistency and be easy to reason about. Either would cut costs compared to our current MySQL setup. Most importantly, using one of these storage systems would make services easier to maintain, allowing us to focus on business critical work instead of maintaining MySQL shards.

The Internship Experience

Coming into this summer, my knowledge of data storage was fairly limited. I could give a textbook definition of ACID transactions, but didn’t know how these concepts applied to real production systems. Needless to say, I got hands-on experience around storing data effectively. I learned that knowing things is helpful: having taken a class that covered databases allowed me to dive deeper into my project because I didn’t first have to learn the basics. But I also learned, perhaps more importantly, that not knowing things is fine too. On that note, I would like to thank my mentor, Devika Karnik; my manager, Bryce Yan; and the rest of the Data Storage team for answering all my questions and being so welcoming this summer! I learned a lot, but the true highlight was getting to know all the wonderful people I worked with.

About the Author

Josh Rosenkranz is currently a senior at MIT, a member of the class of 2019 studying Computer Science and Electrical Engineering. He is also a captain of MIT's varsity Cross Country and Track and Field teams. His interests in running and computer science made it a natural fit to spend the summer of 2018 as an intern on the Data Storage team at Fitbit. While working on his internship project, Josh also won the 5K race at the 2018 San Francisco Marathon.

