
Data transfer in Manhattan using RocksDB

source link: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2022/data-transfer-in-manhattan-using-rocksdb

Introduction

Manhattan is Twitter’s internally-developed distributed key-value store. It’s the default storage solution at Twitter for persistent real-time serving needs. It is used to serve all core nouns at Twitter including Tweets, Users, and Direct Messages (DMs). Previously, we wrote about Manhattan’s pluggable storage engine design and how we adopted RocksDB as the storage engine for all its read-write workloads.

In this post, we’ll talk in more detail about a performance and stability problem we encountered during the migration and how we solved it.

Data transfer in a storage system

Data transfer between storage machines is a routine process in any distributed system that stores persistent data. There are typically two reasons why this data transfer occurs:

  1. When you remove nodes from a cluster. For example, the hardware of a machine starts failing and it needs to be taken out of the cluster for maintenance. The data that was previously served by the node being removed now needs to be transferred to, and served from, a different node. 

  2. When you add nodes to a cluster. For example, the cluster is getting overloaded with traffic or data and needs additional capacity (that is, more nodes) to handle the workload. Data is transferred from the existing nodes to the newly added nodes to distribute it evenly across all the nodes in the cluster.

The faster a database cluster can complete this data transfer, the sooner it returns to a healthy state. This is because you don't want failing hardware to linger in your cluster for too long while removing nodes, and you want the ability to add capacity to your cluster quickly in the event of large traffic spikes.

Types of data transfer

In Manhattan, we refer to the process of transferring data from one node to another as streaming. Streaming can typically be done in two ways:

  1. Optimized Streaming (File-based transfer): In this case, the source node generates and sends ingestible [1] data files to the destination node. The destination node ingests those files and becomes the new owner of that piece of data. 

  2. Generic Streaming (Record-based transfer): In this case, the source node iterates through all its data and sends individual records to the destination node. The destination node writes these records to its storage as it receives them, and at the end of the transfer it becomes the new owner of that piece of data.

While optimized streaming is generally faster and more efficient, generic streaming becomes the more viable option if the cost of generating the ingestible data files is high. This was exactly the case for us, and we'll expand on it in the next section.
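To make the record-based path concrete, here is a minimal sketch of what generic streaming can look like against RocksDB. It is illustrative only: the transport hook and batching are hypothetical stand-ins, since this post doesn't describe Manhattan's actual streaming protocol.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Source side: iterate over the locally owned data in key order and ship
// each record to the destination. SendRecord is a hypothetical transport hook.
void StreamRecordsOut(rocksdb::DB* source_db,
                      void (*SendRecord)(const std::string& key,
                                         const std::string& value)) {
  std::unique_ptr<rocksdb::Iterator> it(
      source_db->NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    SendRecord(it->key().ToString(), it->value().ToString());
  }
}

// Destination side: every received record becomes an ordinary write, which is
// why a fast generic stream looks just like a spike in frontend write traffic.
void ApplyStreamedRecords(
    rocksdb::DB* dest_db,
    const std::vector<std::pair<std::string, std::string>>& records) {
  rocksdb::WriteBatch batch;
  for (const auto& kv : records) {
    batch.Put(kv.first, kv.second);
  }
  dest_db->Write(rocksdb::WriteOptions(), &batch);
}
```

The important point is that the destination applies these records through its normal write path (memtable and write-ahead log), so there is no shortcut around the flush and compaction work they generate.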

Data transfer using RocksDB

During the RocksDB adoption, we started with generic streaming when we migrated data from our older SSTable/MhBtree storage engines to the RocksDB storage engine, as mentioned in an earlier blog post. Because each storage engine stores data in its own file format, optimized streaming wouldn't work: the source's files wouldn't be ingestible at the destination node.

However, generic streaming was slow, and it took a long time to add and remove nodes from the cluster. We sped it up as much as we could until we started encountering write-stalls on the destination node. This happened because, to the destination, generic streaming looks like a spike in frontend write traffic: the data arrives in the form of individual write requests. We expected that switching to optimized streaming, once the data migration was complete, would speed things up and keep us from running into blockers such as write-stalls. Unfortunately, we faced other issues while implementing optimized streaming.
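The write-stalls themselves come from RocksDB's built-in backpressure: when flushes and compactions can't keep up with incoming writes, RocksDB first throttles writes and then stops them. This post doesn't say which knobs we tuned, so the options below are purely illustrative of the settings that control this behavior; the values are examples, not Manhattan's production configuration.

```cpp
#include "rocksdb/options.h"

// Illustrative only: RocksDB options that govern when writes are slowed or
// stopped on a node absorbing a heavy write burst.
rocksdb::Options StallTuningExample() {
  rocksdb::Options options;
  // Writes are slowed once this many L0 files accumulate...
  options.level0_slowdown_writes_trigger = 20;
  // ...and stopped entirely at this threshold.
  options.level0_stop_writes_trigger = 36;
  // Extra memtables give flushes headroom during a burst of writes.
  options.max_write_buffer_number = 4;
  // Pending compaction debt also triggers slowdowns and stops.
  options.soft_pending_compaction_bytes_limit = 64ull << 30;   // 64 GB
  options.hard_pending_compaction_bytes_limit = 256ull << 30;  // 256 GB
  // Rate (bytes/sec) that writes are throttled to while in the slowdown state.
  options.delayed_write_rate = 16 << 20;  // 16 MB/s
  return options;
}
```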

RocksDB recommends generating ingestible SST files by iterating over the entire database and writing the data out with rocksdb::SstFileWriter, because the live SST files the database itself uses are not directly ingestible by another database. Iterating this way produces a set of SST files that contain all the data in increasing key order, and those files can then be transferred to another database for ingestion. Ingesting such SST files improves load time considerably, since there are no duplicates and all the data is sorted. However, since the file generation time was considerable, we decided to invest in our existing generic streaming solution first while prototyping optimized streaming for RocksDB.
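A minimal sketch of that recommended flow, using the public RocksDB API, might look like the following. Error handling is abbreviated, the actual transfer of the generated files between nodes is omitted, and the function names are ours for illustration, not Manhattan's.

```cpp
#include <memory>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

// Source node: dump the database (already iterated in key order) into one
// ingestible SST file.
rocksdb::Status WriteIngestibleFile(rocksdb::DB* source_db,
                                    const rocksdb::Options& options,
                                    const std::string& sst_path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(sst_path);
  if (!s.ok()) return s;

  std::unique_ptr<rocksdb::Iterator> it(
      source_db->NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    s = writer.Put(it->key(), it->value());
    if (!s.ok()) return s;
  }
  return writer.Finish();  // Seals the file so it can be ingested elsewhere.
}

// Destination node: ingest the transferred file(s) directly into the DB,
// bypassing the memtable/WAL write path.
rocksdb::Status IngestFiles(rocksdb::DB* dest_db,
                            const std::vector<std::string>& sst_paths) {
  rocksdb::IngestExternalFileOptions ingest_opts;
  ingest_opts.move_files = true;  // Move rather than copy when possible.
  return dest_db->IngestExternalFile(sst_paths, ingest_opts);
}
```

Ingestion adds the files directly to the LSM tree rather than pushing every record through the memtable and write-ahead log, which is where the load-time win over record-by-record writes comes from.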

Speeding up Generic Streaming

Let’s lay out some background before we jump into how we optimized our existing solution.

