FaRM: Fast Remote Memory
26 Jul 2016

FaRM is a main memory distributed computing platform that provides distributed transactions with strict serializability, high performance, durability, and high availability.
To scale out, FaRM distributes objects across machines in a data center and allows transactions to span any number of machines. To reduce CPU overhead, it uses one-sided RDMA (Remote Direct Memory Access) operations.
Everything is stored in DRAM; durability is provided by attaching batteries to the power supply units and writing the contents of DRAM to SSD when power fails.
Programming model and architecture
FaRM provides the abstraction of a global shared address space that spans machines in a cluster. Each machine runs application threads and stores objects in the address space. The FaRM API provides transparent access to local and remote objects within transactions. A transaction can be started at any time, and the thread that starts it acts as the coordinator. That thread can execute arbitrary logic as well as read, write, allocate, and free objects. At the end of execution, the thread invokes FaRM to commit the transaction.
FaRM uses optimistic concurrency control. Writes are buffered locally and are made visible to other transactions only on successful commit.
FaRM provides strict serializability of all successfully committed transactions. Individual reads are atomic: they read only committed data, successive reads of the same object return the same data, and reads of objects written by the transaction return the latest value written. Reads of different objects are not guaranteed to be atomic, but a transaction that observes such an inconsistency will not commit, so strict serializability still holds for committed transactions.
The FaRM API also provides lock-free reads, which are optimized single-object read-only transactions, and locality hints, which enable programmers to co-locate related objects on the same set of machines.
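To make the model concrete, here is a minimal sketch in Go of how such a transaction might look. The names (Begin, Read, Write, Commit) and types are hypothetical stand-ins; FaRM’s actual API is a C++ interface and differs in detail.

```go
package main

import "fmt"

type Addr uint64

// Tx records the versions/addresses of reads for later validation and
// buffers writes locally; writes become visible only on commit.
type Tx struct {
	reads  map[Addr]uint64 // object address -> version seen
	writes map[Addr][]byte // buffered new values
}

func Begin() *Tx {
	return &Tx{reads: map[Addr]uint64{}, writes: map[Addr][]byte{}}
}

// Read would use local memory access or a one-sided RDMA read; here it
// only records that the object was read at some version.
func (t *Tx) Read(a Addr) []byte {
	t.reads[a] = 1 // placeholder version
	return nil
}

// Write is buffered locally until commit.
func (t *Tx) Write(a Addr, val []byte) { t.writes[a] = val }

// Commit would run the lock/validate/commit protocol described later;
// it may fail, in which case the transaction aborted.
func (t *Tx) Commit() error { return nil }

func main() {
	tx := Begin()
	tx.Read(0x1000)
	tx.Write(0x2000, []byte("new value"))
	if err := tx.Commit(); err != nil {
		fmt.Println("aborted:", err)
	}
}
```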
Each machine runs FaRM in a user process with a kernel thread pinned to each hardware thread. Each kernel thread runs an event loop that executes application code and polls the RDMA completion queues.
The configuration can change over time as machines fail or are added to the cluster. FaRM represents the configuration as a tuple ⟨i, S, F, CM⟩, where:
- i: a unique, monotonically increasing 64-bit configuration identifier;
- S: the set of machines in the configuration;
- F: a mapping from machines to failure domains that are expected to fail independently (e.g., different racks);
- CM: the configuration manager, with CM ∈ S.
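As a sketch, the tuple transcribes directly into a struct; the Go field names below are mine, while the contents follow the definitions above.

```go
package main

import "fmt"

type MachineID string

// Configuration transcribes the tuple ⟨i, S, F, CM⟩ from the text.
type Configuration struct {
	ID             uint64               // i: unique, monotonically increasing identifier
	Machines       map[MachineID]bool   // S: set of machines in the configuration
	FailureDomains map[MachineID]string // F: machine -> failure domain (e.g., rack)
	CM             MachineID            // configuration manager; CM must be in S
}

func main() {
	c := Configuration{
		ID:             1,
		Machines:       map[MachineID]bool{"A": true, "B": true, "C": true, "D": true},
		FailureDomains: map[MachineID]string{"A": "rack1", "B": "rack1", "C": "rack2", "D": "rack2"},
		CM:             "D",
	}
	fmt.Println(c.Machines[c.CM]) // true: CM ∈ S
}
```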
ZooKeeper is used as the coordination service to ensure that machines agree on the current configuration and to store it. Lease management, failure detection, and recovery coordination do not use ZooKeeper.
The global address space consists of 2 GB regions, each replicated on f + 1 machines, where f is the desired fault tolerance. Objects are always read from the primary copy of the containing region, using local memory access if the region is on the local machine and a one-sided RDMA read otherwise.
The CM is contacted when a machine wants to allocate a new region. The CM assigns a region identifier from a monotonically increasing counter and selects replicas for the region. Replica selection takes into account the number of regions stored on each machine and the available storage, and it places each replica in a different failure domain. The region may be co-located with a target region if the application specifies a locality constraint. A two-phase protocol is used to coordinate the allocation across all replicas: the CM sends a prepare message to all replicas, and they report whether they allocated the region successfully. If all reply with success, the CM sends a commit message to all of them.
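Here is a minimal Go sketch of that two-phase allocation, assuming stand-in types for the CM and its replicas; real FaRM exchanges these messages over the network and handles failures via reconfiguration.

```go
package main

import (
	"errors"
	"fmt"
)

type RegionID uint64
type MachineID string

// replica models a selected replica; prepare/commit stand in for the
// messages the CM would actually send to real machines.
type replica struct {
	id      MachineID
	prepare func(RegionID) bool // true if the replica allocated the region
	commit  func(RegionID)
}

// CM holds the allocation state: a monotonically increasing region
// counter and the replicas chosen for the new region (balanced load,
// enough free space, distinct failure domains).
type CM struct {
	nextRegion RegionID
	replicas   []replica
}

// AllocateRegion runs the two-phase protocol from the text: prepare at
// every replica, then commit only if all of them reported success.
func (cm *CM) AllocateRegion() (RegionID, error) {
	cm.nextRegion++
	id := cm.nextRegion
	for _, r := range cm.replicas {
		if !r.prepare(id) {
			return 0, errors.New("replica failed to allocate region; aborting")
		}
	}
	for _, r := range cm.replicas {
		r.commit(id)
	}
	return id, nil
}

func main() {
	ok := func(RegionID) bool { return true }
	noop := func(RegionID) {}
	cm := &CM{replicas: []replica{{"A", ok, noop}, {"B", ok, noop}}}
	fmt.Println(cm.AllocateRegion())
}
```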
Each machine also stores ring buffers that implement FIFO queues. The queues are used either as transaction logs or message queues.
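A minimal sketch of such a FIFO ring buffer, for intuition only: FaRM’s actual ring buffers are written remotely by senders via RDMA and polled by the receiver, which this single-machine version does not model.

```go
package main

import "fmt"

type ring struct {
	buf        []string
	head, tail int // head: next slot to read; tail: next slot to write
	count      int
}

func newRing(n int) *ring { return &ring{buf: make([]string, n)} }

// push appends a record at the tail; it fails when the buffer is full.
func (r *ring) push(rec string) bool {
	if r.count == len(r.buf) {
		return false
	}
	r.buf[r.tail] = rec
	r.tail = (r.tail + 1) % len(r.buf)
	r.count++
	return true
}

// pop consumes the oldest record at the head (FIFO order).
func (r *ring) pop() (string, bool) {
	if r.count == 0 {
		return "", false
	}
	rec := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.count--
	return rec, true
}

func main() {
	q := newRing(4)
	q.push("LOCK tx1")
	q.push("COMMIT-PRIMARY tx1")
	for rec, ok := q.pop(); ok; rec, ok = q.pop() {
		fmt.Println(rec)
	}
}
```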
The figure below shows the FaRM architecture:
[Figure: FaRM architecture. Each machine (A, B, C, …) runs application code on top of FaRM and stores regions, transaction logs, and message queues in non-volatile DRAM (NVRAM). Machine D acts as the CM and exchanges lease renewals with every other machine. Local reads use memory accesses; remote reads, transaction records, and messages use one-sided RDMA. ZooKeeper is the coordination service.]
Distributed transactions and replication
To improve performance, FaRM integrates the transaction and replication protocols. Both data and transaction logs use primary-backup replication in non-volatile DRAM, and unreplicated transaction coordinators communicate directly with primaries and backups. As mentioned before, FaRM uses optimistic concurrency control with read validation.
The timeline for a FaRM transaction is shown below:
[Figure: transaction timeline, consisting of an execution phase followed by the commit-phase steps Lock, Validate, Commit backups, Commit primaries, and Truncate.]
During the execution phase, transactions use one-sided RDMA to read objects and buffer writes locally. The coordinator records the versions and addresses of all accessed objects. For primaries and backups on the same machine as the coordinator, object reads and writes to the log use local memory accesses rather than RDMA. At the end of execution, FaRM attempts to commit the transaction by performing the following steps (a coordinator-side sketch follows the list):
- Lock: The coordinator writes a LOCK record to the log of each machine that is a primary for any written object. This record contains the versions and new values of all written objects on that primary, as well as the list of all regions with written objects. The primaries process the records by attempting to lock the objects at the specified versions using compare-and-swap, and they send back a message reporting whether all locks were taken. If any object’s version changed since it was read by the transaction, or if an object is currently locked by another transaction, locking fails and the coordinator aborts the transaction: it writes an abort record to all primaries and returns an error to the application.
- Validate: The coordinator then validates all objects that were read but not written by the transaction by reading their versions from their primaries. If any object has changed, validation fails and the transaction aborts.
- Commit backups: The coordinator writes a COMMIT-BACKUP record to the non-volatile log at each backup and then waits for an ack from the NIC (Network Interface Controller) hardware, without interrupting the backup’s CPU. The COMMIT-BACKUP log record has the same payload as a LOCK record.
- Commit primaries: After all COMMIT-BACKUP writes have been acked, the coordinator writes a COMMIT-PRIMARY record to the log at each primary. The coordinator reports completion to the application after receiving at least one hardware ack for a COMMIT-PRIMARY record, or after writing one locally. The primaries process these records by updating the objects, incrementing their versions, and unlocking them, which exposes the writes committed by the transaction.
- Truncate: Both backups and primaries keep the records in their logs until they are truncated. Truncation is done lazily by the coordinator after it receives acks from all primaries. Backups apply the updates to their copies of the objects at truncation time.
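A Go sketch of the coordinator’s side of these five steps, under the assumption that writeLog models a one-sided RDMA write into a machine’s non-volatile log (returning once the NIC acks); all names here are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAborted signals that the commit protocol aborted the transaction.
var ErrAborted = errors.New("transaction aborted")

// machine models a primary or backup; lockReply stands in for the
// primary's reply to a LOCK record.
type machine struct {
	writeLog  func(record string) bool
	lockReply func() bool
}

// tx carries what the coordinator recorded during execution.
type tx struct {
	writePrimaries []machine   // primaries of regions with written objects
	backups        []machine   // backups of those regions
	readsValid     func() bool // re-reads versions of read-only objects
}

// commit follows the five steps above, from the coordinator's side.
func commit(t *tx) error {
	// 1. Lock: LOCK records at every primary; abort if any version
	//    changed or an object is locked by another transaction.
	for _, p := range t.writePrimaries {
		if !p.writeLog("LOCK") || !p.lockReply() {
			return ErrAborted // coordinator also writes abort records
		}
	}
	// 2. Validate: versions of objects read but not written must be unchanged.
	if !t.readsValid() {
		return ErrAborted
	}
	// 3. Commit backups: COMMIT-BACKUP at every backup, waiting for the
	//    NIC hardware acks only (no backup CPU involvement).
	for _, b := range t.backups {
		if !b.writeLog("COMMIT-BACKUP") {
			return ErrAborted // a real coordinator would enter recovery here
		}
	}
	// 4. Commit primaries: one hardware ack is enough to report success.
	acked := false
	for _, p := range t.writePrimaries {
		acked = p.writeLog("COMMIT-PRIMARY") || acked
	}
	if !acked {
		return ErrAborted
	}
	// 5. Truncate: done lazily after acks from all primaries (not shown).
	return nil
}

func main() {
	ok := machine{writeLog: func(string) bool { return true }, lockReply: func() bool { return true }}
	t := &tx{writePrimaries: []machine{ok}, backups: []machine{ok}, readsValid: func() bool { return true }}
	fmt.Println(commit(t)) // <nil>: committed
}
```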
Correctness
Committed read-write transactions serialize at the point where their locks are acquired; committed read-only transactions serialize at the point of their last read. This is because the versions of all read and written objects at the serialization point are the same as the versions seen during execution.
Serializability in FaRM is also strict: the serialization point is always between the start of execution and the completion being reported to the application.
FaRM ensures serializability even in the case of failures. This is achieved by waiting for acks from all backups before writing COMMIT-PRIMARY. Imagine that the coordinator did not receive an ack from one backup but wrote COMMIT-PRIMARY anyway: the application would receive confirmation that the transaction succeeded, yet if all replicas that stored the modification then failed, and the only surviving backup were the one that never acked because it never received COMMIT-BACKUP, the modification that was once confirmed would be lost.
For COMMIT-PRIMARY, a single ack is enough to report success to the application. This is because FaRM assumes at most f failures for a region with f + 1 replicas, so at least one machine will retain a record attesting to the commit. And if no COMMIT-PRIMARY record was written at all when the primaries and backups failed, the LOCK records would survive, but there would be no record attesting that validation succeeded.
Failure recovery
Durability and high availability are provided by replicating the data. FaRM assumes that machines can fail by crashing but can recover (fail-recovery) without losing the contents of non-volatile DRAM.
FaRM provides durability for all committed transactions even if the entire cluster fails or loses power: the committed state is recovered from the regions and logs stored in non-volatile DRAM, as long as at most f replicas per object lose their contents. FaRM can maintain availability through failures and network partitions provided that a majority of the machines remain connected to each other and to a majority of the replicas in the ZooKeeper service, and that this partition contains at least one replica of each object.
Failure detection
FaRM uses leases to detect failures. Every machine other than the CM holds a lease at the CM, and the CM holds a lease at every other machine. If any of these leases expires, failure recovery is triggered.
FaRM uses dedicated queue pairs for leases to avoid having lease messages delayed in a shared queue behind other message types. It also uses a dedicated lease manager thread that runs at the highest user-space priority. This thread is not pinned to any hardware thread and it uses interrupts instead of polling to avoid starving critical OS tasks that are executed periodically on every hardware thread.
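A minimal sketch of the lease mechanism itself; the renewal interval and expiry rule below are illustrative, not the paper’s parameters, and the dedicated queue pairs and high-priority lease manager thread are not modeled.

```go
package main

import (
	"fmt"
	"time"
)

// lease tracks the last renewal received from the peer.
type lease struct {
	lastRenewal time.Time
	duration    time.Duration
}

func (l *lease) renew(now time.Time) { l.lastRenewal = now }

// expired reports whether a renewal failed to arrive in time, which
// would trigger failure recovery.
func (l *lease) expired(now time.Time) bool {
	return now.Sub(l.lastRenewal) > l.duration
}

func main() {
	l := &lease{lastRenewal: time.Now(), duration: 10 * time.Millisecond}
	time.Sleep(15 * time.Millisecond) // no renewal arrives
	if l.expired(time.Now()) {
		fmt.Println("lease expired: initiate reconfiguration")
	}
}
```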
Reconfiguration
Reconfiguration moves a FaRM instance from one configuration to the next, typically in response to a failure, and follows a fixed protocol.
FaRM implements what the authors call precise membership: machines only communicate with other machines in the same configuration, and messages from machines outside the configuration are ignored.
The reconfiguration steps are as follows:
- Suspect: If a lease for a machine expires at the CM, the CM suspects that the machine failed and initiates reconfiguration; it also starts blocking all external client requests. If a non-CM machine suspects that the CM failed due to a lease expiry, it first asks a small number of “backup CMs” to initiate reconfiguration, and if the configuration is unchanged after a timeout period it attempts the reconfiguration itself.
- Probe: The new CM issues an RDMA read to all machines in the configuration except the suspected machine; any machine whose read fails is also suspected. To avoid proceeding as a minority during a network partition, the new CM continues with reconfiguration only if it receives responses to a majority of the probes.
- Update configuration: After receiving replies to the probes, the new CM attempts to update the configuration data stored in ZooKeeper.
- Remap regions: Regions previously mapped to failed machines are reassigned to bring the number of replicas back to f + 1. If a primary failed, a surviving backup is promoted to be the new primary. If the CM detects that a region has lost all its replicas, or that there is no space to re-replicate regions, it signals an error.
- Send new configuration: After the CM has remapped all regions, it sends a NEW-CONFIG message to all machines in the new configuration. NEW-CONFIG also resets the lease protocol if the CM has changed.
- Apply new configuration: When a machine receives a NEW-CONFIG message with a configuration identifier greater than its own, it updates its current configuration identifier and its copy of the region mapping, and allocates space to hold any new region replicas assigned to it. From this point on, the machine rejects requests from machines that are not part of the new configuration and does not issue new requests to those machines; requests from external clients are also blocked. The machine then sends a NEW-CONFIG-ACK message to the CM (see the sketch after this list).
- Commit new configuration: After receiving NEW-CONFIG-ACK messages from all machines in the configuration, the CM waits to ensure that any leases granted in previous configurations to machines not present in the new configuration have expired. It then sends a NEW-CONFIG-COMMIT message to all configuration members, which unblock external requests and initiate transaction recovery.
Transaction state recovery
After reconfiguration, FaRM recovers transaction state using the logs distributed across the replicas of objects modified by a transaction. This involves recovering the state both at the replicas of objects modified by the transaction and at the coordinator to determine the outcome of the transaction. This recovery is done by the following steps:
- Block access to recovering regions: When the primary of a region fails, a backup is promoted to be the new primary, and access to the region is blocked until all transactions that updated it have been reflected at the new primary. Local accesses and RDMA references for the region are blocked until step 4, that is, until the write locks have been acquired for all recovering transactions that updated the region.
- Drain logs: When committing a transaction, coordinators report success to the application after receiving NIC acks for COMMIT-BACKUP and COMMIT-PRIMARY messages. This only guarantees that the messages have been written to the logs; they might not have been processed yet. During transaction state recovery, machines therefore drain the logs to ensure that all relevant records are processed: all records in the logs are processed in order after a NEW-CONFIG-COMMIT message is received, and at the end a LastDrained variable is updated with the configuration identifier. Log records for transactions with configuration identifiers less than or equal to LastDrained are rejected.
- Find recovering transactions: A recovering transaction is one whose commit phase spans configuration changes, and for which some replica of a written object, some primary of a read object, or the coordinator has changed due to reconfiguration. Only recovering transactions go through transaction recovery at primaries and backups. Records for a recovering transaction may be distributed over the logs of the different primaries and backups it updated. Each backup of a region sends the primary a NEED-RECOVERY message with the configuration identifier, the region identifier, and the identifiers of the recovering transactions that updated the region.
- Lock recovery: The primary of each region waits until the local machine’s logs have been drained and NEED-RECOVERY messages have been received from each backup, then builds the complete set of recovering transactions that affect the region. The transactions are sharded by identifier across its threads, such that each thread t recovers the state of transactions with coordinator thread identifier t. In parallel, the primary’s threads fetch any transaction log records from backups that are not already stored locally and then lock any objects modified by recovering transactions. The region is active once lock recovery is complete; local and remote coordinators can then obtain local pointers and RDMA references, allowing them to read and commit updates to the region in parallel with subsequent recovery steps.
- Replicate log records: The threads in the primary replicate log records by sending backups a REPLICATE-TX-STATE message for any transactions they are missing.
- Vote: The coordinator for a recovering transaction decides whether to commit or abort it based on votes from each region updated by the transaction. The votes are sent by the primaries of each region: the threads in the primary send RECOVERY-VOTE messages to their peer threads in the coordinator for each recovering transaction that modified the region. The voting rules are as follows (see the sketch after this list):
  * If any replica saw COMMIT-PRIMARY or COMMIT-RECOVERY, vote commit-primary.
  * If any replica saw COMMIT-BACKUP and none saw ABORT-RECOVERY, vote commit-backup.
  * If any replica saw LOCK and none saw ABORT-RECOVERY, vote lock.
  * Otherwise, vote abort.
  If a primary has no log records for the transaction, it votes truncated if the transaction has already been truncated, or unknown if it has not.
- Decide: The coordinator decides to commit if it receives a commit-primary vote from any region. Otherwise, it waits for all regions to vote and commits if at least one region voted commit-backup and all other regions voted lock, commit-backup, or truncated; otherwise it decides to abort. The coordinator then sends COMMIT-RECOVERY or ABORT-RECOVERY to all participant replicas, and a TRUNCATE-RECOVERY message once it receives acks from all primaries and backups.
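The voting and decision rules translate naturally into code. Below is a Go sketch of both; the enum values mirror the vote names in the text, and everything else is an illustrative stand-in.

```go
package main

import "fmt"

type Vote int

const (
	VoteAbort Vote = iota
	VoteUnknown
	VoteLock
	VoteTruncated
	VoteCommitBackup
	VoteCommitPrimary
)

// regionVote applies the per-region rules, given which records any
// replica of the region saw for the transaction.
func regionVote(sawCommitPrimaryOrRecovery, sawCommitBackup, sawAbortRecovery, sawLock, hasRecords, truncated bool) Vote {
	switch {
	case sawCommitPrimaryOrRecovery:
		return VoteCommitPrimary
	case sawCommitBackup && !sawAbortRecovery:
		return VoteCommitBackup
	case sawLock && !sawAbortRecovery:
		return VoteLock
	case !hasRecords && truncated:
		return VoteTruncated
	case !hasRecords:
		return VoteUnknown
	default:
		return VoteAbort
	}
}

// decide implements the coordinator's rule: commit on any
// commit-primary vote; otherwise commit only if some region voted
// commit-backup and every other vote is lock, commit-backup, or truncated.
func decide(votes []Vote) bool {
	for _, v := range votes {
		if v == VoteCommitPrimary {
			return true
		}
	}
	anyBackup := false
	for _, v := range votes {
		switch v {
		case VoteCommitBackup:
			anyBackup = true
		case VoteLock, VoteTruncated:
			// compatible with commit
		default: // abort or unknown
			return false
		}
	}
	return anyBackup
}

func main() {
	fmt.Println(decide([]Vote{VoteCommitBackup, VoteLock, VoteTruncated})) // true
	fmt.Println(decide([]Vote{VoteLock, VoteUnknown}))                     // false
}
```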
Recovering data
To ensure that it can tolerate f replica failures in the future for a given region, FaRM must re-replicate data at new backups. Each machine sends a REGIONS-ACTIVE message to the CM when all regions for which it is primary become active. After receiving all REGIONS-ACTIVE messages, the CM sends an ALL-REGIONS-ACTIVE message to all machines in the configuration. At this point, FaRM begins data recovery for new backups in parallel with foreground operations.
A new backup for a region starts with a freshly allocated and zeroed local region replica. The regions are divided across worker threads that recover them in parallel, each thread issuing one-sided RDMA operations to read one block at a time from the primary. Before a recovered object is copied to the backup, it is examined to check whether a new transaction modified it while the backup was receiving it, in which case the fetched copy must not overwrite the newer one.
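A sketch of that per-object check, assuming an illustrative object layout with a version field: the fetched copy is installed only if it is newer than what the backup already holds.

```go
package main

import "fmt"

type obj struct {
	version uint64
	data    string
}

// applyRecovered installs objects fetched from the primary, skipping
// any slot that a concurrent transaction has already made newer.
func applyRecovered(local, fetched []obj) {
	for i := range fetched {
		if fetched[i].version > local[i].version {
			local[i] = fetched[i]
		}
	}
}

func main() {
	local := []obj{{0, ""}, {7, "written during recovery"}}
	fetched := []obj{{3, "from primary"}, {5, "stale"}}
	applyRecovered(local, fetched)
	fmt.Println(local) // [{3 from primary} {7 written during recovery}]
}
```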
Recovering allocator state
The FaRM allocator splits regions into 1 MB blocks that are used as slabs for allocating small objects. Two pieces of metadata are kept: block headers, which contain the object size, and slab free lists. Block headers are replicated to backups when a new block is allocated, so even if the primary fails they remain available at the new primary (which will be one of the backups). To avoid inconsistencies when the old primary fails while replicating a block header, the new primary sends all block headers to all backups immediately after receiving NEW-CONFIG-COMMIT.
The slab free lists are kept only at the primary but each object has a bit in its header that is set by an allocation and cleared by a free during transaction execution. After a failure, the free lists are recovered in the new primary by scanning the objects in the region.
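A minimal sketch of that scan, assuming an illustrative header layout with a single allocated bit:

```go
package main

import "fmt"

// objHeader stands in for a per-object header; the allocated bit is
// set by an allocation and cleared by a free during execution.
type objHeader struct{ allocated bool }

// recoverFreeList rebuilds a slab's free list at the new primary by
// scanning the object headers and collecting the free slots.
func recoverFreeList(slab []objHeader) (free []int) {
	for i, h := range slab {
		if !h.allocated {
			free = append(free, i)
		}
	}
	return free
}

func main() {
	slab := []objHeader{{true}, {false}, {true}, {false}}
	fmt.Println(recoverFreeList(slab)) // [1 3]
}
```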
Conclusion
In the paper’s introduction, the authors say that prior attempts to implement the abstraction of “a single machine that never fails and that executes one transaction at a time in an order consistent with real time” in a distributed system resulted in poor performance. They then cite several systems that sacrificed some of these characteristics, such as strong consistency guarantees or transactions, for better performance.
That is what led them to build FaRM. The paper demonstrates that new systems need not compromise those characteristics in order to perform well. The ideas presented in the paper reflect this goal: reducing the number of messages exchanged, using one-sided RDMA to avoid involving the remote CPU, storing the data in non-volatile DRAM, and so on.
The performance results presented in the paper are interesting: FaRM achieves a peak throughput of 140 million TATP transactions per second on 90 machines, recovers from a failure in under 50 milliseconds, and reaches 4.5 million TPC-C “new order” transactions per second. Consult the paper for more numbers.
References
- Aleksandar Dragojević, Dushyanth Narayanan, Edmund B Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. No compromises: distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles, October 04-07, 2015, Monterey, California.