
John Fremlin's blog: Remote Procedure Calls at scale

source link: http://john.freml.in/rpc-at-scale

Posted 2022-04-01 22:00:00 GMT

Large-scale distributed systems can feel like a chemical plant. Issues can slowly build up like rising pressure, and then take a while to clean up, like a chemical spill. My experience with several systems using hundreds of thousands of machines or millions of cores is that you can't treat remote procedure call APIs as if each function call were fully independent.

Imagine just one server with a broken disk that keeps returning errors. Its load will be low, so it can attract a disproportionate amount of client traffic. Imagine a client that fails to read the response and retries. It's like a denial-of-service attack.

Healthcheck and heartbeat. Best practice for servers is to have a healthcheck method that the load balancer calls before letting clients reach the server. This method shouldn't be left as an empty "return OK". It should confirm that all expected dependencies (e.g. downstream databases) are accessible and that all configuration is present and error-free. At scale, where there is always some weird machine with a system package going haywire on a broken kernel version, you also need to sanity-check basic things, like that there is roughly adequate disk and CPU bandwidth.
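
For concreteness, here is a minimal Go sketch of such a healthcheck handler, assuming a single database dependency and a local scratch directory; a real service would check every dependency and resource it actually needs.

    // Minimal healthcheck sketch: verify a downstream dependency answers and
    // that the local machine can still do basic work (write to disk).
    package health

    import (
        "context"
        "database/sql"
        "net/http"
        "os"
        "path/filepath"
        "time"
    )

    func Handler(db *sql.DB, scratchDir string) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()

            // Downstream dependency: the database must answer, not just accept TCP.
            if err := db.PingContext(ctx); err != nil {
                http.Error(w, "database unreachable: "+err.Error(), http.StatusServiceUnavailable)
                return
            }

            // Basic machine sanity: local disk is still writable.
            probe := filepath.Join(scratchDir, "health-probe")
            if err := os.WriteFile(probe, []byte("ok"), 0o600); err != nil {
                http.Error(w, "disk not writable: "+err.Error(), http.StatusServiceUnavailable)
                return
            }
            os.Remove(probe)

            w.WriteHeader(http.StatusOK)
            w.Write([]byte("ok"))
        }
    }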

Beyond the healthcheck communication to the load balancer or cluster manager, it's important to have a heartbeat with the client. The heartbeat should be called periodically by the client, and ideally also on first use of a server-client pair. The heartbeat has multiple objectives (the message shapes are sketched after the list):

  • healthcheck for the connection; if the heartbeat fails or is slow, the RPC layer or the network is broken, and this isolates the problem
  • share versions (and experiment variants) for compatibility tweaks on both sides
  • allow the server to instruct the client to disable itself (sleep), or go to another server
  • update client settings like retry aggressiveness, flow control buffers and rates
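
A sketch of what the heartbeat messages might carry, as illustrative Go structs; the field names are assumptions for this post, not any particular RPC framework's API.

    // Illustrative heartbeat request/response shapes.
    package heartbeat

    import "time"

    // HeartbeatRequest is sent periodically by the client, and on first use
    // of a server-client pair.
    type HeartbeatRequest struct {
        ClientVersion string            // build/release of the client binary
        Experiments   []string          // active experiment variants, for compatibility tweaks
        Application   string            // which application this client belongs to
        Location      string            // datacenter / region, for targeted throttling
        Tags          map[string]string // free-form labels, e.g. "experiment_job": "try_123"
    }

    // HeartbeatResponse lets the server steer the client.
    type HeartbeatResponse struct {
        ServerVersion  string
        Sleep          time.Duration // if > 0, the client should go quiet for this long
        RedirectTo     string        // if set, the client should switch to this server
        RetryBaseDelay time.Duration // baseline backoff between retries
        MaxInflight    int           // flow control: concurrent requests allowed
        MaxBufferBytes int           // flow control: per-request buffer ceiling
    }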

It's really important to have strong integration-test guarantees that clients respond well to the sleep, go-to-another-server, and flow control update commands. Software inventory management at scale is very hard: some large service might suddenly be restarted with an old version that boots up very aggressively. To contain this explosion of load, the servers use the heartbeat update to moderate the clients, targeted closely at the misbehaving ones. This means the heartbeat has to carry enough information to semantically isolate problematic categories of client, i.e. full details of which application, which location, and so on. If you can only filter on user account or network location, it may be hard to differentiate between clients with big impact and experimental clients or batch systems that are resilient to downtime. When the server is unhealthy, it needs to be able to tell clients to go away.
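
A sketch of the client side, reusing the HeartbeatResponse shape above; this is the behaviour those integration tests should pin down.

    // Illustrative client handling of heartbeat instructions.
    package heartbeat

    import (
        "sync"
        "time"
    )

    type Client struct {
        mu          sync.Mutex
        serverAddr  string
        maxInflight int
        retryDelay  time.Duration
    }

    // ApplyHeartbeat is the piece worth covering with integration tests:
    // clients must genuinely sleep, switch servers, and respect new limits.
    func (c *Client) ApplyHeartbeat(resp HeartbeatResponse) {
        c.mu.Lock()
        if resp.RedirectTo != "" {
            c.serverAddr = resp.RedirectTo // move load to the indicated server
        }
        if resp.MaxInflight > 0 {
            c.maxInflight = resp.MaxInflight // tighten or relax flow control
        }
        if resp.RetryBaseDelay > 0 {
            c.retryDelay = resp.RetryBaseDelay
        }
        c.mu.Unlock()

        if resp.Sleep > 0 {
            time.Sleep(resp.Sleep) // go quiet; an unhealthy server asked us to back off
        }
    }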

Call pacing. At scale, an API is also a flow. The system as a whole may aspire to be elastically scalable, but any single server machine or network link is not. After successful and unsuccessful calls, it's really useful if the server can set a guideline for how long to wait before the next call or retry (in general, having the ability to update any heartbeat setting in the call response metadata). Allowing the server to guide the client's use of connection pools and buffers also helps contain overload. The best place to contain overload is at the source: throttling something after the client has sent it is more expensive than not generating the request at all. And sometimes the server may find it hard to guess how much work a request will need. If the client can set a short deadline and indicate that the call should fail if it needs buffers above a certain size, the server can prioritize and schedule better.
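
A sketch of server-guided pacing, with illustrative request/response fields and a client helper that honours the server's hint; none of these names come from a real protocol.

    // Illustrative pacing: the response carries a suggested wait before the
    // next call or retry; the request carries a deadline and a buffer ceiling.
    package pacing

    import (
        "context"
        "time"
    )

    type Request struct {
        Deadline       time.Time // fail fast rather than queue forever
        MaxResponseLen int       // server may refuse or trim work above this size
        Payload        []byte
    }

    type Response struct {
        NextCallAfter time.Duration // server guideline: wait at least this long before the next call
        RetryAfter    time.Duration // on failure, wait this long before retrying
        Payload       []byte
        Err           error
    }

    // callWithPacing issues one call and then honours the server's pacing hint
    // before allowing the next one.
    func callWithPacing(ctx context.Context, send func(Request) Response, req Request) Response {
        resp := send(req)

        wait := resp.NextCallAfter
        if resp.Err != nil && resp.RetryAfter > wait {
            wait = resp.RetryAfter
        }
        if wait > 0 {
            select {
            case <-time.After(wait):
            case <-ctx.Done():
            }
        }
        return resp
    }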

Observability. If a request is a retry or a hedged request, the server should be told; otherwise, a single failure can look like many. Ideally the server should be able to log enough information to connect to client logs and tracing, and to differentiate between classes of requests from the same client. It's generally useful to combine fixed agreed names (e.g. LOW_PRIORITY or ADS_HIGH_PRIORITY) with unstructured tags (e.g. "experiment_job=try_123"); both are useful. If the client logs timing information for the previous call when making the next call, the gap between server and client measurements can be closed without additional RPCs or other logging destinations.
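
A sketch of the per-call metadata this implies; the field names are illustrative, the point is what travels with every request.

    // Illustrative per-call observability metadata.
    package rpcmeta

    import "time"

    type CallMeta struct {
        TraceID  string            // join server logs with client logs and traces
        Attempt  int               // 0 for the original call, >0 for retries
        Hedged   bool              // true if this is a hedged duplicate, not a new failure
        Priority string            // fixed agreed name, e.g. "LOW_PRIORITY", "ADS_HIGH_PRIORITY"
        Tags     map[string]string // unstructured, e.g. "experiment_job": "try_123"

        // Timing of the previous call as measured by the client, piggybacked on
        // the next call so the client/server measurement gap can be closed
        // without extra RPCs or a separate logging pipeline.
        PrevCallID      string
        PrevClientRTT   time.Duration
        PrevClientError string
    }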

The general reasoning: the server is easier to reconfigure than its clients and has more system context. If there is just one error, a client might as well retry immediately; if the system is overloaded and cannot process requests, low-priority clients should wait a long time before retrying. The client doesn't have this context. The above mechanisms help make the system manageable. They could in theory be integrated into the RPC layer, but normally there are additional application settings that are useful to carry in the heartbeats and metadata.
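
As a toy illustration of that reasoning, a server might compute a suggested retry delay from its own load and the client's priority class; the load signal and thresholds here are made up.

    // Illustrative server-side policy: isolated errors retry immediately,
    // overload pushes low-priority traffic back hardest.
    package pacing

    import "time"

    func suggestedRetryDelay(serverLoad float64, priority string) time.Duration {
        switch {
        case serverLoad < 0.7:
            // Healthy: a one-off error, retry right away.
            return 0
        case priority == "LOW_PRIORITY":
            // Overloaded: shed the least important traffic first and for longest.
            return 5 * time.Minute
        default:
            return 10 * time.Second
        }
    }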

It's best for clients to control load before sending it. However, a common failure situation is a large client deployment starting up or restarting rapidly in a loop; then even the first API calls may overwhelm the servers. Other load control mechanisms and defense in depth are important. Ideally both servers and clients are able to share capacity and load expectations, so that, for example, the server doesn't try sending many huge responses to a client that can't be expected to process them immediately.
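
A sketch of such client-side defense in depth: jitter the first call after startup and rate-limit locally before anything is sent. The numbers are illustrative; golang.org/x/time/rate provides the limiter.

    // Illustrative client-side guards: startup jitter plus a local rate limit,
    // so a fleet restarting in a loop doesn't overwhelm servers with its very
    // first requests.
    package clientguard

    import (
        "context"
        "math/rand"
        "time"

        "golang.org/x/time/rate"
    )

    var limiter = rate.NewLimiter(rate.Limit(50), 10) // at most 50 requests/s, bursts of 10

    // startupJitter spreads the first calls of a large deployment over a window
    // instead of letting them all land at the same instant.
    func startupJitter(maxDelay time.Duration) {
        time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
    }

    // sendGuarded blocks locally until the limiter allows the call; shaping load
    // at the source is cheaper than throttling it after it has been sent.
    func sendGuarded(ctx context.Context, send func(context.Context) error) error {
        if err := limiter.Wait(ctx); err != nil {
            return err
        }
        return send(ctx)
    }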

Hope this helps! Big systems are awesome and don't need to be scary!

