
Unbuffered I/O Can Make Your Rust Programs Much Slower

December 16th, 2021


In this post, we will look at a common source of bad performance in Rust code that can trip up even veteran developers, and what you can do when this happens in your programs.


Rust’s reputation as a language that enables developers to write fast and safe code is now well established. Every day, large organizations like Mozilla, Microsoft, Dropbox, and Amazon (to name a few) rely on Rust to deliver best-in-class performance to their customers while avoiding many of the safety issues that affect programs written in C or C++, the languages more traditionally used for high-performance work.

At Era Software, we value performance and we believe that by making products that are fast, we can help our customers get more out of their data while reducing their monthly infrastructure costs. Performance is a major reason why Rust is our language of choice. Yet, just writing code in Rust is not a guarantee of high performance. Rust is good, but it’s not magic. It’s a tool, and as with any tool, we must use it effectively to obtain optimal results.

In this post, we will look at a common source of bad performance in Rust code that can trip up even veteran developers: by default, reads and writes on files are not buffered.

Syscalls

Programs cannot directly read from or write to files on disk; they must ask the operating system to do that via a system call, or syscall. For example, in Linux, the kernel provides the write() syscall to transfer data from a program to a file. Our programs can write data to a file by calling write() with three parameters: a file descriptor, a pointer to the bytes we want to write, and how many of those bytes to write.
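To make those three parameters concrete, here is a minimal sketch that bypasses the standard library's Write methods and invokes write() directly. It assumes the libc crate is available; the file name is arbitrary and not one of the article's examples.

use std::fs;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let f = fs::File::create("/tmp/raw_write.txt")?;
    let buf = b"hello, syscall\n";
    // write(file descriptor, pointer to the bytes, number of bytes to write)
    let written = unsafe {
        libc::write(f.as_raw_fd(), buf.as_ptr() as *const libc::c_void, buf.len())
    };
    assert_eq!(written, buf.len() as isize);
    Ok(())
}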

A characteristic of Linux syscalls is that they are slower to invoke than regular functions. This is because they must switch from executing in user mode to executing in kernel mode, and that switch is costly. To ensure good performance, our programs should avoid making excessive syscalls.

I/O with and without buffering

Here is a simple Rust program that writes a few lines to a file. This program does not take advantage of buffering, which means that there will be one write() syscall for each of the three calls to the f.write() method.

use std::fs;
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let mut f = fs::File::create("/tmp/unbuffered.txt")?;
    f.write(b"foo")?;
    f.write(b"\n")?;
    f.write(b"bar\nbaz\n")?;
    return Ok(());
}

Running this program in strace, we can see that there are indeed three write() syscalls:

$ strace --trace=write ./target/release/01_unbuffered
write(3, "foo", 3)                      = 3
write(3, "\n", 1)                       = 1
write(3, "bar\nbaz\n", 8)               = 8

In this tiny example, the performance impact is minuscule, but in a real program that processes large data files, having millions, even billions, of frivolous syscalls results in a slower program and unhappy users.

Fortunately, we can improve our program with a one-line change. After we open the file, we can wrap it inside a BufWriter object.

use std::fs;
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let mut f = BufWriter::new(fs::File::create("/tmp/buffered.txt")?);
    // These writes only fill BufWriter's in-memory buffer; the write()
    // syscall happens when the buffer fills up or when the writer is dropped.
    f.write(b"foo")?;
    f.write(b"\n")?;
    f.write(b"bar\nbaz\n")?;
    Ok(())
}

Now, when we call f.write(), we are not invoking the write() syscall; we are just appending the bytes to a buffer inside the BufWriter. This happens entirely in user mode, so it’s very cheap. Only when the buffer fills up, or when we close the file, is a syscall made to transfer the bytes to disk. We can confirm that this is true with strace.

$ strace --trace=write ./target/release/02_buffered
write(3, "foo\nbar\nbaz\n", 12)         = 12

Buffering is all about amortization. We have to use syscalls to write our data to disk and pay the cost of calling them; however, we can be smart about it and issue fewer syscalls that pass more data at once.
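
One detail worth knowing about BufWriter: it flushes its buffer when it is dropped, but any error from that final write is silently ignored. If you want a different buffer size, or you want that error surfaced, you can use with_capacity() and call flush() yourself. Here is a minimal sketch; the file name, capacity, and loop are illustrative and not part of the examples above.

use std::fs;
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let file = fs::File::create("/tmp/buffered_lines.txt")?;
    // Use a 64 KiB buffer instead of BufWriter's 8 KiB default.
    let mut f = BufWriter::with_capacity(64 * 1024, file);
    for i in 0..100_000 {
        // Each writeln! appends to the in-memory buffer; a write() syscall
        // happens only when the buffer fills up.
        writeln!(f, "line {}", i)?;
    }
    // Flush explicitly so a failed final write is reported here instead of
    // being swallowed when the BufWriter is dropped.
    f.flush()?;
    Ok(())
}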

Deserialization and buffering

Problems can easily show up in Rust programs when buffering isn’t used. Take serde_json, for example, a library that reads and writes JSON data through an easy-to-use interface. Its from_reader() function accepts any object that implements the Read trait and decodes the bytes into a JSON tree. The File type implements the Read trait, so we can very easily decode a file on disk. Here is a simple program that does this; note that the File is intentionally not wrapped in a BufReader.

use std::fs;
use std::io;

fn main() -> io::Result<()> {
    let mut f = fs::File::open("sample.json")?;
    // serde_json reads from the unbuffered File one byte at a time.
    let v: serde_json::Value = serde_json::from_reader(&mut f).unwrap();
    println!("{}", v.is_object());
    Ok(())
}

We can use perf to count how many read() syscalls were made in the execution of this program.

$ sudo perf stat -e syscalls:sys_enter_read ./target/release/04_unbuffered_json
 Performance counter stats for './target/release/04_unbuffered_json':
         2,009,119      syscalls:sys_enter_read
         

The sample.json file is 2,009,108 bytes. To deserialize the file, serde_json makes one syscall per byte! (The extra 11 read() syscalls occur at the beginning of the program to load libc, etc.) Our trusty strace confirms that this is the case.

$ strace --trace=read ./target/release/04_unbuffered_json
...
read(3, "{", 1)                         = 1
read(3, "\"", 1)                        = 1
read(3, "t", 1)                         = 1
read(3, "y", 1)                         = 1
read(3, "p", 1)                         = 1
read(3, "e", 1)                         = 1
read(3, "\"", 1)                        = 1
read(3, ":", 1)                         = 1
...

When we fix the program by wrapping our file inside a buffered reader, the results are striking. We reduce the number of syscalls by a factor of almost 8000 – which makes a lot of sense, because we read 8192 bytes at a time instead of just one – and the program runs 11 times faster.
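
For reference, the buffered version is the same one-line change as before: wrap the File in a BufReader before handing it to serde_json. Here is a sketch of what 05_buffered_json looks like.

use std::fs;
use std::io::{self, BufReader};

fn main() -> io::Result<()> {
    let f = fs::File::open("sample.json")?;
    // BufReader fetches 8192 bytes per read() syscall; serde_json then
    // pulls individual bytes from that in-memory buffer.
    let mut reader = BufReader::new(f);
    let v: serde_json::Value = serde_json::from_reader(&mut reader).unwrap();
    println!("{}", v.is_object());
    Ok(())
}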

$ sudo perf stat -e syscalls:sys_enter_read ./target/release/05_buffered_json
 Performance counter stats for './target/release/05_buffered_json':
               257      syscalls:sys_enter_read


$ strace --trace=read ./target/release/05_buffered_json
...
read(3, "{\"type\":\"FeatureCollection\",\"crs"..., 8192) = 8192
read(3, "6200000000001}},{\"type\":\"Feature"..., 8192) = 8192
read(3, "egion\":\"AK\",\"category\":\"In-betwe"..., 8192) = 8192
read(3, "01}},{\"type\":\"Feature\",\"id\":95,\""..., 8192) = 8192
...

$ hyperfine -w 5 -m 30 \
    ./target/release/04_unbuffered_json \
    ./target/release/05_buffered_json
Benchmark #1: ./target/release/04_unbuffered_json
  Time (mean ± σ):     326.3 ms ±   8.1 ms    [User: 70.2 ms, System: 256.0 ms]
  Range (min … max):   312.2 ms … 346.8 ms    30 runs

Benchmark #2: ./target/release/05_buffered_json
  Time (mean ± σ):      28.5 ms ±   1.4 ms    [User: 22.9 ms, System: 5.6 ms]
  Range (min … max):    26.2 ms …  33.2 ms    106 runs

Summary
  './target/release/05_buffered_json' ran
   11.43 ± 0.63 times faster than './target/release/04_unbuffered_json'

We recently found this exact problem in one of our products at Era Software. We were investigating an unrelated issue when we noticed that a 600 MiB file took more than thirty seconds to be deserialized. Because serializing the file only took a second, there was clearly something fishy going on. We investigated and quickly found exactly what we discussed in this post – we opened the file and deserialized it, but forgot to wrap the file in a BufReader. We changed the code as we’ve shown in this post and brought the deserialization time down to just one second.

I want to highlight that although the original code was written by a senior developer who knows this stuff and was reviewed by multiple senior developers who know this stuff, this performance bug still managed to escape their vigilance and wiggle its way into our main branch. Like I said in the introduction, it’s easy even for veterans to miss this problem!

How to find these issues?

Alright, so we know about the perils of unbuffered I/O, but we also know that it can escape the vigilance of experienced programmers, so what can we do? At the moment, Rust doesn’t have an automatic way to inform you of these issues – the compiler will not issue a warning, and Clippy doesn’t have a lint that informs you that your I/O is unbuffered.

However, we saw that strace can be useful – it told us that the read() and write() syscalls processed only a single byte. So let’s see how we can use strace (and a little helping of awk) to (1) find whether we have a lot of single-byte reads or writes and (2) use the stack trace functionality of strace to see where in our program those single-byte reads or writes occur.

As an example, we’ll use the unbuffered program from the previous section that reads a JSON file. To make the output more readable, we’ll use this tiny JSON payload: {"id":42}. Let’s run the program under strace! We use the --trace option to keep only the calls to read() and use awk to keep the syscalls that read a single byte (that is, lines that end with = 1). The strace option --decode-fds=path tells us which file the descriptor 3 refers to (in this case, /tmp/simple.json), which helps us figure out which part of the program is at fault.

$ strace --decode-fds=path \
         --trace=read \
         /tmp/04_unbuffered_json 2>&1 |
    awk '/= 1$/'
read(3</tmp/simple.json>, "{", 1)       = 1
read(3</tmp/simple.json>, "\"", 1)      = 1
read(3</tmp/simple.json>, "i", 1)       = 1
read(3</tmp/simple.json>, "d", 1)       = 1
read(3</tmp/simple.json>, "\"", 1)      = 1
read(3</tmp/simple.json>, ":", 1)       = 1
read(3</tmp/simple.json>, "4", 1)       = 1
read(3</tmp/simple.json>, "2", 1)       = 1
read(3</tmp/simple.json>, "}", 1)       = 1
read(3</tmp/simple.json>, "\n", 1)      = 1

We can see from all the lines ending with = 1 that our program is indeed doing single-byte reads, and the data is our JSON content laid out vertically. So we’ve successfully used strace to verify issue (1)!

Now, how do we find where these reads are made? Fortunately for us, strace has a very useful flag, --stack-traces, to display the sequence of function calls which resulted in the syscall. We’ll now make our awk program a little more complex to display the stack trace associated with a single-byte read. Here’s an explanation of how it works:

  • If the current line ends with = 1, we set the variable show to 1 (true).

  • If the current line ends with an equal sign followed by something other than the digit 1 (that is, a read that returned more than one byte), we set the variable show to 0 (false).

  • We show the current line when show == 1.

Here’s the output for the syscall that reads the first opening brace of our file.

$ strace --stack-traces \
         --decode-fds=path \
         --trace=read \
         /tmp/04_unbuffered_json 2>&1 |
    awk '/= 1$/ { show = 1 } /= [^1]$/ { show = 0 } show'
read(3</tmp/simple.json>, "{", 1)       = 1
 > /usr/lib/x86_64-linux-gnu/libpthread-2.33.so(read+0x12) [0x13152]
 > /tmp/04_unbuffered_json(<std::fs::File as std::io::Read>::read+0x23) [0x1fbd3]
 > /tmp/04_unbuffered_json(<std::io::Bytes<R> as core::iter::traits::iterator::Iterator>::next+0x33) [0xca53]
 > /tmp/04_unbuffered_json(_ZN10serde_json5value2de77_$LT$impl$u20$serde..de..Deserialize$u20$for$u20$serde_json..value..Value$GT$11deserialize17h23bf1ff9e8286bd9E.llvm.16551263557485243796+0x8ec) [0xb39c]
 > /tmp/04_unbuffered_json(serde_json::de::from_reader+0x48) [0x9fb8]
 > /tmp/04_unbuffered_json(_04_unbuffered_json::main+0x78) [0xc7d8]
 > /tmp/04_unbuffered_json(std::sys_common::backtrace::__rust_begin_short_backtrace+0x3) [0xc583]
 > /tmp/04_unbuffered_json(_ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h3d55d3d7814f859cE.llvm.14713913286507758235+0x9) [0xc929]
 > /tmp/04_unbuffered_json(std::rt::lang_start_internal+0x30a) [0x2650a]
 > /tmp/04_unbuffered_json(main+0x22) [0xc912]
 > /usr/lib/x86_64-linux-gnu/libc-2.33.so(__libc_start_main+0xd5) [0x28565]
 > /tmp/04_unbuffered_json(_start+0x2e) [0x8e3e]

We see that the read() syscall was invoked while we were deserializing JSON (serde_json::de::from_reader) in the main() function. With the file name and the function in hand, that should narrow our search and help us find where we need to add buffering. And that takes care of issue (2)!

Conclusion

In this post, we saw that:

  • System calls in Linux are slower than regular functions

  • Issuing too many syscalls can have very negative effects on run-time performance

  • By using BufReader and BufWriter, we can amortize the cost of syscalls

  • Even experienced programmers can miss these issues

  • We can use strace and awk to find if and where unbuffered I/O happens in our programs

I want to end by saying that if this type of issue happens in your own programs, don’t feel bad about it – we’re all only human, we can’t avoid making mistakes. And when mistakes happen, make the best of them. Use them as examples to remind developers on your team that files in Rust are not buffered by default, which is different from languages they may be familiar with, like Python. If you have junior developers on your team, take some time out of your schedule to help them understand the issue. Help people acquire a “spidey sense” that tingles when they see the word File without the words BufReader or BufWriter nearby. It won’t prevent this type of problem from happening ever again, but there will be more eyeballs on the lookout.

At Era Software, we strive for our culture to be one of kindness as well as technical excellence. We’re assembling a team of experts to innovate in distributed systems, machine learning, and database engineering. If you’re interested in our work, check out our company and career page.

Vincent Foley

Sr. Software Engineer

