An unexpected performance regression

Performance regressions are something that I find rather hard to track in an automated way. For the past few years, I have been working on a tool called fd, which aims to be a fast and user-friendly (but not necessarily feature-complete) alternative to find.

As you would expect from a file-searching tool, fd is an I/O-heavy program whose performance is governed by external factors such as filesystem speed, caching effects, and OS-specific behavior. To get reliable and meaningful timing results, I developed a command-line benchmarking tool called hyperfine, which takes care of things like warmup runs (for hot-cache benchmarks) and cache-clearing preparation commands (for cold-cache benchmarks). It also performs an analysis across multiple runs and warns the user about outside interference by detecting statistical outliers¹.
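To give a concrete idea of what that looks like in practice, here is a rough sketch of such invocations. The fd pattern and the Linux cache-dropping command are illustrative only, not the exact setup behind the numbers below:

```bash
# Hot-cache benchmark: warmup runs populate the filesystem caches
# before any timed run.
hyperfine --warmup 5 "fd -HI '.*[0-9]\.jpg$'"

# Cold-cache benchmark: drop the page cache before every timed run (Linux).
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
    "fd -HI '.*[0-9]\.jpg$'"
```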

But this is just a small part of the problem. The real challenge is to find a suitable collection of benchmarks that tests different aspects of your program across a wide range of environments. To get a feeling for the vast number of factors that can influence the runtime of a program like fd, let me tell you about one particular performance regression that I found recently².

I keep a small collection of old fd executables around in order to quickly run specific benchmarks across different versions. I noticed a significant performance regression between fd-7.0.0 and fd-7.1.0 in one of the benchmarks:

[Figure: benchmark comparison of fd-7.0.0 and fd-7.1.0]
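Since hyperfine accepts several commands in a single invocation and reports their relative speed, such a version comparison can be run directly. The binary names here are only placeholders for however the old executables are labeled:

```bash
# Compare two precompiled fd versions in one benchmark run;
# hyperfine prints a relative-speed summary at the end.
hyperfine --warmup 5 \
    "./fd-7.0.0 -HI '.*[0-9]\.jpg$'" \
    "./fd-7.1.0 -HI '.*[0-9]\.jpg$'"
```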

I quickly looked at the commits between 7.0 and 7.1 to see if there were any changes that could have introduced this regression. I couldn't find any obvious candidates.

Next, I decided to perform a small binary search by re-compiling specific commits and running the benchmark. To my surprise, I wasn't able to reproduce the fast times that I had measured with the precompiled binaries of the old versions. Every single commit yielded slow results!
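(In principle, this kind of binary search over commits can be automated with git bisect. The sketch below assumes a hypothetical bench.sh helper that builds the current commit, runs the benchmark, and exits non-zero when it is slower than a chosen threshold; as it turned out, that would not have helped here.)

```bash
git bisect start
git bisect bad v7.1.0      # known slow
git bisect good v7.0.0     # known fast
git bisect run ./bench.sh  # assumed helper: build, benchmark, check threshold
```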

There was only one way this could have happened: the old binaries were faster because they were compiled with an older version of the Rust compiler. The version that came out shortly before the fd-7.1.0 release was Rust 1.28. It made a significant change to how Rust binaries were built: it dropped jemalloc as the default allocator.

To make sure that this was the root cause of the regression, I re-enabled jemalloc via the jemallocator crate. Sure enough, this brought the time back down:

[Figure: benchmark results with jemalloc re-enabled]

Subsequently, I ran the whole "benchmark suite". I found consistent speedups of up to 40% by switching from the system allocator to jemalloc (see results below). The recently released fd-7.4.0 now re-enables jemalloc as the allocator for fd.
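For reference, opting into jemalloc in a Rust program is a small change. The following is a minimal sketch using the jemallocator crate; fd's actual setup may additionally gate this behind a Cargo feature and platform checks.

```rust
// Cargo.toml (sketch):
// [dependencies]
// jemallocator = "0.3"

use jemallocator::Jemalloc;

// Route all heap allocations of this binary through jemalloc
// instead of the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Any allocation-heavy code below now uses jemalloc.
    let entries: Vec<String> = (0..1_000).map(|i| format!("file-{}.jpg", i)).collect();
    println!("{} entries", entries.len());
}
```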

Unfortunately, I still don't have a good solution for automatically keeping track of performance regressions, but I would be very interested in your feedback and ideas.

Benchmark results

Simple pattern, warm cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fd-sysalloc '.*[0-9]\.jpg$' | 252.5 ± 1.4 | 250.6 | 255.5 | 1.26 |
| fd-jemalloc '.*[0-9]\.jpg$' | 201.1 ± 2.4 | 197.6 | 207.0 | 1.00 |

Simple pattern, hidden and ignored files, warm cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fd-sysalloc -HI '.*[0-9]\.jpg$' | 748.4 ± 6.1 | 739.9 | 755.0 | 1.42 |
| fd-jemalloc -HI '.*[0-9]\.jpg$' | 526.5 ± 4.9 | 520.2 | 536.6 | 1.00 |

File extension search, warm cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fd-sysalloc -HI -e jpg '' | 758.4 ± 23.1 | 745.7 | 823.0 | 1.40 |
| fd-jemalloc -HI -e jpg '' | 542.6 ± 2.7 | 538.3 | 546.1 | 1.00 |

File-type search, warm cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| fd-sysalloc -HI --type l '' | 722.5 ± 3.9 | 716.2 | 729.5 | 1.37 |
| fd-jemalloc -HI --type l '' | 526.1 ± 6.8 | 517.6 | 539.1 | 1.00 |

Simple pattern, cold cache:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| fd-sysalloc -HI '.*[0-9]\.jpg$' | 5.728 ± 0.005 | 5.723 | 5.733 | 1.04 |
| fd-jemalloc -HI '.*[0-9]\.jpg$' | 5.532 ± 0.009 | 5.521 | 5.539 | 1.00 |

¹ For example, I need to close Dropbox and Spotify before running fd benchmarks as they have a significant influence on the runtime.

² As stated in the beginning, I don't have a good way to automatically track this. So it took me some time to spot this regression :-(

