Erlang's Tail Recursion is Not a Silver Bullet
source link: https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html
One common belief held by Erlang programmers is that tail recursion is generally the best way to go when writing code. It's absolutely vital when writing long-running processes (you do not want to run out of memory because of your stack!), and I'm not debating that here. However, most sources show how to write Erlang functions by always favoring tail recursive solutions. I know Learn You Some Erlang falls into the same trap.
Why do I call it a trap? Well, because tail recursion is sometimes vital but body recursion never is, it's kind of normal to always show examples with tail recursion so that it becomes absolutely natural to everyone. It's also easier to avoid problems this way, because bad tail recursion often has fewer negative side-effects than bad body recursion (the fibonacci function comes to mind here). However, in some cases, tail recursion is actually slowing your application down. The cases where this happens include pretty much every function where you build a list. Now, I figure this is somewhat limited given how many possible data structures can be used, but the truth is, lists are everywhere and will almost always be a sizeable portion of the code you write in Erlang.
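For reference, here's what I mean by the fibonacci case, in a sketch of my own (not from the original post): the naive body-recursive version recomputes subproblems exponentially and grows deep stacks, while a tail-recursive rewrite with accumulators runs in linear time and constant stack space.

```erlang
%% Naive body recursion: exponential time, deep call stacks.
fib(N) when N < 2 -> N;
fib(N) -> fib(N - 1) + fib(N - 2).

%% Tail-recursive version with two accumulators: linear time,
%% constant stack.
fib_tail(N) -> fib_tail(N, 0, 1).

fib_tail(0, A, _) -> A;
fib_tail(N, A, B) -> fib_tail(N - 1, B, A + B).
```

This is the asymmetry mentioned above: getting body recursion wrong here costs you exponential time, while the "bad" tail-recursive habit merely costs an accumulator and a reversal when building lists.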
Let's go with an example, the flush function. The flush function will look at the mailbox and remove messages from there until nothing's left. My version here will accumulate the messages in the order they happened:
```erlang
tail_flush() -> tail_flush([]).

tail_flush(Acc) ->
    receive
        Msg -> tail_flush([Msg|Acc])
    after 0 ->
        lists:reverse(Acc)
    end.
```
This is a rather standard tail recursive function. Here's the body-recursive version:
```erlang
flush() ->
    receive
        Msg -> [Msg|flush()]
    after 0 ->
        []
    end.
```
I think few Erlang programmers would disagree with me if I were to say that flush/0 is much simpler and more elegant than tail_flush/0. It's easier to understand and more idiomatic. But here's the thing: the tail recursive one is likely the one we'll see the most in existing code (without proof).
I would dare say that tail recursion is not important here. We can just drop it and go with the simplest way of writing the code. In both cases, I basically end up creating a list that will hold the values of all messages. In one case, I will be creating a list of N elements to be carried as an accumulator; in the other case, I'll be growing a stack that is about the size of N elements, right? The difference is that the tail recursive version will create a lot of garbage by just dropping the accumulator after reversing it. A step-by-step explanation of this problem, along with a hypothesis, was given by Raimo Niskanen on the Erlang mailing list. Here's the theoretical conclusion:
tail recursion should require a top memory consumption of 2 * size(MsgList) words more than body recursion. This will cause extra garbage collection. Also the lists:reverse/1 is extra work that the body recursive variant avoids. In this case I speculate body recursion wins.
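As a rough way to eyeball that claim, here's a small sketch of my own (not from the mailing list thread): run a fun in a fresh process and report the process' final total heap size in words, the same unit Raimo uses. Note that this measures the heap left over at the end, not the true peak, so treat it as a rough proxy only.

```erlang
%% Rough sketch (my addition): run Fun in a fresh process and
%% report its final total heap size, in words. This is the heap
%% remaining at the end of the run, not the peak consumption.
heap_after(Fun) ->
    Parent = self(),
    Pid = spawn(fun() ->
        Fun(),
        {total_heap_size, Words} =
            process_info(self(), total_heap_size),
        Parent ! {self(), Words}
    end),
    receive {Pid, Words} -> Words end.
```

Comparing `heap_after(fun() -> tail_map(F, L) end)` against `heap_after(fun() -> body_map(F, L) end)` gives a ballpark idea of how much larger the tail-recursive variant's footprint grows.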
So I decided to test this with a crappy micro-benchmark. I usually do not like to trust micro-benchmarks, as they are a bit unreliable if you don't think of everything, but still, this seemed like good enough proof for a blog post:
```erlang
-module(bench).
-compile(export_all).

main() ->
    L = lists:seq(1, 100000),
    F = fun(X) -> X end,
    spawn(fun() ->
        io:format("Tail:~p~n",
                  [element(1, timer:tc(?MODULE, tail_map, [F, L]))])
    end),
    spawn(fun() ->
        io:format("Body:~p~n",
                  [element(1, timer:tc(?MODULE, body_map, [F, L]))])
    end).

tail_map(F, L) -> tail_map(F, L, []).

tail_map(_, [], Acc) -> lists:reverse(Acc);
tail_map(F, [H|T], Acc) -> tail_map(F, T, [F(H)|Acc]).

body_map(_, []) -> [];
body_map(F, [H|T]) -> [F(H) | body_map(F, T)].
```
So each process will iterate (while doing nothing) over a list of 100,000 integers. Fairly simple. Here are the results of a few runs, in microseconds:
Body:9384   Tail:14556
Body:11150  Tail:16260
Body:9419   Tail:14732
Body:11386  Tail:17946
Body:11380  Tail:16512
Body:10885  Tail:16876
Body:9923   Tail:14085
Body recursion is consistently faster. This might have to do with the size of each process and garbage collection. Quick tracing showed that in the case of this benchmark, with the arguments given to the function, the body recursive version had to be garbage collected once, while the tail recursive version had to be garbage collected twice. This would seem to support the theory. But hey, this could be related to some concurrency or allocation of processes and whatnot. So here I'm adding another overly simplistic benchmark, where both calls will run one after the other, in a single process:
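For those curious how such tracing can be done, here's a hedged sketch of my own (not the actual tracing code used for the article) that counts garbage collections in a spawned process using the garbage_collection trace flag. Note that the trace message tags vary by OTP release: older releases send gc_start/gc_end, newer ones send gc_minor_start, gc_major_start, and so on.

```erlang
%% Hedged sketch (mine): count GC events for a fun running in its
%% own process, via the garbage_collection trace flag.
count_gcs(Fun) ->
    Pid = spawn(fun() -> receive go -> ok end, Fun() end),
    erlang:trace(Pid, true, [garbage_collection]),
    Pid ! go,
    count_gcs(Pid, 0).

count_gcs(Pid, N) ->
    receive
        %% tag names depend on the OTP release; match the start events
        {trace, Pid, gc_start, _}       -> count_gcs(Pid, N + 1);
        {trace, Pid, gc_minor_start, _} -> count_gcs(Pid, N + 1);
        {trace, Pid, gc_major_start, _} -> count_gcs(Pid, N + 1);
        {trace, Pid, _, _}              -> count_gcs(Pid, N)
    after 1000 ->
        N
    end.
```

Running this once over each variant is enough to reproduce the once-versus-twice observation above, at least on a similar setup.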
```erlang
main2() ->
    spawn(fun() ->
        L = lists:seq(1, 100000),
        F = fun(X) -> X end,
        tail_map(F, L),
        body_map(F, L),
        {{A,_}, {B,_}} = {timer:tc(?MODULE, tail_map, [F, L]),
                          timer:tc(?MODULE, body_map, [F, L])},
        io:format("Tail:~p~nBody:~p~n", [A, B])
    end).
```
Here I'm running both functions once before timing them, in case the existing process needs to be resized so that things won't need to be garbage collected as much:
Tail:5329  Body:8648
Tail:5292  Body:6873
Tail:6738  Body:8798
Tail:5718  Body:6818
Tail:5266  Body:8669
Tail:5470  Body:6716
Tail:8108  Body:8952
Tail:5461  Body:6545
Tail:5648  Body:8607
Tail:5424  Body:7153
Aw crap. All my pretty conclusions, gone. Now body recursion is slower than tail recursion. However, not all is lost. What we can see is that body recursion is somewhat stable: running the function once or many times seems to yield very similar performance. In the case of tail recursion, having all the required space allocated beforehand, with the GC having done its thing already, seems to vastly improve the run time compared to a cold start.
Moreover, if I change the benchmarks a bit so that each of the looping functions runs a few hundred times within their own process, body recursion becomes faster again:
```erlang
main3() ->
    L = lists:seq(1, 100000),
    F = fun(X) -> X end,
    spawn(fun() ->
        io:format("Tail:~p~n",
                  [element(1, timer:tc(?MODULE, m_tail_map, [F, L, 500]))])
    end),
    spawn(fun() ->
        io:format("Body:~p~n",
                  [element(1, timer:tc(?MODULE, m_body_map, [F, L, 500]))])
    end).

m_tail_map(_, _, 0) -> ok;
m_tail_map(F, L, N) -> tail_map(F, L), m_tail_map(F, L, N-1).

m_body_map(_, _, 0) -> ok;
m_body_map(F, L, N) -> body_map(F, L), m_body_map(F, L, N-1).
```
And the results:
Body:2807180  Tail:3577161
Body:2810430  Tail:3644639
Body:2852699  Tail:3611445
Body:2898529  Tail:3699570
Body:3016723  Tail:3711662
Body:3027023  Tail:3712915
Body:2933290  Tail:3623219
What does this mean? Well, if I can get ahead of myself and pretend I can trust these results for the general case (which is probably not the smartest thing to do), it seems right to believe that tail recursion does require more work. Tracing the main/0 function for garbage collection revealed that it ended up needing more heap space towards the end of its run time, when calling lists:reverse/1. On the other hand, body recursion was able to do its thing within the space that was already allocated to it. It seems that this memory work is what makes tail recursion slower for list building, although there are definitely cases where this is false.
This could also mean that when building lists, the most idiomatic form of recursion is the one you want: it gives you a lower memory footprint, sometimes more speed, and cleaner code. This is a great property of Erlang: writing idiomatic and clean code often leads to optimal solutions. I would still say that it is very, very unlikely that the type of recursion you use when building lists would be the bottleneck of your application. In any case, to quote Raimo (who is quoting von Siemens), to measure is to know.
If you clicked on the Erlang mailing list link a few paragraphs earlier, you will have seen that Raimo added warnings relevant to this case:
In some other implementation where heap and stack are not in the same memory block the situation changes. I know of no such implementation yet. In the Erjang prototype stack space is limited so all body recursion should be avoided. That might be an unacceptable limitation since it impacts so hard at body recursion.
The measurements in this case were done on a single laptop, with a certain architecture, hardware and a given load on it, on the BEAM virtual machine, which has particular optimizations and a particular way to lay out a process' memory. These results are far from guaranteed to be repeatable in other settings (yay! no scientific proof here!). Measure, measure, measure.
One thing to remember no matter what is that writing clean code generally leads to better performance. This is something many languages cannot claim, but Erlang certainly can. In the general case, the optimisations in the Erlang VM come from observing idiomatic code in production, and clean code at that. This is why body recursion might be better here, and why many other optimizations were made. The best recent example I can think of has to do with synchronous messaging patterns, where mailbox scanning was optimized a little while ago. Clean code is the best code.
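To illustrate what I mean by that synchronous messaging pattern, here is a sketch of my own (not the optimized VM code itself) of the classic monitor-based call. Because the reference is created just before the receive, the VM can skip over every message that arrived before it instead of scanning the whole mailbox, which is exactly the idiomatic shape that got optimized.

```erlang
%% Sketch of the monitor-based synchronous call pattern. The fresh
%% monitor reference created right before the receive lets the VM
%% avoid rescanning older messages in the mailbox.
call(Pid, Request) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {call, self(), MRef, Request},
    receive
        {reply, MRef, Result} ->
            erlang:demonitor(MRef, [flush]),
            Result;
        {'DOWN', MRef, process, Pid, Reason} ->
            exit(Reason)
    end.
```

A server only needs to echo `{reply, MRef, Result}` back to the caller for this to work; gen_server's own call mechanism follows the same shape.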