The hidden cost of exception handling
source link: https://grenouillebouillie.wordpress.com/2022/05/09/the-hidden-cost-of-exception-handling/
During a meeting today, a colleague of mine shared the belief that exception handling had no impact on optimizations in modern C++. While we did everything we could 20+ years ago to ensure that all kinds of optimizations remained possible, there is a residual cost that you can still trigger. This is section 3.8 in the paper, if you are curious.
Fibonacci computed using template instantiation
Let’s start with this simple C++ code:
template<unsigned N>
struct A
{
    A(): i() {}
    ~A() {}
    unsigned S() const { return i.S(); }
    struct I : A<N-1>, A<N-2>
    {
        unsigned S() const
        {
            return A<N-1>::S() + A<N-2>::S();
        }
    } i;
};
template<> struct A<0> { unsigned S() const { return 1; } };
template<> struct A<1> { unsigned S() const { return 1; } };
int main()
{
    A<SIZE> a;
    return a.S();
}
What this code is doing is asking the compiler to compute the Fibonacci sequence using template instantiation. The Fibonacci sequence was chosen because it requires recursive instantiation and, as such, is somewhat costly in terms of the number of calls.
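For reference, the number this template metaprogram computes can also be written as an ordinary function. This is a sketch of mine, not the article's code, with the same base cases as A<0> and A<1>:

```cpp
// Ordinary-function equivalent of the template computation (a sketch,
// not the article's code); A<0> and A<1> both yield 1, so this is the
// Fibonacci sequence shifted by one position.
constexpr unsigned fib(unsigned n) {
    return n <= 1 ? 1 : fib(n - 1) + fib(n - 2);
}

// Small values fold at compile time; larger ones are better evaluated
// at run time to stay under typical constexpr step limits.
static_assert(fib(10) == 89, "A<10>.S() would return 89");
```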
Quiz: why do I need struct I?
If we compile this code and look at the generated assembly code with -fno-exceptions, we get the following output:
% c++ -DSIZE=30 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && cat /tmp/glop.s
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0 sdk_version 12, 3
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
.cfi_startproc
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
movl $1346269, %eax ## imm = 0x148ADD
popq %rbp
retq
.cfi_endproc
## -- End function
.subsections_via_symbols
So even for A<30>, the compiler managed to compute the constant value directly, and the generated code essentially puts that value, $1346269, into a register. The optimizations work quite well in that case.
If we count the lines in the output assembly, we have 18. We will use that as a metric from here on.
When optimizations work, the cost can be zero
Now, if you compile the same code with exceptions, you get exactly the same result. So far, so good, the cost of exceptions was zero. The compiler could notably determine that it did not need to generate exception tables or landing pads.
Let us now make the same thing, but with a minor change, namely adding a constructor and a destructor in the inner class I that call two inline functions called construct() and destruct(), the rest of the code being the same:
inline void construct() {}
inline void destruct() {}

template<unsigned N>
struct A
{
    [...]
    struct I : public A<N-1>, public A<N-2>
    {
        I() { construct(); }
        ~I() { destruct(); }
        [...]
    } i;
};
Again, the generated code is exactly the same. The code size is still 18 lines. So far, so good. Still zero cost.
Optimizing side effect computations
Compilers are even smart enough nowadays to correctly evaluate side effects of the various calls on global variables. For example, consider this implementation of our little helper functions:
unsigned ctors = 0, dtors = 0;
inline void construct() { ctors++; }
inline void destruct() { dtors++; }
In that case, the compiler will be able to deduce how many times the functions would be called:
% c++ -DSIZE=30 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && cat /tmp/glop.s
[...]
addl $1346268, _ctors(%rip) ## imm = 0x148ADC
addl $1346268, _dtors(%rip) ## imm = 0x148ADC
movl $1346269, %eax ## imm = 0x148ADD
The size of our code is now 24 lines, so it has not markedly increased. Great job, compiler!
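We can sanity-check those addl immediates: construct() runs once per inner I object, that is, once per A<N> with N >= 2 in the recursion tree. A small helper (my sketch, not from the article) counts them:

```cpp
// Counts how many times construct() runs for A<n>: one call per inner
// I object, i.e. one per node with n >= 2 in the recursion tree.
// (A sketch of my reasoning, not the article's code.)
unsigned construct_calls(unsigned n) {
    return n <= 1 ? 0 : 1 + construct_calls(n - 1) + construct_calls(n - 2);
}
```

construct_calls(30) evaluates to 1346268, exactly the immediate the compiler folded into the addl instructions above.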
Hiding the functions
Let us now assume that we hide the function definitions:
extern void construct(), destruct();
The generated code in that case becomes noticeably longer:
% c++ -DSIZE=30 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && wc -l /tmp/glop.s
932 /tmp/glop.s
So the code is now 932 lines long, which is 51 times bigger than the initial one. In a sense, this is not overly surprising, since we added a bit of work to do. So the compiler generated a large number of actual function calls.
Interestingly, the number of calls generated does not match the number of dynamic calls computed statically. For A<30>, there are 91 calls to construct and 91 calls to destruct (10 of each being tail calls). For A<40>, there are 126 calls each, 15 of them being tail calls.
Let’s make the compiler sweat
What we are doing with this kind of code is putting a lot of pressure on the poor compiler. The problem is that the compiler's effort is non-linear in the value of SIZE in the code above. As a matter of fact, if you try to increase the size, you quickly hit various compiler limits (columns: SIZE, compilation time, lines of generated assembly, calls to construct, calls to destruct):
1 0.05 real 0.02 user 0.01 sys 18 0 0
2 0.04 real 0.02 user 0.01 sys 20 1 1
3 0.07 real 0.03 user 0.01 sys 22 2 2
4 0.05 real 0.03 user 0.01 sys 26 4 4
5 0.05 real 0.03 user 0.01 sys 32 7 7
6 0.05 real 0.03 user 0.01 sys 42 12 12
7 0.06 real 0.04 user 0.01 sys 58 20 20
8 0.06 real 0.04 user 0.01 sys 84 33 33
9 0.06 real 0.04 user 0.01 sys 132 34 34
10 0.07 real 0.05 user 0.01 sys 179 35 35
11 0.07 real 0.05 user 0.01 sys 199 37 37
12 0.08 real 0.05 user 0.01 sys 228 40 40
13 0.08 real 0.06 user 0.02 sys 285 45 45
14 0.08 real 0.06 user 0.01 sys 300 46 46
15 0.08 real 0.06 user 0.01 sys 336 52 52
16 0.09 real 0.06 user 0.01 sys 394 53 53
17 0.10 real 0.08 user 0.01 sys 446 54 54
18 0.09 real 0.07 user 0.01 sys 466 56 56
19 0.10 real 0.08 user 0.02 sys 497 59 59
20 0.10 real 0.08 user 0.02 sys 556 64 64
21 0.10 real 0.08 user 0.01 sys 571 65 65
22 0.11 real 0.08 user 0.01 sys 607 71 71
23 0.11 real 0.09 user 0.01 sys 665 72 72
24 0.11 real 0.09 user 0.02 sys 717 73 73
25 0.12 real 0.09 user 0.01 sys 737 75 75
26 0.12 real 0.10 user 0.01 sys 764 78 78
27 0.13 real 0.11 user 0.02 sys 823 83 83
28 0.14 real 0.11 user 0.01 sys 838 84 84
29 0.15 real 0.13 user 0.02 sys 874 90 90
30 0.16 real 0.13 user 0.02 sys 932 91 91
31 0.17 real 0.14 user 0.02 sys 984 92 92
32 0.19 real 0.17 user 0.01 sys 1004 94 94
33 0.22 real 0.20 user 0.02 sys 1031 97 97
34 0.27 real 0.24 user 0.02 sys 1090 102 102
35 0.35 real 0.32 user 0.02 sys 1105 103 103
36 0.46 real 0.43 user 0.01 sys 1141 109 109
37 0.65 real 0.62 user 0.02 sys 1199 110 110
38 0.96 real 0.93 user 0.02 sys 1251 111 111
39 1.45 real 1.42 user 0.02 sys 1271 113 113
40 2.25 real 2.20 user 0.03 sys 1298 116 116
41 3.57 real 3.51 user 0.04 sys 1357 121 121
42 5.94 real 5.81 user 0.08 sys 1372 122 122
43 9.52 real 9.33 user 0.11 sys 1408 128 128
44 15.16 real 14.83 user 0.16 sys 1466 129 129
45 23.99 real 23.62 user 0.22 sys 1518 130 130
46 40.89 real 39.67 user 0.37 sys 1542 132 132
47 62.51 real 61.13 user 0.51 sys 1578 135 135
48 /tmp/glop.cpp:27:5: warning: stack frame size (512559704) exceeds limit (4294967295) in function 'main' [-Wframe-larger-than]
int main()
^
1 warning generated.
103.61 real 101.16 user 0.95 sys 1645 140 140
49 /tmp/glop.cpp:27:5: warning: stack frame size (3483774824) exceeds limit (4294967295) in function 'main' [-Wframe-larger-than]
int main()
^
1 warning generated.
170.82 real 166.37 user 1.57 sys 1659 141 141
50 /tmp/glop.cpp:27:5: warning: stack frame size (3996334552) exceeds limit (4294967295) in function 'main' [-Wframe-larger-than]
int main()
^
1 warning generated.
266.99 real 262.95 user 2.29 sys 1715 147 147
In short, a modern compiler, on a machine with 16 GB of memory, starts having trouble compiling the code above for sizes of 48 and beyond. It also takes close to 3 minutes to give me a result for size 49, and over 4 minutes for size 50.
We make the compiler suffer because the cost of compiling that code grows non-linearly with SIZE. This is true despite the fact that the number of calls and the size of the generated code grow roughly linearly (which, to be honest, was a surprise to me; it shows some really smart optimizations).
Exception handling tables
Let us now compile with -fexceptions. Here, the assembly code jumps to 5202 lines for A<30>. In other words, the code became about 5 times larger. If we plot the code size relative to the size of the problem, we get a roughly constant overhead factor.
This overhead corresponds to the amount of code that needs to be generated to deal with the various exit possibilities.
Code generation change
What is more interesting is if we contrast the number of calls to construct and destruct:
In the no-exception case, the number of generated calls to construct and destruct is identical, and is plotted in blue on the graph. In the exception case, the number of calls to destruct in the generated code is higher than the number of calls to construct. Presumably, this is because there are multiple distinct exit paths when destroying a partially-constructed object (e.g. if an exception is thrown from destruct(), then you end up calling terminate()).
What is even more important is what happens at the left of the graph. Notably, on the blue line you see something that starts with what looks like quadratic behavior, followed by a more linear one. What this tells us is that the compiler places resource limits on its optimizations, something often called an optimization budget. For low enough values of SIZE, the compiler is probably favoring some additional inlining, but beyond a certain limit, it has to disable inlining.
The obvious conclusion is that the code with and without exception handling is not identical. Once you reach the compiler optimization limits, the mere possibility of exceptions changes the generated code in an observable way.
Can we make it worse? Of course we can!
Let’s make the compiler suffer
See, so far, we were quite nice: we only gave a relatively simple code structure to the compiler. If you look at it, there is only one constructor for class A<10>, irrespective of where you are in the computation of the Fibonacci sequence.
The reason for selecting Fibonacci initially, however, was to be able to prevent the compiler from doing that optimization. In other words, we can make it so that the structure for the various classes being used in the computation depend on the path we follow along the computation. This can be done by embedding the actual computation path inside the class name. For example, changing the code a bit, we can do:
extern "C" void construct(), destruct();

template <unsigned N, typename T = void>
struct A
{
    using D = A;
    A() { construct(); }
    ~A() { destruct(); }
    A<N-1, D> l;
    A<N-2, D> r;
    unsigned S() { return l.S() + r.S(); }
};
template<typename T> struct A<0, T> { unsigned S() { return 1; } };
template<typename T> struct A<1, T> { unsigned S() { return 1; } };
int main()
{
    A<SIZE> a;
    return a.S();
}
What we changed with this little update is that we now force the compiler to generate different constructors for all the possible computation paths. Each constructor can throw an exception, so now the landing pads and recovery code for that exception needs to be generated. And that applies recursively to all the objects inside.
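To see why this forces distinct constructors, here is a stripped-down sketch (mine, not the article's code) showing that the same N reached through two different computation paths yields two different types:

```cpp
#include <type_traits>

// Stripped-down version of the path-encoding trick: T records the
// enclosing type, so each node of the recursion tree is a distinct
// instantiation. (A sketch, not the article's exact code.)
template <unsigned N, typename T = void>
struct A {
    using D = A;        // injected-class-name, i.e. A<N, T>
    A<N - 1, D> l;
    A<N - 2, D> r;
};
template <typename T> struct A<0, T> {};
template <typename T> struct A<1, T> {};

// The A<3> reached via A<5> -> A<4> -> A<3> and the A<3> that is a
// direct member of A<5> have the same N but different types:
static_assert(!std::is_same_v<A<3, A<4, A<5>>>, A<3, A<5>>>,
              "same N, different computation path, different type");
```

Since every such type gets its own constructor, and every constructor can throw, the landing pads multiply with the number of paths instead of the number of distinct values of N.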
Pretty quickly, the compiler gives up any idea of inlining, but even so, the number of calls to construct and destruct explodes, and the compiler now struggles with cases as small as 20 (columns: SIZE, compilation time, lines of generated assembly, calls to construct, calls to destruct):
1 0.15 real 0.08 user 0.05 sys 18 0 0
2 0.17 real 0.09 user 0.06 sys 70 1 1
3 0.20 real 0.11 user 0.06 sys 171 2 4
4 0.22 real 0.14 user 0.06 sys 461 4 12
5 0.26 real 0.17 user 0.06 sys 829 7 21
6 0.30 real 0.22 user 0.06 sys 1369 12 34
7 0.41 real 0.32 user 0.07 sys 2383 20 58
8 0.61 real 0.50 user 0.08 sys 3944 33 95
9 0.90 real 0.77 user 0.09 sys 6406 54 154
10 1.29 real 1.15 user 0.10 sys 10535 88 252
11 2.02 real 1.82 user 0.13 sys 17133 143 409
12 3.53 real 3.21 user 0.19 sys 27747 232 662
13 5.60 real 5.09 user 0.29 sys 45065 376 1074
14 9.03 real 8.22 user 0.44 sys 73004 609 1739
15 14.18 real 13.04 user 0.62 sys 118148 986 2814
16 21.17 real 19.28 user 0.84 sys 191337 1596 4556
17 30.51 real 28.37 user 1.17 sys 309677 2583 7373
18 43.44 real 40.48 user 1.49 sys 501095 4180 11930
19 58.55 real 55.71 user 1.70 sys 810959 6764 19306
20 178.14 real 157.27 user 7.28 sys 1312244 10945 31239
21 351.97 real 276.47 user 17.54 sys 2123276 17710 50546
The number of calls generated in the code completely dwarfs what we had in the previous example, and grows completely non-linearly.
So does the overall code size.
In such a case, the difference in code size between compiling with and without exception handling becomes such a constraint that the compiler visibly changes strategy after a given size to limit the overhead of exception handling on the generated code.
In short, by purposefully building a pathological case, we can make exception handling impose an almost arbitrarily large overhead on the generated code and, as a result, de-optimize it much earlier than it otherwise would be.
How bad can the de-optimization be?
Now, one might argue that this is just impacting the exceptional path, but has little impact on the main optimization path. Showing an earlier version of this article to my coworker, he essentially held that position. He notably wrote something that shows quite a bit of maturity in understanding what is required to deal with exceptions:
More specifically, there are no prologues or epilogues that all functions would execute all the time regardless of exceptions thrown or not as was the case before the zero-overhead implementations took over.
And, indeed, the whole point of the exception handling design we pioneered on IA-64, and the reason it became almost ubiquitous today, is that you can perform pretty neat forms of optimizations. That does not mean, however, that enabling exception handling cannot have a huge impact on the main path. To illustrate, let’s make the compiler grind a bit more for us.
After all, in order to generate exception-related work, we can simply have a call to construct or destruct in the innermost class. That’s sufficient, and it simplifies the generated code enough to show the effect beautifully. So the code now looks like this:
extern "C" void construct(), destruct();

template <unsigned N, typename T = void>
struct A
{
    using D = A;
    A() {}
    ~A() {}
    A<N-1, D> l;
    A<N-2, D> r;
    unsigned S() { return l.S() + r.S(); }
};
template<typename T> struct A<0, T>
{
    A() { construct(); }
    ~A() { destruct(); }
    unsigned S() { return 1; }
};
template<typename T> struct A<1, T> { unsigned S() { return 1; } };
int main()
{
    A<SIZE> a;
    return a.S();
}
The generated code for A<5> with exceptions disabled now looks like this, which is pretty neat and tidy:
% c++ -DSIZE=5 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fno-exceptions && cat /tmp/glop.s
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0 sdk_version 12, 3
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
.cfi_startproc
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
callq _construct
callq _construct
callq _construct
callq _destruct
callq _destruct
callq _destruct
movl $8, %eax
popq %rbp
retq
.cfi_endproc
## -- End function
.subsections_via_symbols
Let us now run the exact same code with exception handling. Now, the main path code no longer looks so tidy at all. The main function alone now looks like this:
% c++ -DSIZE=5 -std=c++20 /tmp/glop.cpp -O3 -S -o /tmp/glop.s -fexceptions && cat /tmp/glop.s
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0 sdk_version 12, 3
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
Lfunc_begin0:
.cfi_startproc
.cfi_personality 155, ___gxx_personality_v0
.cfi_lsda 16, Lexception0
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
subq $16, %rsp
leaq -8(%rbp), %rdi
callq __ZN1AILj5EvEC2Ev
Ltmp0:
callq _destruct
Ltmp1:
## %bb.1:
Ltmp3:
callq _destruct
Ltmp4:
## %bb.2:
Ltmp6:
callq _destruct
Ltmp7:
## %bb.3:
movl $8, %eax
addq $16, %rsp
popq %rbp
retq
LBB0_6:
Ltmp8:
movq %rax, %rdi
callq ___clang_call_terminate
LBB0_5:
Ltmp5:
movq %rax, %rdi
callq ___clang_call_terminate
LBB0_4:
Ltmp2:
movq %rax, %rdi
callq ___clang_call_terminate
Lfunc_end0:
.cfi_endproc
.section __TEXT,__gcc_except_tab
.p2align 2
GCC_except_table0:
Lexception0:
.byte 255 ## @LPStart Encoding = omit
.byte 155 ## @TType Encoding = indirect pcrel sdata4
.uleb128 Lttbase0-Lttbaseref0
Lttbaseref0:
.byte 1 ## Call site Encoding = uleb128
.uleb128 Lcst_end0-Lcst_begin0
Lcst_begin0:
.uleb128 Lfunc_begin0-Lfunc_begin0 ## >> Call Site 1 <<
.uleb128 Ltmp0-Lfunc_begin0 ## Call between Lfunc_begin0 and Ltmp0
.byte 0 ## has no landing pad
.byte 0 ## On action: cleanup
.uleb128 Ltmp0-Lfunc_begin0 ## >> Call Site 2 <<
.uleb128 Ltmp1-Ltmp0 ## Call between Ltmp0 and Ltmp1
.uleb128 Ltmp2-Lfunc_begin0 ## jumps to Ltmp2
.byte 1 ## On action: 1
.uleb128 Ltmp3-Lfunc_begin0 ## >> Call Site 3 <<
.uleb128 Ltmp4-Ltmp3 ## Call between Ltmp3 and Ltmp4
.uleb128 Ltmp5-Lfunc_begin0 ## jumps to Ltmp5
.byte 1 ## On action: 1
.uleb128 Ltmp6-Lfunc_begin0 ## >> Call Site 4 <<
.uleb128 Ltmp7-Ltmp6 ## Call between Ltmp6 and Ltmp7
.uleb128 Ltmp8-Lfunc_begin0 ## jumps to Ltmp8
.byte 1 ## On action: 1
Lcst_end0:
.byte 1 ## >> Action Record 1 <<
## Catch TypeInfo 1
.byte 0 ## No further actions
.p2align 2
## >> Catch TypeInfos <<
.long 0 ## TypeInfo 1
Lttbase0:
.p2align 2
## -- End function
.section __TEXT,__text,regular,pure_instructions
.globl __ZN1AILj5EvEC2Ev ## -- Begin function _ZN1AILj5EvEC2Ev
.weak_def_can_be_hidden __ZN1AILj5EvEC2Ev
.p2align 4, 0x90
The really tricky part is that the nice sequence of calls we had before can no longer be inlined: the compiler needs to be able to isolate where an exception was thrown by the bottom constructor, if that happens, and to do so precisely enough to figure out exactly which partially-constructed objects need to be cleaned up.
So this time, the de-optimization is not just a matter of limited resources (after all, the code is pretty small in both cases), but of correctness. The compiler has to generate less efficient code because it needs to distinguish where an exception was thrown, in order to correctly destroy any partially-constructed objects. As a result, it ends up generating a much more complicated call sequence on the main evaluation path.
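The constraint can be illustrated without templates at all. In this sketch (my code, not the article's), construction of an aggregate fails partway through, and correct unwinding must destroy exactly the members already built, which is why each construction call site needs its own landing-pad information:

```cpp
#include <stdexcept>

// Why construction call sites must stay distinguishable: if the Nth
// member's constructor throws, exactly the N-1 already-constructed
// members must be destroyed. (An illustrative sketch of mine.)
static int live = 0;      // Leafs constructed and not yet destroyed
static int fail_at = -1;  // which construction should throw
static int count = 0;     // constructions attempted so far

struct Leaf {
    Leaf() {
        if (++count == fail_at) throw std::runtime_error("boom");
        ++live;
    }
    ~Leaf() { --live; }
};

struct Three { Leaf a, b, c; };  // like three inlined constructions

// Returns how many Leafs leak when construction of Three fails at the
// n-th member; correct unwinding always yields 0.
int leaks_when_failing_at(int n) {
    count = 0; live = 0; fail_at = n;
    try { Three t; } catch (...) {}
    return live;
}
```

If the compiler merged all three construction call sites into one, it could not know whether to destroy zero, one, or two Leaf objects during unwinding; keeping the sites distinguishable is precisely what complicates the main path.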
Conclusion
While C++ compilers have become really good at optimizing even in the presence of exceptions, the conclusions I reached in paragraph 3.8 of the article I wrote in 1999 remain valid today.
In short, you can very often consider that exceptions have no cost in C++, but it is better to verify that hypothesis by actually measuring the performance.