Observing Java 19 JVM optimization with JMH + hsdis + PerfASM: Holy trinity of low-level benchmarking — Part I

This article focuses on installing and using the hsdis library as a disassembler for the JVM's emitted code.

Motivation

While investigating the Java Vector API, it became necessary to inspect the generated code to verify specific hypotheses. Unfortunately, the path to sophisticated benchmarking proved rocky, though ultimately successful:

  • JMH — Java Microbenchmark Harness: A framework broadly used to benchmark Java methods in an isolated and reproducible way.
  • hsdis.so: A disassembler library that turns JVM-generated machine code into human-readable assembly.
  • PerfASM: Linux perf started as a tool to tap into the CPU's performance counters, such as cache misses, branch mispredictions, etc. (Covered in Part II.)

Using these three tools together, it is possible to benchmark method runtimes and tap into the underlying generated code. The generated code helps to understand which optimizations have been applied and which parts of the underlying hardware have been utilized.

Such a toolchain might sound niche, but it is a powerful way to understand how the Just-in-Time (JIT) compiler works. Further, it makes it possible to optimize algorithms around their hardware-level limitations.

Installation

The toolkit works on any modern operating system as long as no virtualization layer abstracts the hardware. Virtualized environments have limitations in passing the CPU's performance counters to the perf toolkit. Benchmark results depend on the platform and can differ based on cache size, micro-architecture, or CPU generation.

This installation guide was written on an Ubuntu 22.04 LTS system targeting Java 19 and should be reproducible accordingly (Java 18 and 20 should be similar). Likewise, it should work on macOS and Windows as long as the OS runs on real hardware, but installation steps and dependencies will differ.

Hsdis Installation

The HotSpot disassembler hsdis is a plugin for the HotSpot JVM which disassembles JIT-compiled native code into mnemonic-based, human-readable assembly language. By default, Java does not ship with any hsdis library; it needs to be compiled for the specific target micro-architecture. In this case, OpenJDK 19 was used:

  1. Check the Java version you want to use hsdis with and clone the matching GitHub repo.
git clone https://github.com/openjdk/jdk
cd jdk
git checkout jdk-19+36

2. Download binutils-2.38: https://ftp.gnu.org/gnu/binutils/
It is important to stick to binutils-2.38; there is a major API change in 2.39+, and there is still an ongoing PR addressing it at this point:
https://github.com/openjdk/jdk/pull/10817

wget https://ftp.gnu.org/gnu/binutils/binutils-2.38.tar.gz

3. Some build dependencies might be needed; the exact set varies depending on what is already installed on the machine.

apt-get install build-essential
apt-get install libasound2-dev
apt-get install libcups2-dev
apt-get install libfontconfig1-dev
apt-get install libx11-dev libxext-dev libxrender-dev libxrandr-dev libxtst-dev libxt-dev

4. Make hsdis-amd64.so:

sh ./configure --with-hsdis=binutils --with-binutils-src=~/binutils-2.38/
make clean build-hsdis

5. The built library then needs to be copied to a location where the JVM can find it, e.g. into the lib directory of the JDK in use (or any directory on LD_LIBRARY_PATH). It can be found at:

.../build/linux-x86_64-server-release/support/hsdis/hsdis-amd64.so

Method

Once the installation is complete, it is time to check that everything works as expected. The following code should be enough to test assembly code generation by the JIT compiler:

package ch.styp;

import java.util.Random;

public class TestHsdis {

    public static void main(String... args) {
        TestHsdis testHsdis = new TestHsdis();
        var size = 2048;
        var left = initFloatArray(size);
        var right = initFloatArray(size);

        for (int i = 0; i <= 10_000; i++) { // enough invocations to trigger the C2 compiler
            var result = testHsdis.addArrays(left, right);
        }
    }

    public static float[] initFloatArray(int length) {
        var floatArray = new float[length];

        Random rand = new Random();
        for (var i = 0; i < length; i++) {
            floatArray[i] = rand.nextFloat();
        }
        return floatArray;
    }

    private float[] addArrays(float[] left, float[] right) { // the method under investigation
        float[] result = new float[left.length];
        for (int i = 0; i < left.length; i++) {
            result[i] = left[i] + right[i];
        }
        return result;
    }

}

The area to investigate in the provided code snippet is the addArrays method, which the JIT compiler should optimize.

The method gets invoked over 10'000 times to trigger the C2 compiler. Java has two JIT compilers, C1 and C2: the first was designed to compile fast and is used where a method is invoked only a few times, whereas the second is used where a little more compile time does not hurt. Spending that compile time pays off when many invocations occur, because the shortened per-invocation runtime offsets the optimization overhead of the JIT compiler.

| Compiler | Calls  |
|----------|--------|
| C1       | 1'500  |
| C2       | 10'000 |

The table shows, according to the Oracle documentation (see the resources section), roughly how many method calls are needed before each JIT compiler kicks in. Using the C2 compiler can be enforced by pushing the invocation count above 10'000, which is why the demo code loops that often.
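To see these thresholds in action, HotSpot's -XX:+PrintCompilation flag logs every compilation event together with its tier (levels 1–3 belong to C1, level 4 to C2). Below is a minimal sketch, not from the original article, that makes the tier transition visible; the class name and the exact log columns are illustrative:

public class WarmupDemo {
    // Run with: java -XX:+PrintCompilation WarmupDemo
    // Once the call count passes the C2 threshold, a log line with tier 4
    // appears for this method, roughly of the form:
    //   1234   25    4    WarmupDemo::square (4 bytes)
    static int square(int x) {
        return x * x;
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 20_000; i++) { // comfortably above the 10'000-call threshold
            sum += square(i);
        }
        System.out.println(sum); // consume the result so the loop is not eliminated as dead code
    }
}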

To unlock the diagnostic options and thereby generate the assembly code, the following JVM flags need to be added:

javac ch/styp/TestHsdis.java
java -XX:+UnlockDiagnosticVMOptions \
    -XX:+PrintAssembly \
    -Xlog:class+load=info \
    -XX:+LogCompilation ch.styp.TestHsdis > test.txt

The JVM accepts options passed as -X or -XX (advanced) command-line arguments. Boolean features are activated by adding a plus sign (-XX:+Feature) and deactivated by adding a minus sign (-XX:-Feature). These options are also handy for forcing the JVM to perform specific optimizations or for suppressing them; for example, -XX:-UseSuperWord switches off C2's auto-vectorization, which is discussed below. Unfortunately, the documentation (see the resources section) is not on par with the feature set, and there is no single resource covering all current features.
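One programmatic way to check what such a boolean flag is currently set to is the JDK's HotSpotDiagnosticMXBean. The short sketch below is not from the original article; UseSuperWord is simply used as an example flag:

import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public class FlagCheck {
    public static void main(String[] args) {
        var diagnostics = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Prints the flag's name, current value, and origin (DEFAULT, COMMAND_LINE, ...).
        System.out.println(diagnostics.getVMOption("UseSuperWord"));
    }
}

Running the sketch with -XX:-UseSuperWord shows the value flipped to false and the origin changed accordingly.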

Results

The generated assembly code can be confusing at first. Finding the right passage is difficult, but the Java line numbers in the comments often indicate where the important part lies. In this case, the output file contained over 3'000 lines of assembly code; these are the most relevant lines:

0x00007f5c3c7b5dd3:   mov     0x18(%rsp),%r10
0x00007f5c3c7b5dd8:   vmovdqu 0x10(%r10,%r8,4),%ymm0
0x00007f5c3c7b5ddf:   mov     0x8(%rsp),%r10
0x00007f5c3c7b5de4:   vaddps  0x10(%r10,%r8,4),%ymm0,%ymm0
0x00007f5c3c7b5deb:   vmovdqu %ymm0,0x10(%rcx,%r8,4)
0x00007f5c3c7b5df2:   mov     0x18(%rsp),%r10
0x00007f5c3c7b5df7:   vmovdqu 0x30(%r10,%r8,4),%ymm0
0x00007f5c3c7b5dfe:   mov     0x8(%rsp),%r10
0x00007f5c3c7b5e03:   vaddps  0x30(%r10,%r8,4),%ymm0,%ymm0
0x00007f5c3c7b5e0a:   vmovdqu %ymm0,0x30(%rcx,%r8,4)
0x00007f5c3c7b5e11:   mov     0x18(%rsp),%r10
0x00007f5c3c7b5e16:   vmovdqu 0x50(%r10,%r8,4),%ymm0
0x00007f5c3c7b5e1d:   mov     0x8(%rsp),%r10
0x00007f5c3c7b5e22:   vaddps  0x50(%r10,%r8,4),%ymm0,%ymm0
0x00007f5c3c7b5e29:   vmovdqu %ymm0,0x50(%rcx,%r8,4)
0x00007f5c3c7b5e30:   mov     0x18(%rsp),%r10
0x00007f5c3c7b5e35:   vmovdqu 0x70(%r10,%r8,4),%ymm0
0x00007f5c3c7b5e3c:   mov     0x8(%rsp),%r10
0x00007f5c3c7b5e41:   vaddps  0x70(%r10,%r8,4),%ymm0,%ymm0
0x00007f5c3c7b5e48:   vmovdqu %ymm0,0x70(%rcx,%r8,4)   ;*fastore {reexecute=0 rethrow=0 return_oop=0}
                                                       ; - ch.styp.TestHsdis::addArrays@27 (line 31)
                                                       ; - ch.styp.TestHsdis::main@37 (line 14)
0x00007f5c3c7b5e4f:   add     $0x20,%r8d               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                       ; - ch.styp.TestHsdis::addArrays@28 (line 30)
                                                       ; - ch.styp.TestHsdis::main@37 (line 14)
                                                       ; {no_reloc}
0x00007f5c3c7b5e53:   cmp     %r9d,%r8d
0x00007f5c3c7b5e56:   jl      0x00007f5c3c7b5dd3       ;*goto {reexecute=0 rethrow=0 return_oop=0}

Don’t worry; a more detailed explanation will follow. The code will be dissected in the discussion section to show which optimization methods the JVM applied to the provided code sample.

Pro-tip: often, it is enough to search for "C2-" in the output, since the area of interest is C2-compiled, as enforced in the example. This trick helps to get into the ballpark of the line where the magic happens.

Discussion

So what did we learn apart from how to install hsdis for Java 19? A lot!

By looking closer at the generated code, we can already spot two optimizations that the JIT applies to the snippet.

Auto-vectorization

For anyone who has read the previous post, the term auto-vectorisation shouldn't be new or unknown. For those starting on this topic: vectorisation at the CPU level is the ability to process multiple data elements simultaneously, provided the same operation is applied to all of them. In the provided case, vectorization is possible because the loop applies the same plus operator (+) across two arrays. Instead of adding one array position after the other, the optimization can take a batch of indices and add them all at once. SIMD parallelism (vectorisation) is a CPU feature and varies depending on the platform.

The assembly code reveals how the code ran on the CPU:

0x00007f5c3c7b5de4:   vaddps 0x10(%r10,%r8,4),%ymm0,%ymm0

The mnemonic vaddps reads as 'Add Packed Single-Precision Floating-Point Values', with the v prefix marking the AVX-encoded form, indicating the CPU leveraged its vector extensions.

0x00007f5c3c7b5de4:   vaddps  0x10(%r10,%r8,4),%ymm0,%ymm0
...
0x00007f5c3c7b5e03:   vaddps  0x30(%r10,%r8,4),%ymm0,%ymm0

Further, the distance between these two memory offsets (0x10 and 0x30) is 0x20, i.e. 32 bytes. Since a standard Java float consists of 4 bytes, eight numbers are processed in the CPU at once. The Intel assembly guide confirms this: the %ymm0 register is 256 bits wide and holds up to eight 32-bit values (AVX). So although the example is a simple sequential piece of code, the C2 compiler optimized it to leverage the AVX instruction set!
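Incidentally, this is the same shape of code one would write by hand with the incubating Vector API mentioned in the motivation. The following sketch is not from the article and assumes Java 19 with the incubator module added (--add-modules jdk.incubator.vector); it expresses addArrays with explicit 256-bit vectors:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class ExplicitAdd {
    // A 256-bit species to match the ymm registers seen in the assembly above.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

    static float[] addArrays(float[] left, float[] right) {
        float[] result = new float[left.length];
        int i = 0;
        int upperBound = SPECIES.loopBound(left.length);
        for (; i < upperBound; i += SPECIES.length()) { // 8 floats per step
            FloatVector l = FloatVector.fromArray(SPECIES, left, i);
            FloatVector r = FloatVector.fromArray(SPECIES, right, i);
            l.add(r).intoArray(result, i); // one lane-wise add, like vaddps
        }
        for (; i < left.length; i++) { // scalar tail for the remaining elements
            result[i] = left[i] + right[i];
        }
        return result;
    }
}

In portable code, FloatVector.SPECIES_PREFERRED would be used instead of hard-coding SPECIES_256, letting the JVM pick the widest vector width the hardware supports.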

Loop Unrolling

Loop unrolling is a technique that optimizes code to use fewer instructions for the same work. The key idea is to reduce the number of compare and jump instructions and execute more "meaningful" operations in each loop iteration.

The following passage shows a piece of normal Java code:

for (int i = 0; i < 100; i++){
    do_something(i);
}

When executing the program, the CPU runs the do_something(i) part, then has to increment the variable i by one and check the branch condition i < 100. If the condition is true, the CPU must jump back to the top of the loop. Translated to a schematic assembly sequence:

add data(i)  (doing the actual work)
add i + 1    (increment)
cmp          (compare)
jl           (jump if less, back to the top)

The result is approximately 400 instructions (100 iterations × 4 instructions) executed by the CPU to chew through this loop with 100 elements.

An unrolled loop, on the other hand, does multiple units of work per iteration and increments the counter by more than one. Translated to the same schematic assembly:

add data (0)
add data (1)
add data (2)
add data (3)
add i + 4    (increment)
cmp          (compare)
jl           (jump if less, back to the top)

The result is approximately 175 instructions for the same 100 elements (100 work instructions plus 25 × 3 loop-overhead instructions), as in this example:

for (int i = 0; i < 100; i += 4){
    do_something(i + 0);
    do_something(i + 1);
    do_something(i + 2);
    do_something(i + 3);
}

Although this comparison is somewhat apples and pears, the loop efficiency increases drastically, and consequently the code runs faster. A downside is that the code needs more space, as loop unrolling produces a longer instruction sequence in the emitted assembly.

Pro-tip: do not unroll your loops manually! The C1 and C2 compilers are clever enough to do this for any target platform. Manual loop unrolling might increase performance in some cases, but that is limited to edge cases and highly platform-specific deployments.

Conclusion

To conclude this piece, the key takeaway should not only be the setup of hsdis.so to generate assembly code out of Java, but also an eye-opener to the JIT compiler's ability to optimize code. It is remarkable that a simple piece of code like this gets optimized with two sophisticated techniques: auto-vectorisation and loop unrolling.

I used this particular toolchain to investigate the low-level behavior for my previous blog post on the Java Vector API.

With all these clever optimizations in place, beating the JVM's C1 and C2 compilers is challenging, as they are well optimized and portable across platforms; outperforming them is only possible in edge cases or on particular devices.

Acknowledgment

  • Marc Juchli: For his valuable input in improving the readability and fixing some coherency issues.
  • Kirusanth Poopalasingam: For his valuable input in some key passages of the blog post.

Resources

