Optimizing the way to Valhalla: JIT Status Update
source link: https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-December/006668.html
Tobias Hartmann
tobias.hartmann at oracle.com
Fri Dec 13 15:59:39 UTC 2019
Hi,

I thought now is a good time to give a high-level status update on the inline type implementation and optimizations in the JIT. The goal is to set realistic expectations about what is currently optimized and what could be added with reasonable engineering effort in the near future. Below, I only distinguish between null-free (.inline) and nullable (.ref) inline types, because that is what the JIT cares about most.

After the LW1 EA binaries were released in July 2018, we were working towards LW2:
- C2 support for LW2-specific features: nullable and non-scalarized inline types, array covariance, the substitutability check, conversions/casting between types, and calling convention changes [1].
- Various performance improvements: Object array accesses, aaload/aastore profiling, reflective invocations, synchronization, lock coarsening, unsafe/hashCode/reflection/array intrinsics, and inline-type-array-specific loop unswitching to mitigate the impact on legacy code [2].
- Other new inline-type-specific C2 optimizations [3].
- Full C1 support for LW2, including the calling convention.
- Stabilization work: fixed ~130 compiler bugs for LW1 and LW2 [4].
- Thousands of lines of new test code and many extensions to our inline-type-specific test framework.

Below are some more details about various optimizations that might be of interest.

Array access (aaload, aastore):
- Optimized flattened load/store if the array type is known at compile time; a runtime call otherwise.
- Optimized runtime checks based on array storage properties.
- Type information is also used to guide the optimization of subsequent code and to omit runtime checks:
  - After successfully storing null, the destination array can't be null-free (-> not flat).
  - After successfully casting an array element to a non-inline type, the source array can't be null-free (-> not flat).
  - After successfully casting an array element to a non-flattenable type, the source array can't be flat.
- Speculate on varargs Object arrays being not null-free (-> not flat).
- Loop unswitching and hoisting based on a flattened-array check, to mitigate the performance impact on Object array accesses.
- New profiling points for the array type, the element type, and whether the array is flat or null-free are collected for both aaload and aastore. We then speculate based on these properties.

Optimized acmp implementation:

    if (a == b) {
        return true;
    } else if (a != NULL && a.isValue() && b != NULL && a.type() == b.type()) {
        // Slow runtime call for the substitutability check
        return ValueBootstrapMethods.isSubstitutable();
    } else {
        return false;
    }

- Based on type system information, C2 is often able to remove parts or all of the above.
- Implicit null checks and knowledge about nullity/flatness are used to improve the remaining checks.
- We currently always delegate the substitutability check to the runtime (-> slow).
- Planned: profiling [5] and an optimized substitutability check [6].

Scalarization in the scope of a compiled method:
- C2 aggressively scalarizes whenever null-free inline types are created, loaded or passed: for example, at defaultvalue, withfield, flattened array/field loads, through inlined calls/returns (including method handles and incremental inlining), and scalarized calls and returns. This means that each field of the inline type is passed individually in registers or on the stack, and no heap allocation is necessary.
- In addition, we attempt to prove or speculate that nullable inline types are null-free and then scalarize these as well. Please note that this is *not* done across call boundaries.

Scalarized calling convention:
- Null-free inline types are passed as arguments and returned in scalarized form. That means that instead of passing/returning a pointer, each field of the inline type is passed/returned individually in registers or on the stack.
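To make the difference concrete, here is a plain-Java analogy of what the scalarized calling convention does conceptually. The `Point` type and method names are hypothetical, purely for illustration; real inline types and the JIT's register-level argument passing are not expressible in source code like this.

```java
// Plain-Java sketch: reference-style vs. scalarized-style argument passing.
// This only illustrates the idea; the real transformation happens inside C2.
public class ScalarizationSketch {
    // Stand-in for a small null-free inline type (hypothetical example type).
    record Point(int x, int y) {}

    // Reference convention: a pointer to a heap object is passed.
    static int lengthSquaredRef(Point p) {
        return p.x() * p.x() + p.y() * p.y();
    }

    // Scalarized convention: each field is passed individually,
    // so no Point object needs to exist at the call boundary.
    static int lengthSquaredScalarized(int x, int y) {
        return x * x + y * y;
    }

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println(lengthSquaredRef(p));           // 25
        System.out.println(lengthSquaredScalarized(3, 4)); // 25
    }
}
```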
- The implementation is very complex because we need to handle mismatches in the calling convention between the interpreter, C1 and C2. The following variants exist:
  - All null-free inline type arguments are scalarized (C2).
  - The inline type receiver is not scalarized (interface or method handle call).
  - No arguments are scalarized (interpreter and C1).
  We can basically have any combination of the above where there is an inconsistency between what the caller passes and what the callee expects (in many cases, the caller does not "know" what the callee expects). To solve that, we need to translate between calling conventions in the adapters/entry points by allocating and packing, or by unpacking. The same problem exists for returns.
- Nullable inline types are *not* scalarized in the calling convention. That is mainly because the VM only supports one compiled version of each method: if we speculatively scalarized an inline type argument, the compiled method could not handle null, and we would need to deoptimize when seeing null (-> huge, unexpected performance impact). Since scalarized adapters are created at link time, we would also not be able to re-compile that method without scalarization, i.e., passing null would always be extremely slow. Related to that, Roland investigated lazy adapter creation a while ago and explained some of the additional problems here [7].
- One option for scalarizing nullable inline types in the calling convention would be to pass an additional, artificial field that can be used to check whether the inline type is null. Compiled code would then "null-check" before using the fields. However, this solution is far from trivial to implement, and the overhead of the additional fields and the runtime checks might cancel out the gains from scalarization. Also, the VM would need to ensure that the type is loaded when the adapters are created at method link time.
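The "artificial field" idea for nullable inline types can be sketched in plain Java as well. Again, `Point` and the method names are hypothetical illustrations of the concept, not Valhalla APIs: the extra boolean plays the role of the artificial field that encodes nullness once the pointer is gone.

```java
// Sketch of scalarizing a *nullable* inline type by passing an extra
// "is it null?" flag alongside the fields (the artificial field from the
// text). Purely illustrative; the real mechanism would live in the JIT.
public class NullChannelSketch {
    // Stand-in for a small inline type (hypothetical example type).
    record Point(int x, int y) {}

    // Nullable reference convention: compiled code null-checks the pointer.
    static int xOrZeroRef(Point p) {
        return p == null ? 0 : p.x();
    }

    // Hypothetical scalarized nullable convention: fields plus a null flag.
    // Compiled code would check the flag before touching the field values.
    static int xOrZeroScalarized(boolean isNull, int x, int y) {
        return isNull ? 0 : x;
    }

    public static void main(String[] args) {
        System.out.println(xOrZeroRef(null));               // 0
        System.out.println(xOrZeroScalarized(false, 7, 8)); // 7
    }
}
```

The trade-off mentioned in the text is visible even in this toy form: every call site carries the extra flag argument and every use pays for the flag check, which is exactly the overhead that might cancel out the benefit of scalarization.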
The currently planned JIT work can be found here:
https://bugs.openjdk.java.net/issues/?filter=36444

In my opinion, the main ongoing challenge is that we don't have a good understanding of what still needs to be done with respect to performance. For example, we don't have any numbers on the performance impact of the calling convention optimization. We also need to evaluate the inline-type-specific profiling that we have: is the current version good enough? Do we need more? In general, we need to identify performance issues and prioritize them. After all, this entire project is solely about performance.

Hope this helps. Please let me know if you have any questions.

Best regards,
Tobias

[1] For example:
https://bugs.openjdk.java.net/browse/JDK-8215477
https://bugs.openjdk.java.net/browse/JDK-8220716
https://bugs.openjdk.java.net/browse/JDK-8215559
https://bugs.openjdk.java.net/browse/JDK-8206139
https://bugs.openjdk.java.net/browse/JDK-8212190
https://bugs.openjdk.java.net/browse/JDK-8211772
[2] For example:
https://bugs.openjdk.java.net/browse/JDK-8227634
https://bugs.openjdk.java.net/browse/JDK-8227463
https://bugs.openjdk.java.net/browse/JDK-8229288
https://bugs.openjdk.java.net/browse/JDK-8222221
[3] For example:
https://bugs.openjdk.java.net/browse/JDK-8220666
https://bugs.openjdk.java.net/browse/JDK-8227180
https://bugs.openjdk.java.net/browse/JDK-8228367
[4] https://bugs.openjdk.java.net/secure/Dashboard.jspa?selectPageId=18410
[5] https://bugs.openjdk.java.net/browse/JDK-8235914
[6] https://bugs.openjdk.java.net/browse/JDK-8228361
[7] https://mail.openjdk.java.net/pipermail/valhalla-dev/2018-April/004093.html