Integrate the Incubating Panama Vector API by ChrisHegarty · Pull Request #1231...
source link: https://github.com/apache/lucene/pull/12311
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Conversation
Contributor
Leverage accelerated vector hardware instructions in Vector Search.
Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector
module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).
Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21, which is still in development.
When i looked at this before, this was my impl:
https://issues.apache.org/jira/secure/attachment/13022378/LUCENE-9838_standalone.patch
Did a lot of jmh benchmarking at varying vector sizes/hardware to make sure there isn't crazy regressions: here is my benchmark in case it is useful: DotProduct.java.txt
basically I can explain a few differences:
- lets avoid fma() completely. It can make the whole thing dog slow even in "normal"/"typical" environments. fma is frequently unavailable in production-type environments (e.g. VMS) and openjdk DOES NOT fall back gracefully, it falls back horribly slow. see Ban use of Math.fma across the entire codebase #12014 for more details. And its definitely not good to use fma() in the vector impl but then inconsistently use */+ in the tail.
- we have to manually unroll at least a little for performance, java doesnt' do it. I recommend just doing it twice like my patch.
edit: attached my old jmh benchmark as a text file here
Contributor
Author
Thanks @rmuir - I just merged in your implementation. I think that it's a much much better starting (if not the final) place. This might be a reasonable minimal point to start from. Before digging too deeply into the performance and optimising the code, I guess I just want to understand if this is the right level to be plugging in at. |
Contributor
I'd prefer to have separate apijars, because the current code compiles with patching base module. I'd like to separate this. But as a start it is ok. |
Contributor
On the other hand: it just works! |
Contributor
Is vector's FMA also always slow (does it use BigDecimal, too?). |
Contributor
Author
Yeah, this is a bit of a hack!. It would be better to separate these out, but then what would we do, still patch it into java.base or build a slim module around it or something? It doesn't feel any better than just patching it all into java.base! |
Contributor
Author
I dunno what it does - I haven't looked - but I doubt it falls back to BD. I'll take a look and do some experiments. |
changed the title
Include the Panama Vector API stubs in the generated 19/20 api jars
Integrate the Incubating Panama Vector API
Member
I really wish Math.fma fell back to sane behavior such as */+ and only StrictMath.fma did the slow big decimal stuff! Not good decisionmaking here on these apis. |
Contributor
Let's keep it as is. The whole compile is a hack, modules do not matter. With separate apijar we could also add it to classpath as the package names are not special at all. |
Contributor
Author
I refactored the provider and impl's:
I'm less sure about no.2. The general thought was that the code might be more reusable from there, but now that I think about it, it might be better as package-private where it was, since the "interface" is through VectorUtils - not directly to the imp. Thoughts? |
Contributor
Hi, I added spec conformant support for superclasses/interfaces to extractor (recursively collecting classes). Please have a final look and review. |
Contributor
@ChrisHegarty Go ahead and merge / backport to 9.x. |
Contributor
Author
Thank you @rmuir and @uschindler, this has been fun! |
Contributor
Hey yeah was fun. I added one commit, 2 minutes too late. I sent to main branch. Was some change to the recursive extraction loop to not add a set to itsself. |
Contributor
If you cherry pick, take both commits. I will now update the Java 21 branch for mmap. |
Contributor
Merge for PR #12294 was quite simple, it was done mostly automatic. I only had to add the new version number to extractor. I did not add the vector code yet; we can do it in this PR or separately (just some coordination needed). If there are no code changes for the vector code needed in 21, I will copy over the 20 classes to the 21 folder. |
Member
Hmm should we add the generated Panama code to
|
Contributor
@mikemccand: This file is a relic in your checkout. Just clean your working copy. The file/folder was deleted by refactoring the build system to support mmap and vectors at same time (files/dirs renamed). |
Contributor
I'm seeing very strange results after testing with 768-dim vectors.
I have double-checked that candidate has 609fc9b and baseline is c9c49bc. Both conditions are run using the same index, tasks, etc. The JFR output is sort of a head-scratcher too. They clearly show that the JDK vector API is being called (I think?) but also show a lot of noisy things (Objects.requireNonNull). Is candidate
baseline
|
Contributor
just some extra data about theg above testing: I was using jdk20 for the candidate and jdk17 for the baseline (by accident I guess). Both were using |
Contributor
Are you sure you run with default JVM options like tiered compilation and not xbatch? Are tests running long enough, each run at least a minute? |
Contributor
Was this some kind of VM? Anyways for larger vectors it should run for longer time to get enough optimization rounds. The problem with the vector API is that it needs much longer to get "warm". If you measure only a few seconds on each JVM it just measures the warmup... |
Contributor
that's a helpful insight, yes I think we're really only doing a fairly quick run each time through - a handful of seconds. I'll try increasing that -- um no not a VM. No docker or anything like that. |
Contributor
Just out of curiosity, are we ok with this sort of class in Lucene? (minimal variable names, many ad hoc if-else etc) Not a polemic, I am genuinely curious because they seem far from being maintainable but I guess they are useful as they bring low level implementations goodies? |
Contributor
Hi @alessandrobenedetti, the code shown here is indeed crazy to read, but this is more a problem of the APIs in general. The Java Vector API is very low level and you have to exactly know how lanes, species and so on work. The code written by Robert is 100% according to the javadocs guidelines. The code on official JDK docs looks identical: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/package-summary.html The arbitrary if/else constructs are a problem of underlying hardware infratstructure. It is NOT autogenerated, but follows low-level hardware specs, so there are arbitrary looking constants and if/else in it. This can be improved by moving numbers like 128 as constants, be free to make PRs! For performance reasons you should NOT split that up into too many different methods, as the code relies on escape analysis of the VM. We may split it later, but that's more a cleanup approach. An additional problem in the whole code is that it is Java version specific, so there will me multiple versions of the same code staying in different directories (java20, java21,...). Same for MMapDirectory. The extraction code for the Java APIs is special and a hack, but it is not part of Lucene's public library; it is a build tool only. Sorry for it, there's a new version now because a rewrite was needed to allow backporting and fix incomplete extraction: #12329 (this version looks much better, also technically bettrer organized). |
Contributor
hm I looked more closely at the test I ran and it seems I managed to create a file full of identical vectors -- so this is going to lead to crazy results. WIll follow up once I've managed to fix the vector creation |
Contributor
Hi, I changed the CHANGES.txt entry in main and 9.x to correctly refer to ARM's chipset feature (NEON). @rmuir asked me to correct it. See: https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon) |
thanks @uschindler for the explanation, I appreciate the work you are doing! |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
No one assigned
None yet
Successfully merging this pull request may close these issues.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK