2

Integrate the Incubating Panama Vector API by ChrisHegarty · Pull Request #1231...

 1 year ago
source link: https://github.com/apache/lucene/pull/12311
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Conversation

Contributor

Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21, which is still in development.

benwtrent, Xianzhan, tang-hi, uschindler, jdconrad, dnhatn, jasirkt, msokolov, ChrisHegarty, mayya-sharipova, and 7 more reacted with rocket emoji

Member

@rmuir rmuir

left a comment

edited

When i looked at this before, this was my impl:
https://issues.apache.org/jira/secure/attachment/13022378/LUCENE-9838_standalone.patch

Did a lot of jmh benchmarking at varying vector sizes/hardware to make sure there isn't crazy regressions: here is my benchmark in case it is useful: DotProduct.java.txt

basically I can explain a few differences:

  • lets avoid fma() completely. It can make the whole thing dog slow even in "normal"/"typical" environments. fma is frequently unavailable in production-type environments (e.g. VMS) and openjdk DOES NOT fall back gracefully, it falls back horribly slow. see Ban use of Math.fma across the entire codebase #12014 for more details. And its definitely not good to use fma() in the vector impl but then inconsistently use */+ in the tail.
  • we have to manually unroll at least a little for performance, java doesnt' do it. I recommend just doing it twice like my patch.

edit: attached my old jmh benchmark as a text file here

ChrisHegarty and uschindler reacted with thumbs up emoji

Contributor

Author

Thanks @rmuir - I just merged in your implementation. I think that it's a much much better starting (if not the final) place. This might be a reasonable minimal point to start from.

Before digging too deeply into the performance and optimising the code, I guess I just want to understand if this is the right level to be plugging in at.

Contributor

I'd prefer to have separate apijars, because the current code compiles with patching base module.

I'd like to separate this. But as a start it is ok.

Contributor

On the other hand: it just works! wink

Contributor

Is vector's FMA also always slow (does it use BigDecimal, too?).

Contributor

Author

I'd prefer to have separate apijars, because the current code compiles with patching base module.
On the other hand: it just works! wink

Yeah, this is a bit of a hack!. It would be better to separate these out, but then what would we do, still patch it into java.base or build a slim module around it or something? It doesn't feel any better than just patching it all into java.base!

uschindler reacted with thumbs up emoji

Contributor

Author

Is vector's FMA also always slow (does it use BigDecimal, too?).

I dunno what it does - I haven't looked - but I doubt it falls back to BD. I'll take a look and do some experiments.

ChrisHegarty

changed the title Include the Panama Vector API stubs in the generated 19/20 api jars

Integrate the Incubating Panama Vector API

May 18, 2023

Member

I really wish Math.fma fell back to sane behavior such as */+ and only StrictMath.fma did the slow big decimal stuff! Not good decisionmaking here on these apis.

Contributor

I'd prefer to have separate apijars, because the current code compiles with patching base module.
On the other hand: it just works! wink

Yeah, this is a bit of a hack!. It would be better to separate these out, but then what would we do, still patch it into java.base or build a slim module around it or something? It doesn't feel any better than just patching it all into java.base!

Let's keep it as is. The whole compile is a hack, modules do not matter.

With separate apijar we could also add it to classpath as the package names are not special at all.

Contributor

Author

I refactored the provider and impl's:

  1. So as to separate them out from VectorUtil - this should improve readability, etc, as we move beyond dotProduct.
  2. I also moved them into a it's own non-exported package.

I'm less sure about no.2. The general thought was that the code might be more reusable from there, but now that I think about it, it might be better as package-private where it was, since the "interface" is through VectorUtils - not directly to the imp. Thoughts?

Contributor

Hi, I added spec conformant support for superclasses/interfaces to extractor (recursively collecting classes).
I also added a changes.txt and merged main branch.

Please have a final look and review.

ChrisHegarty reacted with thumbs up emoji

Contributor

@ChrisHegarty Go ahead and merge / backport to 9.x.

ChrisHegarty

merged commit f756f90 into

apache:main

May 25, 2023

4 checks passed

Contributor

Author

Thank you @rmuir and @uschindler, this has been fun! rocket

uschindler reacted with thumbs up emojiuschindler reacted with rocket emoji

Contributor

Hey yeah was fun. I added one commit, 2 minutes too late. I sent to main branch. Was some change to the recursive extraction loop to not add a set to itsself.

ChrisHegarty reacted with thumbs up emoji

Contributor

If you cherry pick, take both commits. I will now update the Java 21 branch for mmap.

ChrisHegarty reacted with thumbs up emoji

Contributor

Merge for PR #12294 was quite simple, it was done mostly automatic. I only had to add the new version number to extractor. I did not add the vector code yet; we can do it in this PR or separately (just some coordination needed).

If there are no code changes for the vector code needed in 21, I will copy over the 20 classes to the 21 folder.

ChrisHegarty reacted with thumbs up emoji

Member

Hmm should we add the generated Panama code to .gitignore? My ./gradlew precommit is failing with:

* What went wrong:
Execution failed for task ':checkWorkingCopyClean'.
> Working copy is not a clean git checkout (skip with -Pvalidation.git.failOnModified=false), offending files:
    - gradle/generation/panama-foreign (untracked non-empty dir)
    - gradle/generation/panama-foreign/ExtractForeignAPI.java (untracked)

Contributor

@mikemccand: This file is a relic in your checkout. Just clean your working copy. The file/folder was deleted by refactoring the build system to support mmap and vectors at same time (files/dirs renamed).

Contributor

I'm seeing very strange results after testing with 768-dim vectors.

                            TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value                                                         
                  HighTermVector      139.27      (6.7%)       57.21      (0.8%)  -58.9% ( -62% -  -55%) 0.000                                                            
               AndHighHighVector      139.15      (6.4%)       57.15      (0.7%)  -58.9% ( -62% -  -55%) 0.000                                                            
                   LowTermVector      138.54      (6.2%)       57.00      (0.8%)  -58.9% ( -62% -  -55%) 0.000                                                            
                   MedTermVector      138.67      (6.9%)       57.23      (0.9%)  -58.7% ( -62% -  -54%) 0.000                                                            
                AndHighMedVector      137.95      (6.1%)       57.10      (0.8%)  -58.6% ( -61% -  -55%) 0.000                                                            
                AndHighLowVector      137.86      (6.4%)       57.21      (0.8%)  -58.5% ( -61% -  -54%) 0.000                                                            
                        PKLookup      199.30      (2.3%)      198.44      (2.5%)   -0.4% (  -5% -    4%) 0.565  

I have double-checked that candidate has 609fc9b and baseline is c9c49bc. Both conditions are run using the same index, tasks, etc. The JFR output is sort of a head-scratcher too. They clearly show that the JDK vector API is being called (I think?) but also show a lot of noisy things (Objects.requireNonNull). Is jdk.incubator.vector.FloatVector#reduceLanesTemplate() really what we want to see there? Another mystery is why org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() would have more samples in the candidate given that it has nothing to do with the vector api. In looking at these you should ignore things related to VectorDictionary - these are loading a word->vector dictionary that is used to look up query vectors; the loading takes a while but happens prior to the index being opened and the queries being executed.

candidate

PERCENT       CPU SAMPLES   STACK                                                                                                                                             
9.00%         2721          jdk.incubator.vector.FloatVector#reduceLanesTemplate()                                                                                            
5.44%         1645          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()                                                                                       
3.89%         1177          org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()                                                                             
3.03%         917           org.apache.lucene.util.packed.DirectReader$DirectPackedReader16#get()                                                                             
2.99%         904           java.lang.invoke.VarHandleGuards#guard_LJ_I()                                                                                                     
2.70%         816           jdk.internal.misc.Unsafe#copyMemoryChecks()                                                                                                       
2.52%         761           perf.VectorDictionary#vectorDiv()                                                                                                                 
2.45%         741           org.apache.lucene.store.DataInput#readVInt()                                                                                                      
2.43%         734           org.apache.lucene.util.packed.DirectMonotonicReader#get()                                                                                         
2.17%         655           java.util.Objects#requireNonNull()                                                                                                                
1.92%         580           jdk.jfr.internal.JVM#emitEvent()                                                                                                                  
1.70%         513           java.util.Arrays#binarySearch0()                                                                                                                  
1.40%         423           org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsReader$OffHeapHnswGraph#seek()                                                               
1.35%         407           java.util.HashMap#resize()                                                                                                                        
1.29%         391           org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#vectorValue()                                                                          
1.29%         389           sun.nio.ch.UnixFileDispatcherImpl#unmap0()                                                                                                        
1.27%         385           org.apache.lucene.util.VectorUtilPanamaProvider#dotProduct()                                                                                      
1.26%         380           jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()                                                                                      
1.23%         373           org.apache.lucene.util.LongHeap#downHeap()                                                                                                        
1.22%         370           jdk.incubator.vector.FloatVector#fromArray0Template()                                                                                             
1.19%         361           java.util.zip.Inflater#inflateBytesBytes()                                                                                                        
1.15%         348           org.apache.lucene.util.hnsw.NeighborQueue#decodeScore()                                                                                           
1.13%         341           org.apache.lucene.util.SparseFixedBitSet#getAndSet()                                                                                              
1.12%         338           org.apache.lucene.util.SparseFixedBitSet#insertLong()                                                                                             
1.02%         308           jdk.internal.misc.Unsafe#copyMemory()                                                                                                             
0.96%         289           jdk.internal.foreign.AbstractMemorySegmentImpl#copy()                                                                                             
0.95%         286           perf.VectorDictionary#<init>()                                                                                                                    
0.94%         283           java.util.Objects#checkIndex()                                                                                                                    
0.88%         267           jdk.internal.foreign.AbstractMemorySegmentImpl#getBaseAndScale()                                                                                  
0.85%         257           org.apache.lucene.store.MemorySegmentIndexInput#readByte()

baseline

PERCENT       CPU SAMPLES   STACK
24.69%        7386          perf.VectorDictionary#vectorDiv()
20.49%        6127          org.apache.lucene.util.VectorUtil#dotProduct()
3.42%         1022          java.nio.FloatBuffer#getArray()
2.65%         792           org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
2.50%         748           perf.VectorDictionary#<init>()
2.34%         699           jdk.internal.misc.Unsafe#checkPrimitivePointer()
1.95%         583           jdk.jfr.internal.JVM#emitEvent()
1.62%         485           java.util.HashMap#getNode()
1.61%         481           org.apache.lucene.util.SparseFixedBitSet#insertLong()
1.50%         450           java.nio.Buffer#position()
1.32%         395           java.util.HashMap#resize()
1.03%         309           java.io.BufferedReader#readLine()
0.98%         292           jdk.internal.misc.Unsafe#checkOffset()
0.91%         272           perf.PKLookupTask#go()
0.90%         270           java.util.zip.Inflater#inflateBytesBytes()
0.85%         253           java.util.HashMap#containsKey()
0.79%         236           org.apache.lucene.util.LongHeap#downHeap()
0.67%         200           org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
0.65%         194           sun.nio.ch.FileChannelImpl#unmap0()
0.57%         170           org.apache.lucene.util.SparseFixedBitSet#getAndSet()
0.53%         159           jdk.internal.util.ArraysSupport#mismatch()
0.51%         154           org.apache.lucene.util.hnsw.HnswGraphSearcher#graphNextNeighbor()
0.48%         145           sun.nio.cs.UTF_8$Decoder#decodeArrayLoop()
0.48%         143           java.io.FileOutputStream#writeBytes()
0.46%         137           org.apache.lucene.store.ByteBufferGuard#getByte()
0.44%         131           org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsReader$OffHeapHnswGraph#seek()
0.41%         123           java.nio.Buffer#checkIndex()
0.40%         120           org.apache.lucene.store.DataInput#readVInt()
0.38%         113           perf.VectorDictionary#vectorNorm()
0.35%         104           org.apache.lucene.util.LongHeap#upHeap()

Contributor

just some extra data about theg above testing: I was using jdk20 for the candidate and jdk17 for the baseline (by accident I guess). Both were using --add-modules jdk.incubator.vector. The machine I ran on is an Intel with avx2 sse4_1 sse4_2 and lots of other flags - I can post if someone thinks it might matter

Contributor

Are you sure you run with default JVM options like tiered compilation and not xbatch?

Are tests running long enough, each run at least a minute?

Contributor

Was this some kind of VM?

Anyways for larger vectors it should run for longer time to get enough optimization rounds. The problem with the vector API is that it needs much longer to get "warm". If you measure only a few seconds on each JVM it just measures the warmup...

Contributor

that's a helpful insight, yes I think we're really only doing a fairly quick run each time through - a handful of seconds. I'll try increasing that -- um no not a VM. No docker or anything like that.

Contributor

Just out of curiosity, are we ok with this sort of class in Lucene? (minimal variable names, many ad hoc if-else etc)
Are some of them auto-generated? (for example lucene/core/src/java20/org/apache/lucene/util/VectorUtilPanamaProvider.java)
What's the standard approach in these scenarios?

Not a polemic, I am genuinely curious because they seem far from being maintainable but I guess they are useful as they bring low level implementations goodies?
Pardon me if it's a silly question, I tend to be overly concerned about not-readable code, but to be fair I didn't spend much time on it so an honest answer could be "it's already as readable as it could be :) "

Contributor

Hi @alessandrobenedetti, the code shown here is indeed crazy to read, but this is more a problem of the APIs in general. The Java Vector API is very low level and you have to exactly know how lanes, species and so on work. The code written by Robert is 100% according to the javadocs guidelines.
There's also a lot of other low level code (codecs like BlockTermsReader) in Lucene, so it is not new. Also MMapDirectory indexinput look like that. They are not beatiful but optimized for performance. To me the variable names are perfectly fine vor vector code (ab, a1b,...). This is typical in that area. It won't get better with other names.

The code on official JDK docs looks identical: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/package-summary.html

The arbitrary if/else constructs are a problem of underlying hardware infratstructure. It is NOT autogenerated, but follows low-level hardware specs, so there are arbitrary looking constants and if/else in it. This can be improved by moving numbers like 128 as constants, be free to make PRs! For performance reasons you should NOT split that up into too many different methods, as the code relies on escape analysis of the VM. We may split it later, but that's more a cleanup approach.

An additional problem in the whole code is that it is Java version specific, so there will me multiple versions of the same code staying in different directories (java20, java21,...). Same for MMapDirectory.

The extraction code for the Java APIs is special and a hack, but it is not part of Lucene's public library; it is a build tool only. Sorry for it, there's a new version now because a rewrite was needed to allow backporting and fix incomplete extraction: #12329 (this version looks much better, also technically bettrer organized).

Contributor

hm I looked more closely at the test I ran and it seems I managed to create a file full of identical vectors -- so this is going to lead to crazy results. WIll follow up once I've managed to fix the vector creation

Contributor

Hi, I changed the CHANGES.txt entry in main and 9.x to correctly refer to ARM's chipset feature (NEON). @rmuir asked me to correct it. See: https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon)

thanks @uschindler for the explanation, I appreciate the work you are doing!

uschindler reacted with thumbs up emoji

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

uschindler

uschindler approved these changes

rmuir

Assignees

No one assigned

Projects

None yet

Milestone

9.7.0

Development

Successfully merging this pull request may close these issues.

vector API integration, plan B

7 participants

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK