A little confusing because "vector" here (largely) refers to two different things. "Vector search" being this ANN thing, but the "Vector API" is about SIMD. SIMD provides CPU operations on a bunch of data at a time, i.e. instead of one instruction for each 32-bit float, you operate on, depending on the CPU, 128 or 256 or 512 bits worth of floats at the same time. So, over scalar code, SIMD here could get maybe a 4-16x improvement (give or take a lot - things here are pretty complicated). So, while definitely a significant change, I wouldn't say it's at the make-or-break level.
As add-on to this comment: There's another Lucene issue from 2 weeks ago that provides some more details on different approaches that were considered: https://github.com/apache/lucene/issues/12302
Great explanation. But to be clear to those who don't follow:
SIMD is supported by Java out of the box but the optimizer might miss some opportunities. With this API it is far more likely that SIMD will be used if it's available and on first compilation so performance should be improved.
Lucene here just dealing with plain float[]s, so Valhalla at least shouldn't affect it much. It seems the limiting thing here is that it has sum accumulators, which the optimizer can't reorder because addition isn't associative.
>> How did people do this before Lucene supported it?
By performing query expansion based on features of documents within the search results. Very efficient and effective if you have indexed the right features.
This will be a big deal if Lucene got competitive on http://ann-benchmarks.com if it became a serious alternative (and more holistic) than the vector databases.
But it comes with continued challenges if I understand:
- Panama is an incubating API and Java has taken its time having an official way of using SIMD. It could all change in Java 22
- It only works on Java 20, with a very specific set of flags passed to the JVM. It’ll take time for this change to make it into Elasticsearch and Solr
- Panama itself is a weird and very low level API.
- Lucene organizes the HNSW vector index graph alongside its inverted index segments. And these need to be merged/compacted periodically. Merging HNSW graphs, as I understand it, is computationally difficult as the graph gets rebuilt.
Elasticsearch/Lucene do not use HNSWLib, they use their own Java implementation of the HNSW algorithm. But Java, being a virtual machine, doesn't have access to some lower level CPU instructions known to make certain calculations of vector similarity fast (aka SIMD).
Java has taken its sweet time exposing these optimizations in a consistent way. They're available by turning on a flag. But the API for using these optimizations is fairly brittle and could change in the next Java version.
But long story short, they implemented something to turn on the flag, and improve their own HNSW performance. Fingers crossed Java gets its act together.
1. Lucene is trying to get Approximate Nearest Neighbours (ANN) search working for semantic search purposes: https://issues.apache.org/jira/browse/LUCENE-9004 https://github.com/apache/lucene/issues/10047
2. The Panama Vector API allows CPU's that support it to accelerate vector operations: https://openjdk.org/jeps/438
So this allows fast ANN on Lucene for semantic search!
How did people do this before Lucene supported it? Only through entirely different tools?