Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Java Panama Vector API Integrated with Apache Lucene (github.com/apache)
82 points by kurhan on May 27, 2023 | hide | past | favorite | 16 comments


A little hard to understand why this is cool, but if I understand correctly:

1. Lucene is trying to get Approximate Nearest Neighbours (ANN) search working for semantic search purposes: https://issues.apache.org/jira/browse/LUCENE-9004 https://github.com/apache/lucene/issues/10047

2. The Panama Vector API allows CPU's that support it to accelerate vector operations: https://openjdk.org/jeps/438

So this allows fast ANN on Lucene for semantic search!

How did people do this before Lucene supported it? Only through entirely different tools?


A little confusing because "vector" here (largely) refers to two different things. "Vector search" being this ANN thing, but the "Vector API" is about SIMD. SIMD provides CPU operations on a bunch of data at a time, i.e. instead of one instruction for each 32-bit float, you operate on, depending on the CPU, 128 or 256 or 512 bits worth of floats at the same time. So, over scalar code, SIMD here could get maybe a 4-16x improvement (give or take a lot - things here are pretty complicated). So, while definitely a significant change, I wouldn't say it's at the make-or-break level.


As add-on to this comment: There's another Lucene issue from 2 weeks ago that provides some more details on different approaches that were considered: https://github.com/apache/lucene/issues/12302


Great explanation. But to be clear to those who don't follow:

SIMD is supported by Java out of the box but the optimizer might miss some opportunities. With this API it is far more likely that SIMD will be used if it's available and on first compilation so performance should be improved.


> SIMD is supported by Java out of the box but the optimizer might miss some opportunities

It's a little limited due to how objects are stored in memory. Might improve with Valhalla.


Lucene here just dealing with plain float[]s, so Valhalla at least shouldn't affect it much. It seems the limiting thing here is that it has sum accumulators, which the optimizer can't reorder because addition isn't associative.

(for reference: scalar impl: https://github.com/ChrisHegarty/lucene/blob/dd4eaac9af346a21... and SIMD impl: https://github.com/ChrisHegarty/lucene/blob/dd4eaac9af346a21...)


>> How did people do this before Lucene supported it?

By performing query expansion based on features of documents within the search results. Very efficient and effective if you have indexed the right features.


This will be a big deal if Lucene got competitive on http://ann-benchmarks.com if it became a serious alternative (and more holistic) than the vector databases.

But it comes with continued challenges if I understand:

- Panama is an incubating API and Java has taken its time having an official way of using SIMD. It could all change in Java 22

- It only works on Java 20, with a very specific set of flags passed to the JVM. It’ll take time for this change to make it into Elasticsearch and Solr

- Panama itself is a weird and very low level API.

- Lucene organizes the HNSW vector index graph alongside its inverted index segments. And these need to be merged/compacted periodically. Merging HNSW graphs, as I understand it, is computationally difficult as the graph gets rebuilt.


Elasticsearch is already on Java 20: https://github.com/elastic/elasticsearch/pull/95373


Great if this forces Java upgrades.

Using Solr with Java 8 is still quite common.


The newer GC options should have already forced people to upgrade. A use case like search churns through memory.

Maybe just what I've seen but Solr usually has users sticking with older versions of Java.


From Solr 8 on you need a Java 11 runtime.


Solr 9, for Solr 8, Java 8 is still used.

https://solr.apache.org/guide/8_11/solr-system-requirements....

Which is a pain, many products that use Solr as dependency like Sitecore, still lag behind in Solr versions.

https://www.searchstax.com/docs/searchstax-cloud-sitecore-so...

While it mostly works with more recent versions, there is no support if issues cannot be replicated on officially supported versions.


Can someone answer this - today elasticsearch/lucene does vector search using HNSWLib.

Will this allow lucene to drop hnswlib and get similar performance using native java ?


Elasticsearch/Lucene do not use HNSWLib, they use their own Java implementation of the HNSW algorithm. But Java, being a virtual machine, doesn't have access to some lower level CPU instructions known to make certain calculations of vector similarity fast (aka SIMD).

Java has taken its sweet time exposing these optimizations in a consistent way. They're available by turning on a flag. But the API for using these optimizations is fairly brittle and could change in the next Java version.

But long story short, they implemented something to turn on the flag, and improve their own HNSW performance. Fingers crossed Java gets its act together.


The issue you link to was marked resolved in 2020. That's indeed what opensearch and elasticsearch are using. Solr too probably (not sure).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: