Hey I wrote this about 6 months ago, nice to see it here! AMA, but please note that this is SoTA territory and things have changed significantly since then. Notably, folks are now seeing good preliminary results with SBERT (sentence level encodings instead of token-level): https://www.aclweb.org/anthology/D19-1410.pdf
I'm really interested in this general area of generating good candidate queries from documents, but haven't spent much time on it. I haven't seen it in production, and I don't think the topic gets as much attention as it deserves, because intuitively it sounds like a really good idea. So thanks for the paper!
Results for R@1000 look pretty impressive, and I'll check out the project code. Given the high recall and low MRR, using this for the initial recall step with a rerank is definitely worth looking at. If the high recall carries over to your own data and you can rerank those top 1k to increase precision, then you've got something good.
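To make the two-stage idea concrete, here's a toy sketch (not the paper's pipeline; the corpus, both scoring functions, and the cutoff are made up). A cheap pass over the whole corpus produces a high-recall candidate set, and a pretend "expensive" scorer reorders only those survivors, standing in for BM25 and a neural reranker respectively:

```python
# Toy two-stage retrieval: cheap term-overlap recall over the whole
# corpus, then a (pretend) expensive reranker over the candidates only.
corpus = {
    1: "cheap flights to tokyo japan with many extra filler words",
    2: "tokyo flights",
    3: "best ramen in tokyo",
    4: "flights to paris",
}
query = "flights tokyo"

def cheap_score(q, doc):
    # Stand-in for BM25 / an inverted-index lookup: raw term overlap.
    return len(set(q.split()) & set(doc.split()))

def expensive_score(q, doc):
    # Stand-in for a neural reranker: Jaccard, purely for illustration.
    qs, ds = set(q.split()), set(doc.split())
    return len(qs & ds) / len(qs | ds)

# Stage 1: high-recall candidate set (top-k by the cheap score).
k = 3
candidates = sorted(corpus, key=lambda i: cheap_score(query, corpus[i]),
                    reverse=True)[:k]

# Stage 2: rerank only those candidates with the expensive score.
reranked = sorted(candidates, key=lambda i: expensive_score(query, corpus[i]),
                  reverse=True)
print(reranked)
```

Note the short doc 2 wins the rerank even though the verbose doc 1 tied it on raw overlap; that precision-at-the-top gain is exactly what the second stage is for.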
Good question. Averaging can work for short documents, but it becomes less useful as the document text gets longer. When dealing with documents of multiple paragraphs, I've been gravitating more towards asking "what's the most relevant part of the document?". SBERT works very well for this - you can pluck the document's most similar sentence(s) for the given query, and use that as a signal for the most relevant piece of information.
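A minimal sketch of the "pluck the best sentence" idea. In practice the vectors would come from an SBERT model (e.g. the sentence-transformers library); here they're tiny made-up embeddings so the mechanics are visible:

```python
import numpy as np

# Made-up sentence embeddings for one document; in practice these
# would be SBERT encodings of each sentence.
doc_sentences = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
], dtype=float)
query = np.array([0.1, 0.9, 0.1], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score every sentence against the query; the argmax is the
# "most relevant part of the document" signal.
scores = [cosine(s, query) for s in doc_sentences]
best = int(np.argmax(scores))
print(best, round(scores[best], 3))
```

The max sentence score can then feed a ranking function directly, instead of averaging the whole document into one diluted vector.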
Of course there are many caveats here, and this is by no means a solved problem, but it works well for queries that are more than a couple of terms.
I'm not sure I understand in practice how that would work. You have sentence embeddings and a query embedding. You're comparing groups of sentences from a document to the query. But how do you decide how to group the sentences? Are you just up-front grouping similar sentences somehow?
Building on @m_ke's note: once you have the sentence embeddings indexed, query similarity against those sentences becomes an additional signal you can build on. For example, you can weight sentences that appear earlier in the document more heavily, take the top N sentences and use those, apply clustering to find hotspots, etc.
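Two of those aggregations, sketched with made-up numbers (the per-sentence scores and the position-decay factor are invented for illustration; real scores would be cosine similarities from the index):

```python
import numpy as np

# Hypothetical per-sentence query similarities for one document.
sent_scores = np.array([0.82, 0.31, 0.77, 0.45, 0.12])

# Signal 1: position-weighted average -- earlier sentences count
# for more (0.9 decay is an arbitrary choice).
weights = 0.9 ** np.arange(len(sent_scores))
position_weighted = float((sent_scores * weights).sum() / weights.sum())

# Signal 2: mean of the top-N sentence scores (N=2 here).
top_n = float(np.sort(sent_scores)[-2:].mean())

print(round(position_weighted, 3), round(top_n, 3))
```

Each aggregate becomes one document-level number you can feed into whatever ranking function sits on top.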
I'd treat it as a base signal like BM25 for a field - which is frequently tuned and used as one of many features for relevance.
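In code, "one of many features" is just a tunable weighted sum; the feature names, values, and weights below are all made up for illustration:

```python
# Sentence-similarity signal treated like any other relevance feature,
# combined with (hypothetical) BM25 field scores via tunable weights.
def combined_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

doc_features = {"bm25_title": 7.2, "bm25_body": 3.1, "best_sentence_sim": 0.83}
weights = {"bm25_title": 1.0, "bm25_body": 0.5, "best_sentence_sim": 4.0}

score = combined_score(doc_features, weights)
print(score)
```

Hand-tuning those weights is the manual stage; learning-to-rank replaces it with weights fit against a judgment list.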
Absolutely! We call it a "judgement list" in search relevance land, and one is required for training and testing a relevance config/model...though the training is often manual, until your configuration and infrastructure are mature enough for learning-to-rank.
@petulla we've reached the thread depth limit so I can't reply to your last question, but typically MSMarco is used for SOTA testing of deep learning for search: https://microsoft.github.io/msmarco/
It really depends on your application but you can come up with a lot of different aggregation schemes to rerank top results based on the whole document.
Any experience clustering or classifying documents based on these high dimensional vectors? Also what have you found of dimension reduction techniques such as UMAP or good old PCA?
you mentioned the prohibitive size of the vectorizations of documents -- what role, if any, have matrix/tensor decompositions or tensor networks played in helping the search community with this?
I've never seen Tensor Networks used in search, but it wouldn't surprise me if Google, Microsoft, Yandex, and other web-scale search engines were using them, because at their scale it's extraordinarily important for size reduction. We know Google is using BERT (or a newer variant) in production, so they must be doing all kinds of things to lighten the load.
From my perspective of usually dealing with much smaller corpora, SBERT helps in a couple of ways here in practice: it reduces size requirements by an order of magnitude, and it represents sentence context more directly than blending individual tokens - which IMO more closely matches the information need behind longer queries.