Hey I wrote this about 6 months ago, nice to see it here! AMA, but please note that this is SoTA territory and things have changed significantly since then. Notably, folks are now seeing good preliminary results with SBERT (sentence level encodings instead of token-level): https://www.aclweb.org/anthology/D19-1410.pdf
I'm really interested in this general area of generating good candidate queries from documents, but haven't spent much time on it. I haven't seen it in production, and I don't think the topic gets as much attention as it deserves, because intuitively it sounds like a really good idea. So thanks for the paper!
Results for R@1000 look pretty impressive, and I'll check out the project code. Given the high recall and low MRR, using this for the initial recall step with a rerank is definitely worth looking at. If the high recall carries over to your own data and you can rerank those top 1k to increase precision, then you've got something good.
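To make the two-stage idea concrete, here's a toy sketch (not the paper's pipeline; the corpus, both scoring functions, and the cutoff are made up). A cheap pass over the whole corpus produces a high-recall candidate set, and a pretend "expensive" scorer reorders only those survivors, standing in for BM25 and a neural reranker respectively:

```python
# Toy two-stage retrieval: cheap term-overlap recall over the whole
# corpus, then a (pretend) expensive reranker over the candidates only.
corpus = {
    1: "cheap flights to tokyo japan with many extra filler words",
    2: "tokyo flights",
    3: "best ramen in tokyo",
    4: "flights to paris",
}
query = "flights tokyo"

def cheap_score(q, doc):
    # Stand-in for BM25 / an inverted-index lookup: raw term overlap.
    return len(set(q.split()) & set(doc.split()))

def expensive_score(q, doc):
    # Stand-in for a neural reranker: Jaccard, purely for illustration.
    qs, ds = set(q.split()), set(doc.split())
    return len(qs & ds) / len(qs | ds)

# Stage 1: high-recall candidate set (top-k by the cheap score).
k = 3
candidates = sorted(corpus, key=lambda i: cheap_score(query, corpus[i]),
                    reverse=True)[:k]

# Stage 2: rerank only those candidates with the expensive score.
reranked = sorted(candidates, key=lambda i: expensive_score(query, corpus[i]),
                  reverse=True)
print(reranked)
```

Note the short doc 2 wins the rerank even though the verbose doc 1 tied it on raw overlap; that precision-at-the-top gain is exactly what the second stage is for.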
Good question. Averaging can work for short documents, but it becomes less useful as the document text gets longer. When dealing with documents of multiple paragraphs, I've been gravitating more towards asking "what's the most relevant part of the document?". SBERT works very well for this - you can pluck the document's most similar sentence(s) for the given query, and use that as a signal for the most relevant piece of information.
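A minimal sketch of the "pluck the best sentence" idea. In practice the vectors would come from an SBERT model (e.g. the sentence-transformers library); here they're tiny made-up embeddings so the mechanics are visible:

```python
import numpy as np

# Made-up sentence embeddings for one document; in practice these
# would be SBERT encodings of each sentence.
doc_sentences = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
], dtype=float)
query = np.array([0.1, 0.9, 0.1], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score every sentence against the query; the argmax is the
# "most relevant part of the document" signal.
scores = [cosine(s, query) for s in doc_sentences]
best = int(np.argmax(scores))
print(best, round(scores[best], 3))
```

The max sentence score can then feed a ranking function directly, instead of averaging the whole document into one diluted vector.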
Of course there are many caveats here, and this is by no means a solved problem, but it works well for queries that are more than a couple of terms.
I'm not sure I understand in practice how that would work. You have sentence embeddings and a query embedding. You're comparing groups of sentences from a document to the query. But how do you decide how to group the sentences? Are you just up-front grouping similar sentences somehow?
Building on @m_ke's note: once you have the sentence embeddings indexed, query similarity against those sentences becomes an additional signal you can build on. For example, you can weight sentences that appear earlier in the document more heavily, take the top N sentences and use those, apply clustering to find hotspots, etc.
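Two of those aggregations, sketched with made-up numbers (the per-sentence scores and the position-decay factor are invented for illustration; real scores would be cosine similarities from the index):

```python
import numpy as np

# Hypothetical per-sentence query similarities for one document.
sent_scores = np.array([0.82, 0.31, 0.77, 0.45, 0.12])

# Signal 1: position-weighted average -- earlier sentences count
# for more (0.9 decay is an arbitrary choice).
weights = 0.9 ** np.arange(len(sent_scores))
position_weighted = float((sent_scores * weights).sum() / weights.sum())

# Signal 2: mean of the top-N sentence scores (N=2 here).
top_n = float(np.sort(sent_scores)[-2:].mean())

print(round(position_weighted, 3), round(top_n, 3))
```

Each aggregate becomes one document-level number you can feed into whatever ranking function sits on top.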
I'd treat it as a base signal like BM25 for a field - which is frequently tuned and used as one of many features for relevance.
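In code, "one of many features" is just a tunable weighted sum; the feature names, values, and weights below are all made up for illustration:

```python
# Sentence-similarity signal treated like any other relevance feature,
# combined with (hypothetical) BM25 field scores via tunable weights.
def combined_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

doc_features = {"bm25_title": 7.2, "bm25_body": 3.1, "best_sentence_sim": 0.83}
weights = {"bm25_title": 1.0, "bm25_body": 0.5, "best_sentence_sim": 4.0}

score = combined_score(doc_features, weights)
print(score)
```

Hand-tuning those weights is the manual stage; learning-to-rank replaces it with weights fit against a judgment list.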
Absolutely! We call it a "judgement list" in search relevance land, and one is required for training and testing a relevance config/model...though the training is often manual, until your configuration and infrastructure are mature enough for learning-to-rank.
@petulla we've reached the thread depth limit so I can't reply to your last question, but typically MSMarco is used for SOTA testing of deep learning for search: https://microsoft.github.io/msmarco/
It really depends on your application but you can come up with a lot of different aggregation schemes to rerank top results based on the whole document.
Any experience clustering or classifying documents based on these high dimensional vectors? Also what have you found of dimension reduction techniques such as UMAP or good old PCA?
you mentioned the prohibitive size of the vectorizations of documents -- what role, if any, have matrix/tensor decompositions or tensor networks played in helping the search community with this?
I've never seen Tensor Networks used in search, but it wouldn't surprise me if Google, Microsoft, Yandex, and other web-scale search engines were using them, because at their scale it's extraordinarily important for size reduction. We know Google is using BERT (or a newer variant) in production, so they must be doing all kinds of things to lighten the load.
From my perspective of usually dealing with much smaller corpora, SBERT helps in a couple of ways here in practice: it reduces size requirements by an order of magnitude, and it represents sentence context more directly than blending individual tokens - which IMO more closely matches the information need behind longer queries.