TL;DR
This paper investigates what BERT learns for information retrieval by using diagnostic datasets based on retrieval heuristics, revealing that BERT outperforms traditional models but does not adhere to expected axioms on large-scale web corpora.
Contribution
It applies axiomatic dataset analysis to BERT in IR, showing limitations of current axioms and suggesting the need for new diagnostic criteria for large-scale data.
Findings
BERT outperforms the traditional query likelihood retrieval model by 40% on a large-scale web corpus.
BERT does not adhere to common retrieval axioms in diagnostic datasets.
Current axiomatic analysis may not be suitable for large-scale IR datasets.
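To make the idea of a retrieval axiom concrete, here is a minimal sketch of how one diagnostic check could be scored. It uses TFC1, a well-known term-frequency axiom (for a single-term query and two documents of equal length, the document containing the term more often should rank higher). This summary does not list which axioms the paper tests, so the choice of TFC1 is an assumption, as are the helper names `term_count` and `tfc1_agreement` and the whitespace tokenization. The `score` callable stands in for any ranker, such as a BERT-based scorer.

```python
from typing import Callable, Iterable, Optional, Tuple


def term_count(term: str, doc: str) -> int:
    """Count occurrences of `term` in `doc` using naive whitespace tokens."""
    return doc.lower().split().count(term.lower())


def tfc1_agreement(
    query_term: str,
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str, str], float],
) -> Optional[float]:
    """Fraction of qualifying document pairs on which `score` agrees with TFC1.

    A pair qualifies when both documents have the same length (in tokens)
    and contain the query term a different number of times.
    Returns None when no pair qualifies.
    """
    agree = 0
    total = 0
    for d1, d2 in pairs:
        if len(d1.split()) != len(d2.split()):
            continue  # TFC1 only applies to equal-length documents
        c1, c2 = term_count(query_term, d1), term_count(query_term, d2)
        if c1 == c2:
            continue  # the axiom makes no prediction for ties
        total += 1
        higher, lower = (d1, d2) if c1 > c2 else (d2, d1)
        if score(query_term, higher) > score(query_term, lower):
            agree += 1
    return agree / total if total else None


# Hypothetical usage: a plain term-frequency scorer satisfies TFC1 by design.
example_pairs = [
    ("bert ranks bert well", "bert ranks text well"),  # qualifies
    ("a b c", "a b"),                                  # skipped: lengths differ
]
tf_agreement = tfc1_agreement("bert", example_pairs, term_count)
```

A ranker that "adheres" to the axiom would show agreement close to 1.0 on such pairs. The finding above is that BERT does not show this behavior on the diagnostic datasets, even though its retrieval effectiveness is higher.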
Abstract
Word embeddings, made widely popular in 2013 with the release of word2vec, have become a mainstay of NLP engineering pipelines. Recently, with the release of BERT, word embeddings have moved from the term-based embedding space to the contextual embedding space: each term is no longer represented by a single low-dimensional vector; instead, each term and its context determine the vector weights. BERT's setup and architecture have been shown to be general enough to be applicable to many natural language tasks. Importantly for Information Retrieval (IR), in contrast to prior deep learning solutions to IR problems which required significant tuning of neural net architectures and training regimes, "vanilla BERT" has been shown to outperform existing retrieval algorithms by a wide margin, including on tasks and corpora that have long resisted retrieval effectiveness gains over…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Adam · WordPiece · Dense Connections · Multi-Head Attention · Attention Dropout · Weight Decay
