MultiContrievers: Analysis of Dense Retrieval Representations
Seraphina Goldfarb-Tarrant, Pedro Rodriguez, Jane Dwivedi-Yu, Patrick Lewis

TL;DR
This paper analyzes what information dense retrieval models like Contriever preserve or lose compared to underlying language models, revealing insights into their information content, bias, and sensitivity to initial conditions.
Contribution
It provides the first detailed analysis of what information dense retrievers retain, reporting high extractability, gender bias that is not attributable to the representations, and sensitivity to random initialization.
Findings
Contriever models have higher extractability of information than the underlying language models.
Extractability correlates poorly with retrieval performance.
Gender bias exists but is not caused by representations.
Abstract
Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how it affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., BERT versus Contriever). We use 25 MultiBERT checkpoints as randomized initialisations to train MultiContrievers, a set of 25 Contriever models. We test whether specific pieces of information -- such as gender and occupation -- can be extracted from Contriever vectors of Wikipedia-like documents. We measure this extractability via information theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1)…
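The abstract does not say which form of information theoretic probing is used. As a minimal sketch, assuming the online-code variant of minimum description length (MDL) probing from Voita and Titov (2020), extractability can be read as how cheaply a probe lets you transmit the labels given the vectors. Everything below is hypothetical illustration: the toy vectors stand in for retriever embeddings, the binary label stands in for a property such as gender, and the logistic-regression probe, block fractions, and function names are assumptions, not the authors' setup.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs, ys, steps=100, lr=0.5):
    """Fit a binary logistic-regression probe with full-batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def bits_to_encode(w, b, xs, ys, eps=1e-12):
    """Sum of -log2 p(y | x) over the given examples under the probe."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        p_true = p if y == 1 else 1.0 - p
        total -= math.log2(max(p_true, eps))
    return total


def online_codelength(xs, ys, fractions=(0.125, 0.25, 0.5, 1.0), num_classes=2):
    """Online code: send the first block with a uniform code, then send each
    later block with a probe trained on all earlier examples."""
    n = len(xs)
    cuts = [int(f * n) for f in fractions]
    bits = cuts[0] * math.log2(num_classes)
    for start, end in zip(cuts, cuts[1:]):
        w, b = train_probe(xs[:start], ys[:start])
        bits += bits_to_encode(w, b, xs[start:end], ys[start:end])
    return bits


def make_toy_vectors(n, dim, signal, rng):
    """Synthetic stand-in for embeddings; if signal is True, dimension 0
    carries the label, otherwise the vectors are pure noise."""
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = []
    for y in ys:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if signal:
            x[0] = (2 * y - 1) + rng.gauss(0.0, 0.4)
        xs.append(x)
    return xs, ys


rng = random.Random(0)
n = 256
uniform_bits = n * math.log2(2)  # cost of sending labels with no model

xs_sig, ys_sig = make_toy_vectors(n, 4, True, rng)
xs_noise, ys_noise = make_toy_vectors(n, 4, False, rng)

bits_signal = online_codelength(xs_sig, ys_sig)
bits_noise = online_codelength(xs_noise, ys_noise)

# Compression = uniform codelength / online codelength; higher means the
# property is easier to extract from the vectors.
compression_signal = uniform_bits / bits_signal
compression_noise = uniform_bits / bits_noise
print(f"compression with signal: {compression_signal:.2f}")
print(f"compression without signal: {compression_noise:.2f}")
```

In this framing, a higher compression ratio for one set of vectors than another (for example, retriever vectors versus base language model vectors) is what "higher extractability" would mean. The ratio depends on the probe and the block schedule, so it is only comparable across models when those are held fixed.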
Peer Reviews
Decision: Submitted to ICLR 2024
- The paper presents an analysis of retrieval model representations that has not been done before.
- The results presented may be useful for future model development work for retrieval tasks.
- Motivation and framing: Examining the extractability of gender and occupation and attempting to correlate this with benchmark performance seems undermotivated and distracting; it's unclear why one would expect these two pieces of information to be crucially important for performance on the datasets. (On the other hand, the analysis with gendered queries seems reasonable.)
- Experimental details: The experimental section seems less than ideal in rigor, the writing in the paper is a bit scattere
- The proposed probing analysis, which uses extractability to explore dense retrieval representations, is quite interesting and novel.
- The reported analysis on topic and gender is also valuable and useful: extractability does not necessarily entail retrieval performance, bias such as gender bias does not originate from the representation itself, etc.
- The probing analysis, such as the correlation between extractability and retrieval performance, is explored well. But it is unclear how to apply the current probing analysis to obtain better retrieval or application tasks. How could the retrieval method be modified so that extractability helps improve performance?
- The current experiment is restricted to only two types of information: topic and gender. An extension to other types of bias is desirable.
- The extractability is consider
1. It will be interesting for the community to see detailed analysis pertaining to random initialization of dense retrievers. However, it is not clear that BEIR is the correct evaluation set here, since it does not involve supervision; it could be that the models show low variance when trained on supervised retrieval.
2. There are extensive experiments and analysis on BEIR and also on gender/topic bias. However, the plots are sometimes hard to read, and there are multiple concerns about the data an
1. The evaluation tasks are designed for a specific subset of retrieval that is notoriously hard. Perhaps evaluation on fully supervised retrieval would be more informative.
2. It is interesting to study variance in retrieval, but Contriever is not especially competitive compared to more recent dense retrievers.
3. I am not sure the reference numbers for Contriever are correct. I checked the Contriever paper and saw nDCG@10 is 75.8 instead of 68 for FEVER, and is 67.7 instead of 65 for SciFact
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Attention Is All You Need · Softmax · WordPiece · Residual Connection · Linear Layer · Weight Decay · Dropout · Layer Normalization
