Contextual Document Embeddings
John X. Morris, Alexander M. Rush

TL;DR
This paper introduces methods for creating contextualized document embeddings that incorporate neighboring document information, leading to improved retrieval performance especially out-of-domain, and achieves state-of-the-art results on the MTEB benchmark.
Contribution
The paper proposes two novel approaches for contextualized document embeddings, explicitly integrating neighbor context into the embedding process, and demonstrates their effectiveness over traditional biencoders.
Findings
Both methods outperform biencoders in various settings.
Significant improvements are observed, especially out-of-domain.
The approach achieves state-of-the-art results on the MTEB benchmark.
Abstract
Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in…
Peer Reviews
Decision: ICLR 2025 Poster
1. The method is well-motivated, although it is unclear whether current state-of-the-art embedding models are actually unable to provide good, nuanced representations without it. 2. The authors interpret both the training of dense embedding methods and the method itself from a statistical perspective, which is convincing.
The authors claim that no hard negative mining is required to achieve state-of-the-art. However, the first step of the method (grouping similar documents) is essentially hard negative mining and is shown to be a key contribution to the performance. At the end, it is mentioned that an extra hard negative per query is used to achieve the best performance.
- A new batch sampling technique.
- State-of-the-art on the MTEB benchmark.
Even though the paper is motivated by adapting the model to an out-of-domain corpus, it is not evaluated under a domain-shift paradigm.
(1) Overall, the paper works on an important problem and provides two clean improvements that seem well-motivated and effective. (2) The paper is well-written and clear: each section is well-organized, the methods are well-motivated, and each aspect of the method is explained clearly. (3) The experiments are thorough and include useful ablations and analysis.
(1) The contextual architecture seems more expensive than its non-contextual counterpart, so the experiments would be clearer if they also reported wall-clock training time. For example, from what I understand, all four methods in Table 1 were trained for the same number of steps; would the results change if they were trained for the same amount of time instead? (2) The paper states that hyperparameters were chosen based on the small-scale experiments, but later also states "For our final model
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Learning
