Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi

TL;DR
This paper introduces AMER, a multi-embedding retrieval model that generates multiple query vectors to better capture multimodal relevance, outperforming traditional single-embedding retrievers especially when target documents are diverse.
Contribution
The paper proposes a novel autoregressive multi-embedding retriever, AMER, which improves retrieval performance by generating multiple query vectors to handle multimodal relevance distributions.
Findings
AMER achieves 4x better performance than a single-embedding model on synthetic data.
AMER shows 4-21% relative gains on real-world datasets.
Larger improvements are observed when target document embeddings are diverse.
Abstract
Most text retrievers generate one query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, Autoregressive Multi-Embedding Retriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on…
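The retrieval step described in the abstract, where every predicted query vector is used to search the corpus, can be illustrated with a small sketch. This is not the authors' implementation: the function name `retrieve_multi`, the inner-product scoring, the rule of merging by each document's best score over all query vectors, and the toy 2-D vectors are all assumptions made for illustration. The text above does not say how the per-vector results are combined.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_multi(queries: List[Sequence[float]],
                   docs: List[Sequence[float]],
                   k: int) -> List[int]:
    """Return indices of the top-k documents.

    Assumption: a document's score is its best inner product
    over all query vectors (max-merge). Other merge rules are possible.
    """
    scores = [max(dot(q, d) for q in queries) for d in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy corpus: two relevant documents far apart, plus two distractors.
docs = [
    [1.0, 0.0],    # 0: target, interpretation A
    [0.0, 1.0],    # 1: target, interpretation B
    [0.8, 0.8],    # 2: distractor lying between the two targets
    [-1.0, 0.0],   # 3: unrelated distractor
]

single = [[0.6, 0.5]]              # one compromise query vector
multi = [[1.0, 0.0], [0.0, 1.0]]   # one query vector per interpretation

top_single = retrieve_multi(single, docs, k=2)
top_multi = retrieve_multi(multi, docs, k=2)
print("single:", top_single)
print("multi: ", sorted(top_multi))
```

In this toy setup the single compromise vector ranks the in-between distractor first and recovers only one target, while the two query vectors recover both targets. This mirrors the qualitative claim that single-vector retrieval struggles more as target embeddings grow further apart.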
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper's core idea is simple, intuitive, and rigorously tested. It is also well written.
- The authors tackle a core, fundamental limitation of the dominant bi-encoder retrieval paradigm. By providing a practical architecture to move "beyond single embeddings", this work opens a new and important direction for retrieval model design.
- The authors report gains on a synthetic benchmark that showcases its superior performance, and also show a moderate gain on real-world data.
- The authors state they "assume a setting" with a frozen document encoder for "faster development". This is a major experimental concession. While this makes testing easier, it creates a potential disconnect between the latest documents.
- There might be a potential practical downside of higher inference cost due to *m* separate matches.
- The true number of distinct answers (or answer clusters) varies per query, from one to many. This fixed parameter might not perform well in certain cases or
This paper reframes retrieval for multi-target queries as autoregressive generation of multiple query embeddings, creatively combining sequence modeling with contrastive retrieval. It uses Hungarian matching to align unordered positives and scheduled sampling to reduce exposure bias, removing a core limitation of single-vector retrievers.
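The alignment step mentioned in this review, matching generated query embeddings to an unordered set of positive documents, is a minimum-cost assignment problem. The sketch below is only an illustration under stated assumptions: it brute-forces all permutations instead of running the Hungarian algorithm (which solves the same problem in polynomial time), it uses squared Euclidean distance as the cost, and the names `best_assignment`, `preds`, and `targets` are hypothetical. The paper's actual cost function and training loss are not given in the text here.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (assumed cost, for illustration only)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_assignment(preds: List[Sequence[float]],
                    targets: List[Sequence[float]]) -> Tuple[Tuple[int, ...], float]:
    """Return (perm, cost), where perm[i] is the target index matched to preds[i].

    Brute force over all permutations; fine for a handful of vectors,
    but the Hungarian algorithm should replace it for larger sets.
    """
    if len(preds) != len(targets):
        raise ValueError("this sketch assumes equally many predictions and targets")
    n = len(preds)
    best_perm: Tuple[int, ...] = tuple(range(n))
    best_cost = float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(preds[i], targets[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# The targets have no inherent order, so the first prediction may
# belong with the second target and vice versa.
preds = [[0.9, 0.1], [0.1, 0.8]]
targets = [[0.0, 1.0], [1.0, 0.0]]

perm, cost = best_assignment(preds, targets)
print(perm, round(cost, 4))
```

Because the matching is recomputed from the predictions, the loss does not penalize the model for producing the positives in a different order than they happen to be listed.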
(1) Baselines are incomplete. The paper does not compare against strong multi-vector retrievers (e.g., the ColBERT[1] family) or competitive query-rewriting/expansion approaches (such as MMLF[2]). Although architectural limitations of late interaction are mentioned, the lack of head-to-head metrics under matched budgets makes it hard to gauge relative advantage.
(2) Single-answer regimes are underexplored. It remains unclear how the method behaves when a query has a single narrow intent.
(3)
The paper tackles an important task. It shows some limitations in current methods and proposes a method that addresses them. Overall, the idea seems interesting to me, but I feel like the improvement on real world tasks is a lot lower than I expected.
My main concern is related to the experimental part: while I agree that in principle one query can be mapped to diverse documents, and the experiments show that performance degrades when the most diverse set of documents is chosen, the synthetic data experiments feel tailored explicitly for this case. If the proposed method achieves 100% performance on the synthetic data, it feels like the experiment is either too simple or specifically tailored for this case. I think that overall it
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Advanced Graph Neural Networks
