The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training
Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto

TL;DR
This paper introduces the Complexity-Diversity Principle for dense retriever training, showing how query complexity influences the optimal diversity level, and proposes complexity-aware training methods that improve out-of-domain generalization.
Contribution
It systematically studies multi-query synthesis, formalizes the relationship between query complexity and diversity, and proposes new training strategies for better out-of-domain performance.
Findings
Diversity benefits are strongly correlated with query complexity (r≥0.95).
Complexity-aware training improves out-of-domain performance.
Combining multi-query synthesis with content-word (CW) weighted training yields compounded gains.
Abstract
Synthetic query generation has become essential for training dense retrievers, yet prior methods generate one query per document, focusing solely on query quality. We are the first to systematically study multi-query synthesis and discover a quality-diversity trade-off: high-quality queries benefit in-domain tasks, while diverse queries benefit out-of-domain (OOD) generalization. Through controlled experiments on 4 benchmark types across Contriever, RetroMAE, and Qwen3-Embedding, we find that diversity benefit strongly correlates with query complexity (r≥0.95, p<0.05), approximated by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. Based on CDP, we propose complexity-aware training: multi-query synthesis for high-complexity tasks and CW-weighted training for existing data. Both strategies improve OOD…
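The abstract approximates query complexity by content words (CW) and mentions CW-weighted training, but this summary does not say how content words are identified or how the weights are computed. The sketch below is only one plausible reading, not the authors' implementation: it assumes content words are the tokens left after removing a small, hypothetical function-word list, and that each query's weight is its CW count rescaled so the weights average to 1. The names `FUNCTION_WORDS`, `content_word_count`, and `cw_weights` are illustrative.

```python
import re

# Hypothetical function-word list, for illustration only. The paper may
# identify content words differently (for example, with a POS tagger).
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "for", "to", "and", "or",
    "is", "are", "was", "were", "be", "do", "does", "did",
    "what", "which", "who", "how", "with", "by", "at", "from",
    "that", "this", "it", "as",
})


def content_word_count(query: str) -> int:
    """Count tokens that are not in the function-word list."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for token in tokens if token not in FUNCTION_WORDS)


def cw_weights(queries: list[str]) -> list[float]:
    """Assumed weighting: proportional to CW count, rescaled to mean 1."""
    counts = [content_word_count(q) for q in queries]
    total = sum(counts)
    if total == 0:
        # No content words anywhere: fall back to uniform weights.
        return [1.0] * len(queries)
    n = len(queries)
    return [count * n / total for count in counts]


queries = [
    "What is the capital of France?",
    "How does query complexity affect dense retriever generalization?",
]
weights = cw_weights(queries)
```

Under these assumptions, each weight would multiply the corresponding example's loss term during training, so queries with more content words contribute more to the gradient. Rescaling to mean 1 keeps the overall loss scale comparable to unweighted training.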
Taxonomy
Topics: Information Retrieval and Search Behavior · Data Quality and Management · Advanced Image and Video Retrieval Techniques
