Towards Better Instruction Following Retrieval Models
Yuchen Zhuang, Aaron Trinh, Rushi Qiang, Haotian Sun, Chao Zhang, Hanjun Dai, Bo Dai

TL;DR
This paper introduces InF-IR, a large-scale instruction-following retrieval training corpus, and InF-Embed, an embedding model trained on this data, significantly improving instruction-following retrieval performance.
Contribution
The paper presents a novel high-quality corpus for instruction-following IR and a new embedding model trained on it, enhancing retrieval accuracy for user instructions.
Findings
InF-Embed outperforms baselines by 8.1% in p-MRR.
InF-IR enables efficient training of smaller encoder-only models.
Contrastive learning with triplets improves instruction alignment.
Abstract
Modern information retrieval (IR) models, trained exclusively on standard <query, passage> pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries; these negatives are then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR…
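To make the triplet setup in the abstract concrete, here is a minimal sketch of a contrastive loss over one positive <instruction, query, passage> triplet and its two hard negatives (one with a poisoned instruction, one with a poisoned query). Several things here are assumptions and are not taken from the paper: that the instruction and query are encoded jointly into a single vector, that relevance is scored by cosine similarity, and that an InfoNCE-style loss with temperature 0.05 is used. The function names and toy vectors are hypothetical; this is not the InF-Embed objective itself.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def triplet_info_nce(passage, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one passage embedding.

    `positive` is the (assumed) joint embedding of the correct
    instruction + query; `negatives` are embeddings of the poisoned
    instruction/query variants. Lower loss means the passage is
    closer to the positive than to the hard negatives.
    """
    sims = np.array(
        [cosine(passage, positive)] + [cosine(passage, n) for n in negatives]
    )
    logits = sims / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])  # negative log-probability of the positive


# Toy embeddings (hypothetical values, for illustration only).
passage = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])             # correct instruction + query
poisoned_instruction = np.array([0.1, 0.9, 0.0])  # hard negative 1
poisoned_query = np.array([0.0, 0.2, 0.9])        # hard negative 2

loss_aligned = triplet_info_nce(
    passage, positive, [poisoned_instruction, poisoned_query]
)
loss_misaligned = triplet_info_nce(
    passage, poisoned_instruction, [positive, poisoned_query]
)
print(loss_aligned, loss_misaligned)
```

In this toy setup the loss is small when the passage sits closest to the correct instruction and query, and large when a poisoned variant is treated as the positive, which is the behavior a contrastive objective over such triplets is meant to reward.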
Peer Reviews
Decision: Submitted to ICLR 2026
The paper addresses a real gap in instruction-following IR by providing high-quality training data where existing work offers either small-scale datasets or lower-quality synthetic data. The negative sampling strategy that independently poisons instructions and queries is more comprehensive than prior work that only contrasts documents. Quality control using o3-mini with human validation is rigorous, and the experimental evaluation is thorough—testing 7 backbone models with 12 loss variants acro
The core methodology is primarily an engineering integration of existing techniques rather than a fundamental innovation. Generating hard negatives via LLM-based perturbation and quality filtering with stronger models are established practices; the extension to instruction-query-passage triplets, while useful, is incremental. The theoretical contribution is minimal—there's no analysis of why marginal sampling preserves effectiveness or formal characterization of when multivariate objectives domi
- The contribution of a training dataset is useful and the research question is important. I do believe more focus is needed on this research direction and more training data is helpful to this endeavor.
- The data curation methodology was validated with humans, making their approach trustworthy.
- Weak baselines: From my understanding, all of the InF-Embed models in Table 2 were trained on additional data that the baselines did not have access to. The improvement of InF-Embed in this case becomes obvious as it has access to additional data, telling us nothing new. Now, if the point of the table was to show that their approach can improve various models, then I believe it is critical that they include a baseline that is trained on the InF-Embed dataset, but without instructions, i.e., the
- High-quality dataset construction: The multi-stage synthesis + filtering pipeline results in superior data diversity and accuracy.
- Principled learning formulation: The introduction of multivariate conditional contrastive learning elegantly models instruction–query–passage dependencies.
- Comprehensive evaluation: Benchmarks span diverse instruction types, with consistent, reproducible gains.
- Clear ablations: The study effectively isolates the contributions of data filtering, objective desi
- Synthetic bias: Heavy reliance on GPT-4o-mini for both data generation and negative synthesis may limit generalization to real-world, noisy user instructions.
- Limited cross-domain validation: All experiments are text-only; evaluating on multimodal or knowledge-intensive tasks would strengthen claims about generalizability.
- Interpretability: While performance gains are clear, the paper offers limited qualitative analysis (e.g., attention visualization or case studies) explaining how the mod
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling
