ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

Fengran Mo; Jinghan Zhang; Yuchen Hui; Jia Ao Sun; Zhichao Xu; Zhan Su; Jian-Yun Nie

arXiv:2508.04001·cs.IR·November 13, 2025

TL;DR

ConvMix is a data augmentation framework that uses large language models to generate diverse, high-quality training samples for conversational dense retrieval, improving on previous baselines across five benchmarks.

Contribution

We introduce ConvMix, a mixed-criteria data augmentation framework that uses large language models to produce scalable, diverse, and high-quality training data for conversational dense retrieval.

Findings

01 Outperforms previous baselines on five benchmarks.

02 Improves retrieval accuracy with augmented training data.

03 Demonstrates the effectiveness of mixed-criteria augmentation.

Abstract

Conversational search aims to satisfy users' complex information needs through multi-turn interactions. The key challenge lies in uncovering users' real search intent from context-dependent queries. Previous studies approach conversational search by fine-tuning a conversational dense retriever on relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm suffers from data scarcity. To address this, we propose ConvMix, a mixed-criteria framework for augmenting conversational dense retrieval that covers more aspects than existing data augmentation frameworks. We design a scalable two-sided relevance judgment augmentation schema with the aid of large language models. In addition, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and near-distribution supervisions to combine various…
