Data Selection for Multi-turn Dialogue Instruction Tuning

Bo Li; Shikun Zhang; Wei Ye

arXiv:2604.07892·cs.CL·April 21, 2026

TL;DR

This paper introduces MDS, a dialogue-level data selection framework that improves the quality of multi-turn dialogue datasets for instruction tuning by scoring entire conversations based on coverage, reliability, and consistency.

Contribution

The paper presents MDS, a dialogue-level selection method for multi-turn instruction-tuning data that outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines.

Findings

01. MDS outperforms strong baselines on three multi-turn benchmarks.

02. MDS achieves the best overall rank across reference-free and reference-based metrics.

03. MDS is more robust on long conversations under the same training budget.

Abstract

Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines…
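The two-stage idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the `Dialogue` fields, the fixed `centroids`, the nearest-centroid binning rule, and the single precomputed `local_score` are all assumptions made for illustration. The abstract does not specify how trajectories are embedded, how bins are formed, or how the local signals are combined.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Dialogue:
    dialogue_id: str
    # Hypothetical: a fixed-size vector summarizing the sequence of user queries.
    trajectory: Sequence[float]
    # Hypothetical: one number standing in for the local structural signals
    # (topic grounding, information progress, query-answer form consistency).
    local_score: float


def nearest_bin(vec: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Return the index of the centroid closest to vec (squared Euclidean)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def select_dialogues(
    dialogues: List[Dialogue],
    centroids: List[Sequence[float]],
    per_bin: int,
) -> List[Dialogue]:
    """Assign dialogues to bins, then keep the best-scoring ones in each bin."""
    # Global stage (assumed form): group dialogues by trajectory bin so that
    # every region of the trajectory space contributes to the selection.
    bins = defaultdict(list)
    for d in dialogues:
        bins[nearest_bin(d.trajectory, centroids)].append(d)

    # Local stage (assumed form): within each bin, rank by the local score
    # and keep at most per_bin dialogues.
    selected: List[Dialogue] = []
    for idx in sorted(bins):
        ranked = sorted(bins[idx], key=lambda d: d.local_score, reverse=True)
        selected.extend(ranked[:per_bin])
    return selected


pool = [
    Dialogue("a", (0.1, 0.0), 0.9),
    Dialogue("b", (0.2, 0.1), 0.4),
    Dialogue("c", (0.9, 1.0), 0.7),
    Dialogue("d", (1.0, 0.9), 0.8),
]
centroids = [(0.0, 0.0), (1.0, 1.0)]
picked = select_dialogues(pool, centroids, per_bin=1)
print([d.dialogue_id for d in picked])  # ['a', 'd']
```

Capping each bin separately is what keeps the selection from collapsing onto one dense region of similar conversations; a single global top-k over `local_score` would not give that guarantee.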
