Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Maximillian Chen; Ruoxi Sun; Tomas Pfister; Sercan Ö. Arık

arXiv:2406.00222·cs.CL·July 29, 2025·1 cites

TL;DR

This paper introduces Action-Based Contrastive Self-Training (ACT), a data-efficient method for improving how multi-turn conversational agents recognize and resolve ambiguity, without requiring extensive labeled data.

Contribution

The paper presents ACT, a novel contrastive self-training algorithm that enhances dialogue policy learning in LLMs, especially for disambiguation tasks, using limited data and no action labels.

Findings

1. ACT outperforms supervised fine-tuning and DPO in conversation modeling tasks.

2. Demonstrates effectiveness across multiple real-world conversational datasets.

3. Enables LLMs to better recognize and reason about ambiguity in dialogue.

Abstract

Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy under…
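The abstract describes ACT as built on Direct Preference Optimization (DPO). As a rough illustration of that underlying objective only, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is not the authors' implementation: ACT's action-based construction of contrastive pairs and its quasi-online sampling are not shown, and the function name `dpo_loss`, the default `beta`, and the example log-probabilities are all hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model. `beta` scales the
    implicit reward; 0.1 is an illustrative default, not a value taken
    from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical pair for an ambiguous request: the "chosen" response asks a
# clarifying question, the "rejected" response guesses the user's intent.
loss = dpo_loss(policy_chosen_logp=-5.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-7.0, ref_rejected_logp=-7.0)
print(f"{loss:.3f}")
```

The loss falls below log 2 once the policy assigns relatively more probability to the chosen response than the reference model does, and rises above it in the opposite case.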

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. This paper proposes a feasible way to optimize multi-turn conversation modeling for LLMs, which overcomes some obstacles faced by existing methods. 2. The proposed method, ACT, is a sample-efficient approach based on online DPO and achieves good performance on three real-world conversational tasks. 3. The paper is well written and the experiments are convincing.

Weaknesses

This paper conducts experiments on a limited range of dialogue tasks. There are also many variants of DPO-based methods, but the authors have not compared against them as baselines.

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The writing is clear; the proposals and experiments are easy to understand. 2. The evaluation datasets cover diverse tasks. I recognize the effort the authors made to experiment on different tasks (e.g., SQL, table QA).

Weaknesses

1. The biggest weakness of the paper is the choice of weak baselines, which jeopardizes the paper's contribution claims. One paper off the top of my head is https://arxiv.org/pdf/2404.19733. It shares many settings with this work: online sampling and a heuristic to filter trajectories. I believe it should be included as a baseline. Maybe the authors have similar ablation comparisons in Table 6. I would like to see a clear discussion of the baselines and ablations. 2.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

First and foremost, the authors present extensive comparisons showing that their approach leads to better conversational outcomes, both in terms of recognizing the need to clarify a user prompt, and in resolving the user's information need. The difference is sometimes quite large, with 10-20% improvements over SFT or a modified DPO approach. It is a significant result, showing that off-policy learning techniques can be bested by even this quasi-online learning approach. While not alone in findin

Weaknesses

The paper is probably overstuffed with results and is generally very wordy, making it difficult to read. It is saved by the clear algorithms and figures, but the text should be shortened considerably. This would also allow for more discussion of the AmbigSQL dataset, which is a significant contribution but is barely described in the text. As an example, I found that the first three sentences of the paragraph at L350 ("Implicit Ambiguity Recognition") all say essentially the same thing.

Videos

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training · slideslive

Taxonomy

Topics: Academic and Historical Perspectives in Psychology · Educational and Psychological Assessments · Counseling, Therapy, and Family Dynamics

Methods: Direct Preference Optimization