Aligning Spoken Dialogue Models from User Interactions
Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez

TL;DR
This paper introduces a preference alignment framework for spoken dialogue models that learns from user interaction data and targets real-time speech conversations, including dynamics such as interruptions and interjections.
Contribution
It presents a large-scale dataset of more than 150,000 preference pairs and applies offline alignment methods to fine-tune a full-duplex speech-to-speech model, improving factuality, safety, and contextual relevance in dialogue systems.
Findings
Feedback on generic conversations consistently improves spoken dialogue models
Dataset of more than 150,000 AI-annotated preference pairs created from multi-turn speech conversations
Enhanced real-time speech interaction quality
Abstract
We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual,…
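To make "offline alignment on preference pairs" concrete, here is a minimal sketch of one well-known offline objective, Direct Preference Optimization (DPO). The abstract only says "offline alignment methods", so the choice of DPO, the `PreferencePair` container, and the `beta` value are all assumptions made for illustration, not a description of the authors' implementation. Each pair holds the summed log-probability of a preferred and a rejected response under the model being trained and under a frozen reference model.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Hypothetical container: summed token log-probabilities of each
    # response under the trained policy and a frozen reference model.
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def dpo_loss(pair: PreferencePair, beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = pair.policy_logp_chosen - pair.ref_logp_chosen
    rejected_margin = pair.policy_logp_rejected - pair.ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# A policy identical to the reference gives the neutral loss log(2).
neutral = PreferencePair(-5.0, -7.0, -5.0, -7.0)
# A policy that raises the chosen response and lowers the rejected one scores lower.
improved = PreferencePair(-4.0, -8.0, -5.0, -7.0)
```

In this sketch, `dpo_loss(improved)` is smaller than `dpo_loss(neutral)`, which is the signal an offline method would follow when fine-tuning on a fixed set of preference pairs without generating new conversations during training.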
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multi-Agent Systems and Negotiation
