Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

Siddhant Arora; Jinchuan Tian; Hayato Futami; Jee-weon Jung; Jiatong Shi; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe

arXiv:2506.00722·cs.CL·June 3, 2025

TL;DR

This paper introduces a chain-of-thought training strategy for end-to-end spoken dialogue systems. By keeping dialogue training aligned with the multimodal language model's pre-training tasks, it improves the semantic coherence of responses and makes training feasible on limited data.

Contribution

The paper proposes a novel chain-of-thought formulation for E2E spoken dialogue systems that enables effective training with limited data and improves response quality.

Findings

01 Achieved over 1.5 ROUGE-1 improvement over the baseline.

02 Successfully trained on 300 hours of conversation data.

03 Models and code will be publicly released.
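
For readers unfamiliar with the metric in the first finding, ROUGE-1 scores the unigram overlap between a generated response and a reference response. The sketch below is a minimal illustration of the F1 variant, assuming lowercase whitespace tokenization; the page does not say which ROUGE implementation, variant, or scale the authors used, so the function name `rouge1_f1` and every choice here are assumptions made for illustration only.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate response.

    Assumption: lowercase whitespace tokenization, no stemming. Published
    ROUGE toolkits may tokenize and normalize differently.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on a mat"
    print(round(rouge1_f1(reference, candidate), 4))
```

In this toy pair, four of the six candidate unigrams also appear in the reference, so precision and recall are both 4/6.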

Abstract

Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation…
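
One way to picture a chain-of-thought formulation aligned with ASR, text LM, and TTS tasks is as a single training sequence in which the model first transcribes the user's speech, then writes a text response, then produces the speech for that response. The sketch below illustrates that reading only. The stage ordering is an interpretation of the abstract, and the marker tokens (`<asr>`, `<text_response>`, `<tts>`), the function names, and the string-based token representation are all hypothetical rather than taken from the paper or its code.

```python
from typing import List, Sequence

# Hypothetical stage markers. The source does not name any special tokens.
ASR_TAG = "<asr>"
TEXT_TAG = "<text_response>"
TTS_TAG = "<tts>"


def build_cot_sequence(
    user_speech: Sequence[str],
    user_transcript: str,
    response_text: str,
    response_speech: Sequence[str],
) -> List[str]:
    """Concatenate one dialogue turn into a single CoT-style sequence.

    Assumed order: input speech tokens, ASR transcript, text response,
    output speech tokens. Each intermediate stage mirrors one of the
    pre-training tasks named in the abstract (ASR, text LM, TTS).
    """
    return (
        list(user_speech)
        + [ASR_TAG] + user_transcript.split()
        + [TEXT_TAG] + response_text.split()
        + [TTS_TAG] + list(response_speech)
    )


def loss_mask(sequence: Sequence[str], prompt_len: int) -> List[int]:
    """Mark positions to train on: 0 for the speech prompt, 1 afterwards."""
    return [0] * prompt_len + [1] * (len(sequence) - prompt_len)


if __name__ == "__main__":
    speech_in = ["s1", "s2"]  # placeholder discrete speech tokens
    seq = build_cot_sequence(speech_in, "hi there", "hello", ["o1"])
    print(seq)
    print(loss_mask(seq, prompt_len=len(speech_in)))
```

The loss mask reflects one plausible design choice, namely training only on the tokens the model must generate; whether the authors mask the input speech this way is not stated on this page.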

Taxonomy

Topics: Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling