Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

Katia Vendrame; Bolaji Yusuf; Santosh Kesiraju; Šimon Sedláček; Oldřich Plchot; Jan Černocký

arXiv:2511.22503 · cs.CL · December 1, 2025

TL;DR

This paper proposes a joint training approach using spoken and textual data to improve end-to-end spoken dialogue state tracking across multiple domains, reducing reliance on domain-specific spoken data.

Contribution

It introduces a novel joint training method leveraging textual data from various domains to enhance cross-domain generalization in spoken DST models.

Findings

01 Effective cross-domain DST performance without target-domain spoken data

02 Joint training improves generalization across multiple domains

03 Reduces need for costly spoken DST data collection

Abstract

End-to-end spoken dialogue state tracking (DST) is made difficult by the dual challenge of handling speech input and data scarcity. Recent work has proposed combining speech foundation encoders with large language models to alleviate some of this difficulty. Although this approach yields strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the…
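As a rough illustration of what jointly training on spoken and written DST data could look like at the data-loading level, the sketch below mixes the two sources into each batch. It is not the authors' implementation: the `DSTExample` container, the `joint_batches` helper, the `text_ratio` parameter, and the sample dialogue states are all hypothetical names and values chosen for this example, and the abstract does not specify how the two data sources are actually combined.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class DSTExample:
    """One training example (hypothetical container, not from the paper)."""
    domain: str
    target_state: str                  # serialized dialogue state to predict
    text: Optional[str] = None         # written dialogue context, if text-only
    audio_path: Optional[str] = None   # path to a waveform, if spoken

    @property
    def modality(self) -> str:
        return "speech" if self.audio_path is not None else "text"


def joint_batches(
    spoken: List[DSTExample],
    written: List[DSTExample],
    batch_size: int,
    text_ratio: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[DSTExample]]:
    """Yield batches that mix spoken and written examples.

    Assumption: a fixed share of every batch (text_ratio) is drawn from the
    written pool; the rest comes from the spoken pool. Sampling is with
    replacement so the smaller pool is never exhausted.
    """
    rng = random.Random(seed)
    n_text = round(batch_size * text_ratio)
    n_speech = batch_size - n_text
    for _ in range(num_batches):
        batch = rng.choices(spoken, k=n_speech) + rng.choices(written, k=n_text)
        rng.shuffle(batch)  # avoid a fixed speech-then-text ordering
        yield batch


# Illustrative data only: spoken examples from one domain, text from another.
spoken_pool = [DSTExample("hotel", "hotel-area=north", audio_path="turn_001.wav")]
written_pool = [DSTExample("taxi", "taxi-destination=airport",
                           text="I need a cab to the airport")]

for batch in joint_batches(spoken_pool, written_pool,
                           batch_size=4, text_ratio=0.5, num_batches=2):
    print([(ex.domain, ex.modality) for ex in batch])
```

In a real system each example would be routed by modality: spoken inputs through the speech encoder, written inputs directly to the language model's text embedding. That routing is omitted here because the abstract does not describe it.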

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis