Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models
Wenze Xu, Chun Wang, Jiazhen Yu, Sheng Chen, Liang Gao, Weihong Deng

TL;DR
This paper introduces Optimal Transport Regularization (OTReg), a novel method that aligns speech and text representations in spoken language models to improve their generalization across datasets.
Contribution
The paper proposes OTReg, a lightweight, label-free regularization technique that formulates speech-text alignment as an optimal transport problem within SLM training.
Findings
OTReg improves speech-text alignment in multilingual ASR tasks.
OTReg enhances SLM generalization across diverse datasets.
OTReg reduces the modality gap between speech and text representations.
Abstract
Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss…
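To make the idea of treating speech-text alignment as an optimal transport problem concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the cost function, the marginals, or the solver, so the cosine-distance cost, the uniform marginals, the entropic (Sinkhorn) solver, and the names `cosine_cost`, `sinkhorn_plan`, `ot_alignment_loss`, `eps`, and `n_iters` are all hypothetical choices.

```python
import numpy as np


def cosine_cost(speech, text):
    """Pairwise cosine distance between speech and text embeddings.

    speech: (n_speech, dim), text: (n_text, dim). Returns (n_speech, n_text).
    Assumption: cosine distance is the transport cost (not stated in the source).
    """
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return 1.0 - s @ t.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan via Sinkhorn iterations.

    Assumption: uniform mass on every speech frame and every text token.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on speech frames
    b = np.full(m, 1.0 / m)   # mass on text tokens
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale to match row marginals
        v = b / (K.T @ u)     # rescale to match column marginals
    return u[:, None] * K * v[None, :]


def ot_alignment_loss(speech, text, eps=0.1, n_iters=200):
    """Transport cost under the Sinkhorn plan, usable as a regularizer."""
    cost = cosine_cost(speech, text)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float(np.sum(plan * cost)), plan


rng = np.random.default_rng(0)
speech_emb = rng.normal(size=(6, 8))  # e.g. 6 speech frames, 8-dim embeddings
text_emb = rng.normal(size=(4, 8))    # e.g. 4 text tokens, 8-dim embeddings
loss, plan = ot_alignment_loss(speech_emb, text_emb)
print(loss, plan.shape)
```

In a training setup, a term like this would typically be added to the task loss with a weighting coefficient and computed in an autodiff framework so that gradients reach the speech encoder; how OTReg weights and differentiates its loss is not described in the portion of the abstract shown here.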
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing
