Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models

Wenze Xu; Chun Wang; Jiazhen Yu; Sheng Chen; Liang Gao; Weihong Deng

arXiv:2508.08131·cs.CL·August 12, 2025

Open Access

TL;DR

This paper introduces Optimal Transport Regularization (OTReg), a novel method that aligns speech and text representations in spoken language models to improve their generalization across datasets.

Contribution

The paper proposes OTReg, a lightweight, label-free regularization technique that formulates speech-text alignment as an optimal transport problem and applies it during spoken language model (SLM) training.
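To make the idea of alignment as optimal transport concrete, here is a minimal sketch of one common way to compute such a loss: entropy-regularized transport solved with Sinkhorn iterations. It is not the paper's implementation. The cosine cost, the uniform marginals, the `eps` and `n_iters` values, and all function names are illustrative assumptions, since this page does not give OTReg's actual formulation.

```python
import math


def cosine_cost(speech, text):
    """Cost matrix C[i][j] = 1 - cos(speech[i], text[j]).

    Assumption: cosine distance is used only for illustration; the page
    does not say which cost function OTReg uses.
    """
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid dividing by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    cost = []
    for s in speech:
        row = []
        for t in text:
            dot = sum(a * b for a, b in zip(s, t))
            row.append(1.0 - dot / (norm(s) * norm(t)))
        cost.append(row)
    return cost


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each speech frame (assumed uniform)
    b = [1.0 / m] * m  # mass on each text token (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment_loss(speech, text, eps=0.1, n_iters=200):
    """Transport cost <plan, cost>; smaller means the two sets are closer."""
    cost = cosine_cost(speech, text)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )


# Toy usage: three speech-frame embeddings against two text-token embeddings.
speech_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
print(ot_alignment_loss(speech_frames, text_tokens))
```

In a training setup, a term like this would typically be computed on framework tensors so that gradients flow back into the speech encoder, and added to the task loss with a weighting coefficient. Both of those details are assumptions here rather than something stated on this page.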

Findings

01

OTReg improves speech-text alignment in multilingual ASR tasks.

02

OTReg enhances SLM generalization across diverse datasets.

03

OTReg reduces the modality gap between speech and text representations.

Abstract

Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing