POTSA: A Cross-Lingual Speech Alignment Framework for Speech-to-Text Translation

Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Yizhou Peng, Jin Li, Yuheng Lu, Yu Jiang, Nyima Tashi, Longbiao Wang, and Jianwu Dang

arXiv:2511.09232·cs.CL·April 1, 2026

TL;DR

POTSA is a cross-lingual speech alignment framework that uses Optimal Transport to improve multilingual speech-to-text translation, especially for low-resource and zero-shot languages.

Contribution

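POTSA introduces a Bias Compensation module and token-level Optimal Transport (OT) constraints, with a layer scheduling strategy that focuses those constraints on semantically beneficial layers, to better align speech representations across languages.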

Findings

01. Achieves state-of-the-art BLEU scores on the FLEURS dataset.

02. Improves translation performance by 1.29 BLEU on five languages.

03. Improves zero-shot translation by 2.93 BLEU using limited data.

Abstract

Speech Large Language Models have achieved breakthroughs in multilingual speech-to-text translation. However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport, designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations. Second, we impose token-level OT constraints on a Q-Former using parallel pairs to establish fine-grained representation consistency. Then, we apply a layer scheduling strategy to focus OT constraints on semantically beneficial layers. Experiments on FLEURS show our method achieves SOTA performance, with +1.29 BLEU over five common…
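
The token-level OT constraint described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a cosine-distance cost, uniform marginals over tokens, and entropic regularization solved with Sinkhorn iterations, none of which are specified in the abstract. The function names (`cosine_cost`, `sinkhorn`, `ot_alignment_loss`) and the parameters `eps` and `n_iters` are hypothetical. The inputs stand in for the token sequences a Q-Former might produce for two parallel utterances in different source languages.

```python
import math


def cosine_cost(x, y):
    """Cost matrix C[i][j] = 1 - cos(x_i, y_j) for two lists of vectors."""
    def norm(v):
        # Guard against zero vectors to avoid division by zero.
        return math.sqrt(sum(a * a for a in v)) or 1.0

    return [
        [1.0 - sum(a * b for a, b in zip(xi, yj)) / (norm(xi) * norm(yj))
         for yj in y]
        for xi in x
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed, not from the paper)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over tokens of the first sequence
    b = [1.0 / m] * m  # uniform mass over tokens of the second sequence
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment_loss(x, y, eps=0.1, n_iters=200):
    """Transport cost <plan, cost>; small when the two token sets align."""
    cost = cosine_cost(x, y)
    plan = sinkhorn(cost, eps, n_iters)
    return sum(
        plan[i][j] * cost[i][j]
        for i in range(len(x))
        for j in range(len(y))
    )


# Toy stand-ins for token embeddings of a parallel speech pair.
tokens_lang_a = [[1.0, 0.0], [0.0, 1.0]]
tokens_lang_b = [[0.9, 0.1], [0.1, 0.9]]
loss = ot_alignment_loss(tokens_lang_a, tokens_lang_b)
```

In a training setup, a loss of this kind would typically be added to the translation objective so that parallel utterances are pulled toward similar token representations. A differentiable tensor implementation would be needed in practice; the pure-Python version here only shows the computation.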
