S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation
Yu Pan, Xiongfei Wu, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao

TL;DR
S2ST-Omni is a hierarchical, language-aware speech-to-speech translation framework that pairs a high-accuracy speech-to-text translation frontend with a modular, plug-and-play text-to-speech backend, targeting both translation accuracy and practical flexibility in multilingual translation.
Contribution
The paper presents a hierarchical architecture with language-aware modules and a progressive fine-tuning strategy for improved multilingual speech translation.
Findings
Achieves state-of-the-art BLEU scores on the CVSS-C dataset
Outperforms recent S2ST baselines on French-, German-, and Spanish-to-English translation
Demonstrates effective language-specific acoustic and linguistic representations
Abstract
Despite recent advances in speech-to-speech translation (S2ST), it remains difficult to achieve both high translation accuracy and practical flexibility. In this paper, we present S2ST-Omni, a compositional S2ST framework that integrates a high-accuracy speech-to-text translation (S2TT) frontend with a modular, plug-and-play text-to-speech (TTS) backend, enabling independent optimization of translation and synthesis. On the S2TT side, we introduce a hybrid adapter that follows a "local-then-global" strategy to bridge a pretrained Whisper encoder and a Qwen3 LLM, yielding a hierarchical acoustic-to-semantic abstraction. Building on this bridge, we further propose a hierarchical language-aware architecture that injects source-language information at two complementary levels. At the acoustic level, Language-Aware Dual-CTC operates on intermediate adapter features and employs FiLM-style…
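The abstract is cut off at "FiLM-style", so how the paper applies it is not shown here. As a rough illustration only, generic FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters selected by a conditioning signal, here a source-language ID. The sketch below is not the authors' implementation: the function names, the language codes, the feature dimension, and the randomly initialised parameters (which would be learned in a real model) are all assumptions made for the example.

```python
import random


def make_film_table(languages, dim, seed=0):
    """Build a hypothetical per-language (gamma, beta) lookup table.

    In a trained model these vectors would be learned; here they are
    random values near the identity transform (gamma ~ 1, beta ~ 0).
    """
    rng = random.Random(seed)
    table = {}
    for lang in languages:
        gamma = [1.0 + rng.uniform(-0.1, 0.1) for _ in range(dim)]
        beta = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        table[lang] = (gamma, beta)
    return table


def film(features, gamma, beta):
    """Apply y[t][d] = gamma[d] * x[t][d] + beta[d] to every frame t."""
    return [
        [g * x + b for x, g, b in zip(frame, gamma, beta)]
        for frame in features
    ]


# Toy input: 2 frames of 4-dimensional features (assumed shapes).
film_table = make_film_table(["fr", "de", "es"], dim=4)
frames = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

# Condition the same features on the source language "fr".
gamma_fr, beta_fr = film_table["fr"]
modulated = film(frames, gamma_fr, beta_fr)
```

The same frames passed with `film_table["de"]` would be shifted and scaled differently, which is the sense in which the modulation makes downstream layers language-aware.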
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: ADaptive gradient method with the OPTimal convergence rate · Adapter
