From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

Yu Pan; Yang Hou; Xiongfei Wu; Liang Zhang; Yves Le Traon; Lei Ma; Jianjun Zhao

arXiv:2605.16026·cs.CL·May 18, 2026

TL;DR

This paper introduces S2ST-Omni 2, a structured multilingual speech-to-speech translation framework that leverages typological priors for improved data efficiency and performance across languages.

Contribution

The paper reformulates language conditioning from flat language labels to structured typological priors at three levels, with the aim of improving multilingual S2ST performance and data efficiency.

Findings

01. S2ST-Omni 2 outperforms existing approaches on CVSS-C across multiple metrics.

02. Ablation studies show complementary benefits of the proposed strategies.

03. Controlled data-budget analysis demonstrates improved data efficiency with typological priors.

Abstract

Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language…
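To make the contrast between a flat language label and a structured typological prior concrete, here is a minimal sketch. It is not the paper's architecture: the typology table, the family/branch/residual decomposition, the additive composition, and every name in it (`TYPOLOGY`, `flat_condition`, `hierarchical_condition`, `shared_part`) are assumptions chosen for illustration only.

```python
import random

# Hypothetical typology table: language code -> (family, branch).
# Illustrative only; not taken from the paper.
TYPOLOGY = {
    "fr": ("indo-european", "romance"),
    "es": ("indo-european", "romance"),
    "de": ("indo-european", "germanic"),
    "ja": ("japonic", "japanese"),
}

DIM = 8  # embedding width, arbitrary for this sketch


def init_table(keys, dim, seed):
    """Create one random vector per key (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {k: [rng.gauss(0.0, 1.0) for _ in range(dim)] for k in sorted(keys)}


# Flat paradigm: one independent vector per language, nothing shared.
flat_table = init_table(TYPOLOGY.keys(), DIM, seed=0)

# Structured paradigm (assumed form): vectors at each level of the hierarchy.
family_table = init_table({fam for fam, _ in TYPOLOGY.values()}, DIM, seed=1)
branch_table = init_table({br for _, br in TYPOLOGY.values()}, DIM, seed=2)
residual_table = init_table(TYPOLOGY.keys(), DIM, seed=3)


def flat_condition(lang):
    """Language-as-label: look up an isolated embedding."""
    return flat_table[lang]


def shared_part(lang):
    """Component shared by every language in the same family and branch."""
    family, branch = TYPOLOGY[lang]
    return [f + b for f, b in zip(family_table[family], branch_table[branch])]


def hierarchical_condition(lang):
    """Shared family/branch component plus a language-specific residual."""
    return [s + r for s, r in zip(shared_part(lang), residual_table[lang])]


if __name__ == "__main__":
    # French and Spanish reuse the same family and branch parameters,
    # so only the residual has to be learned per language.
    print(shared_part("fr") == shared_part("es"))
    print(shared_part("fr") == shared_part("de"))
```

In this sketch, related languages share most of their conditioning parameters, which is one plausible reading of why structured priors could help when supervised S2ST data are scarce. A flat table offers no such sharing, since each language's vector is independent of the others.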
