Simultaneous Speech-to-Speech Translation Without Aligned Data
Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour

TL;DR
Hibiki-Zero introduces a novel speech translation model that eliminates the need for word-level aligned data, simplifying training and enabling effective multilingual simultaneous translation with improved latency and quality.
Contribution
The paper presents Hibiki-Zero, a new approach that removes the reliance on aligned data and uses reinforcement learning to optimize latency and translation quality.
Findings
Achieves state-of-the-art translation accuracy and naturalness
Supports adaptation to new languages with limited data
Demonstrates effective simultaneous translation across multiple language pairs
Abstract
Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero…
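The abstract describes the second training stage only at a high level. As a rough illustration of what a GRPO-style update works with, the sketch below scores a group of sampled translations for one source utterance and normalizes each reward against the group's mean and standard deviation. The reward function `combined_reward`, its `latency_weight`, and the sample values are hypothetical and chosen only for illustration; the paper's actual reward and hyperparameters are not given in this summary.

```python
from statistics import mean, pstdev


def combined_reward(quality, latency_s, latency_weight=0.1):
    """Hypothetical reward: translation quality minus a latency penalty.

    This is an assumed form for illustration, not the paper's reward.
    """
    return quality - latency_weight * latency_s


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples (GRPO-style).

    Each advantage is the reward's deviation from the group mean,
    divided by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up group of sampled translations for the same source utterance:
# (quality score in [0, 1], average lag in seconds).
group = [(0.80, 4.0), (0.78, 2.0), (0.60, 1.0), (0.82, 6.0)]

rewards = [combined_reward(q, lag) for q, lag in group]
advantages = group_relative_advantages(rewards)

for (q, lag), adv in zip(group, advantages):
    print(f"quality={q:.2f} lag={lag:.1f}s advantage={adv:+.2f}")
```

Under these assumed numbers, the sample that keeps most of the quality at a lower lag receives the largest advantage, which is the kind of trade-off a latency-aware reward is meant to encourage.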
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
