Simultaneous Speech-to-Speech Translation Without Aligned Data

Tom Labiausse; Romain Fabre; Yannick Estève; Alexandre Défossez; Neil Zeghidour

arXiv:2602.11072·cs.CL·February 12, 2026

TL;DR

Hibiki-Zero is a simultaneous speech translation model trained without word-level aligned data, which simplifies training and enables multilingual simultaneous translation with improved latency and quality.

Contribution

The paper presents Hibiki-Zero, a new approach that removes the reliance on aligned data and uses reinforcement learning to optimize latency and translation quality.

Findings

1. Achieves state-of-the-art translation accuracy and naturalness
2. Supports adaptation to new languages with limited data
3. Demonstrates effective simultaneous translation across multiple language pairs

Abstract

Simultaneous speech translation requires translating source speech into a target language in real time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero…
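To make the reinforcement learning stage more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is built on: several candidate outputs are sampled for the same input, each receives a scalar reward, and the rewards are standardized within that group. The reward below (a quality score minus a weighted latency penalty), the names `combined_reward`, `group_relative_advantages`, and `latency_weight`, and all numeric values are illustrative assumptions, not the reward design or settings used in the paper.

```python
from statistics import mean, pstdev


def combined_reward(quality, latency_s, latency_weight=0.1):
    """Hypothetical reward: translation quality minus a latency penalty."""
    return quality - latency_weight * latency_s


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (GRPO-style).

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up group of four sampled translations for one source utterance:
# (quality score, average lag in seconds).
samples = [(0.80, 4.0), (0.78, 2.5), (0.60, 1.5), (0.82, 5.0)]
rewards = [combined_reward(q, lag) for q, lag in samples]
advantages = group_relative_advantages(rewards)
```

Samples whose reward is above the group mean get a positive advantage and are reinforced, while those below it are discouraged. Under this assumed reward, a translation that is slightly less accurate but much faster can outrank a slower, more accurate one, which is the kind of trade-off between latency and quality that the abstract describes.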

Code & Models

Models

kyutai/hibiki-zero-3b-pytorch-bf16
model · 702 dl · ♡ 45

Datasets

kyutai/Audio-NTREX-4L
dataset · 143 dl

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling