High-Fidelity Simultaneous Speech-To-Speech Translation

Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour

arXiv:2502.03382 · cs.CL · February 27, 2025

TL;DR

Hibiki is a decoder-only model for simultaneous speech translation that processes source and target speech in real time, producing natural, high-quality translations with adaptive timing, and can be deployed on-device.

Contribution

The paper introduces Hibiki, a model that performs speech-to-text and speech-to-speech translation jointly in real time, using a multistream language model and a weakly-supervised method that picks translation delays on a per-word basis.

Findings

01 Achieves state-of-the-art translation quality on French-English tasks.

02 Demonstrates high speaker fidelity and naturalness in translations.

03 Supports real-time, on-device deployment with simple inference.

Abstract

We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which, unlike its consecutive counterpart (where one waits for the end of the source utterance to start translating), adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla…
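The per-word delay idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the selection rule (pick the source prefix at which a target word's log-probability increases the most), the `log_prob` scorer signature, and the toy lexicon scorer are all assumptions made for illustration. In practice the scorer would be backed by a real text translation model.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer: log p(target[j] | target[:j], source_prefix).
# A real system would query a text translation model here.
LogProbFn = Callable[[Sequence[str], Sequence[str], int], float]


def align_by_largest_gain(
    source: Sequence[str], target: Sequence[str], log_prob: LogProbFn
) -> List[int]:
    """For each target word, return how many source words must be heard
    before it is emitted (1-based count), using an assumed
    'largest log-probability gain' rule."""
    if not source:
        raise ValueError("source must contain at least one word")
    alignment: List[int] = []
    for j in range(len(target)):
        # Score the target word under every source prefix, from empty to full.
        scores = [log_prob(source[:i], target, j) for i in range(len(source) + 1)]
        # Gain contributed by the i-th source word (1-based).
        gains = [scores[i] - scores[i - 1] for i in range(1, len(source) + 1)]
        best = max(range(len(gains)), key=gains.__getitem__) + 1
        alignment.append(best)
    # Enforce monotonicity: a target word cannot be emitted before
    # the target word that precedes it.
    for j in range(1, len(alignment)):
        alignment[j] = max(alignment[j], alignment[j - 1])
    return alignment


# Toy stand-in for a translation model (assumption): a target word is
# likely only once its lexicon counterpart appears in the source prefix.
TOY_LEXICON = {"the": "le", "black": "noir", "cat": "chat", "sleeps": "dort"}


def toy_log_prob(source_prefix: Sequence[str], target: Sequence[str], j: int) -> float:
    needed = TOY_LEXICON[target[j]]
    return math.log(0.9) if needed in source_prefix else math.log(0.01)


source_words = ["le", "chat", "noir", "dort"]
target_words = ["the", "black", "cat", "sleeps"]
delays = align_by_largest_gain(source_words, target_words, toy_log_prob)
print(delays)  # [1, 3, 3, 4]
```

In this toy case "black" has to wait for the third source word ("noir"), and "cat" inherits that delay because it follows "black" in the target. This is the kind of reordering-induced wait that a per-word delay is meant to capture.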

Code & Models

Repositories

kyutai-labs/hibiki
PyTorch · Official

Models

Videos

High-Fidelity Simultaneous Speech-To-Speech Translation · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis