Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
Jeongsoo Choi, Jaehun Kim, Joon Son Chung

TL;DR
This paper presents Dub-S2ST, a novel textless speech-to-speech translation system that ensures natural, time-aligned dubbing by preserving speaker identity, speech duration, and pace through a diffusion-based translation and synthesis approach.
Contribution
The paper introduces a discrete diffusion-based translation model with explicit duration control and a unit-based speed adaptation mechanism, targeting dubbing applications without relying on text.
Findings
Produces natural, fluent, and time-aligned speech translations.
Achieves competitive translation quality while preserving the source speech's duration, speaker identity, and speaking speed.
Demonstrates effectiveness through extensive experiments.
Abstract
This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the translated units and source speaker's identity using a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text.…
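To make "explicit duration control" and "unit-based speed" concrete, here is a minimal sketch of how both quantities could be read off a discrete unit sequence without any text. It is not the authors' implementation. It assumes one unit per fixed-length frame (the 50 Hz rate is a placeholder), treats the number of deduplicated units per second as a rough proxy for speaking speed, and uses hypothetical helper names throughout.

```python
from itertools import groupby

# Assumption: one discrete unit per 20 ms frame (50 units per second).
FRAME_RATE_HZ = 50


def dedup(units):
    """Collapse consecutive repeats, e.g. [5, 5, 9, 9, 2] -> [5, 9, 2]."""
    return [unit for unit, _ in groupby(units)]


def duration_seconds(units, frame_rate_hz=FRAME_RATE_HZ):
    """Duration implied by the frame count of a unit sequence."""
    return len(units) / frame_rate_hz


def unit_rate(units, frame_rate_hz=FRAME_RATE_HZ):
    """Deduplicated units per second: a text-free proxy for speaking speed."""
    if not units:
        return 0.0
    return len(dedup(units)) * frame_rate_hz / len(units)


def target_length(source_units):
    """For time-aligned output, request as many target frames as the source has."""
    return len(source_units)


if __name__ == "__main__":
    source_units = [5, 5, 9, 9, 9, 2, 2, 2, 2, 7]
    print("target frames:", target_length(source_units))
    print("duration (s):", duration_seconds(source_units))
    print("unit rate (units/s):", unit_rate(source_units))
```

In this sketch, a translation model would be conditioned on `target_length(source_units)` to match the source duration and on `unit_rate(source_units)` to match its pace; how Dub-S2ST actually injects these signals is not specified in the abstract.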
Taxonomy
Topics: Handwritten Text Recognition Techniques · Subtitles and Audiovisual Media · Speech Recognition and Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN
