Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Jeongsoo Choi; Jaehun Kim; Joon Son Chung

arXiv:2505.20899 · cs.CL · December 30, 2025

TL;DR

This paper presents Dub-S2ST, a textless speech-to-speech translation system for dubbing. It produces natural, time-aligned translated speech that preserves the source speaker's identity, duration, and speaking speed by combining a discrete diffusion-based speech-to-unit translation model with conditional flow matching synthesis.

Contribution

The paper introduces a discrete diffusion-based speech-to-unit translation model with explicit duration control and a unit-based speed adaptation mechanism, enabling speech translation for dubbing without relying on text.

Findings

01 Produces natural, fluent, and time-aligned speech translations.

02 Achieves competitive translation quality with preserved speech characteristics.

03 Demonstrates effectiveness through extensive experiments.

Abstract

This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the translated units and source speaker's identity using a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text.…
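The abstract does not spell out how speaking rate is measured from units, so the following is only a minimal sketch of one plausible text-free proxy: count the units left after collapsing consecutive repeats and divide by the utterance duration. The 50 Hz frame rate, the function names, and the rule that the target length equals the source length are all illustrative assumptions, not details taken from the paper.

```python
from itertools import groupby

# Assumed unit frame rate (one unit per 20 ms); hypothetical, not from the paper.
FRAME_RATE_HZ = 50


def dedup_units(units):
    """Collapse consecutive repeated units, e.g. [5, 5, 7, 7, 7, 5] -> [5, 7, 5]."""
    return [unit for unit, _ in groupby(units)]


def unit_rate(units, frame_rate_hz=FRAME_RATE_HZ):
    """Deduplicated units per second, used here as a text-free proxy for speed."""
    if not units:
        return 0.0
    # Equivalent to len(dedup) / (len(units) / frame_rate_hz).
    return len(dedup_units(units)) * frame_rate_hz / len(units)


def target_length(source_units):
    """Simplest form of duration control (an assumption): match the source frame count."""
    return len(source_units)


source = [5, 5, 7, 7, 7, 5, 9, 9, 9, 9]  # toy unit sequence, 10 frames = 0.2 s
print(unit_rate(source))      # 4 deduplicated units over 0.2 s
print(target_length(source))  # number of frames the translated sequence should span
```

Under these assumptions, a translation model conditioned on the source rate and constrained to the target length would be pushed toward output that is neither rushed nor padded relative to the source. How the actual system conditions on speed is described in the paper itself, not here.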

Taxonomy

Topics: Handwritten Text Recognition Techniques · Subtitles and Audiovisual Media · Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN