Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization

Chaoqun Cui; Liangbin Huang; Shijing Wang; Zhe Tong; Zhaolong Huang; Xiao Zeng; Xiaofeng Liu

arXiv:2508.08550·cs.SD·August 13, 2025

TL;DR

This paper introduces SSPO, a method that aligns speech durations in video dubbing by optimizing translation preferences at the segment level, reducing the duration mismatch between source and target speech.

Contribution

The paper presents Segment Supervised Preference Optimization (SSPO), an approach to fine-grained duration alignment in video dubbing that combines a segment-wise sampling strategy with a fine-grained loss.

Findings

01. SSPO outperforms existing methods in duration alignment accuracy.

02. The method effectively reduces audio-video synchronization issues.

03. Experimental results validate the superiority of SSPO in diverse video dubbing scenarios.

Abstract

Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
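
The abstract does not spell out the sampling strategy or the loss, so the following is only a rough sketch of how duration alignment could be framed as segment-level preference optimization. Everything in it is an assumption made for illustration: the character-count duration proxy `estimate_duration`, the helper `build_segment_pair`, and the use of the standard DPO objective in `dpo_loss` in place of the paper's own fine-grained loss.

```python
import math

def estimate_duration(text, chars_per_second=15.0):
    """Hypothetical proxy: assume spoken duration scales with character count."""
    return len(text) / chars_per_second

def build_segment_pair(source_duration, candidates):
    """For one dialogue line (segment), rank sampled translations by how far
    their estimated duration is from the source duration. The closest becomes
    the 'chosen' response and the farthest the 'rejected' one."""
    ranked = sorted(
        candidates,
        key=lambda c: abs(estimate_duration(c) - source_duration),
    )
    return ranked[0], ranked[-1]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)),
    where margin compares policy and reference log-probabilities.
    This is the generic DPO objective, not necessarily the SSPO loss."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Usage: a source line lasting about 1.0 s and three sampled translations.
candidates = ["Hi.", "Hello there.", "Hello there, how are you doing today?"]
chosen, rejected = build_segment_pair(1.0, candidates)
```

Building one pair per segment, rather than one per full dialogue, is what makes the supervision fine-grained in this sketch: each line contributes its own preference signal, so a single overlong line is penalized even when the total duration of the dialogue happens to match.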
