SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

Xin Cheng; Xihua Wang; Ying Ba; Yuyue Wang; Kaisi Guan; Yinbo Wang; Wenpu Li; Ruihua Song

arXiv:2605.12179·cs.CV·May 13, 2026

TL;DR

SyncDPO is a preference learning framework for post-training video-audio joint generation models. It targets temporal synchronization between audio events and their visual triggers, and reports improved alignment accuracy across diverse benchmarks.

Contribution

The paper proposes SyncDPO, a cost-effective preference learning method with on-the-fly negative sample construction and curriculum learning for better temporal alignment.

Findings

01. SyncDPO outperforms existing methods in temporal alignment accuracy.

02. The approach generalizes well to out-of-distribution benchmarks.

03. Extensive experiments validate the effectiveness of the proposed framework.

Abstract

Recent advances in video-audio (V-A) joint generation have achieved remarkable success in semantic correspondence. However, precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains challenging. Post-training for joint generation is largely dominated by Supervised Fine-Tuning (SFT), but the commonly used Mean Squared Error (MSE) loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative by introducing explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in…
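To make the idea of "explicit misaligned counterparts" concrete, here is a minimal sketch of how a preference pair and the standard DPO loss might be set up. It is an illustration under stated assumptions, not the paper's implementation: the truncated abstract does not say how negatives are built, so the time-shift in `shift_audio` is a hypothetical choice, and the log-likelihood inputs to `dpo_loss` are stand-ins for whatever likelihood surrogate a generative model would supply. Both function names and the default `beta` are invented for this example.

```python
import numpy as np


def shift_audio(audio: np.ndarray, shift: int) -> np.ndarray:
    """Delay a 1-D waveform by `shift` samples, zero-padding the start.

    Hypothetical negative construction: the delayed audio no longer lines
    up with the video, so it can serve as the "rejected" sample.
    """
    if not 0 < shift < audio.shape[0]:
        raise ValueError("shift must be between 1 and len(audio) - 1")
    shifted = np.zeros_like(audio)
    shifted[shift:] = audio[:-shift]
    return shifted


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> float:
    """Standard DPO loss averaged over a batch.

    logp_w / logp_l: policy log-likelihoods of the preferred (aligned)
    and rejected (misaligned) samples; ref_*: the same quantities under
    the frozen reference model. beta is an assumed default.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably via logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))


if __name__ == "__main__":
    aligned = np.array([0.0, 1.0, 0.0, 0.0])   # toy waveform with one event
    misaligned = shift_audio(aligned, shift=2)  # same event, two samples late
    print(misaligned)

    # Toy log-likelihoods: the policy prefers the aligned sample slightly more
    # than the reference does, so the loss falls below log(2).
    loss = dpo_loss(np.array([-1.0]), np.array([-2.0]),
                    np.array([-1.5]), np.array([-1.5]))
    print(loss)
```

Unlike an MSE objective computed on the aligned sample alone, the loss here depends on the gap between the aligned and misaligned samples, which is the property the abstract attributes to DPO.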


Code & Models

Repositories

https://syncdpo.github.io/syncdpo