TL;DR
SyncDPO is an efficient preference-learning framework for post-training video-audio joint generation models. It targets temporal synchronization and reports improved alignment accuracy on both in-distribution and out-of-distribution benchmarks.
Contribution
The paper proposes SyncDPO, a cost-effective preference learning method with on-the-fly negative sample construction and curriculum learning for better temporal alignment.
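The summary does not say how negatives are built or how the curriculum is scheduled, so the following is only a minimal sketch under stated assumptions: a negative is made by circularly shifting the audio track against the video, and the curriculum shrinks the largest allowed shift as training progresses, so misalignments get subtler. The names `curriculum_max_shift` and `make_negative`, the frame-level list representation, and the default values are all hypothetical and are not taken from the paper.

```python
import random


def curriculum_max_shift(step, total_steps, start=16, end=2):
    """Linearly shrink the largest allowed shift (in frames) from start to end.

    Assumed schedule: large, easy-to-detect offsets early; small, subtle ones late.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return round(start + (end - start) * frac)


def make_negative(audio_frames, step, total_steps, rng=random):
    """Build a misaligned negative on the fly by circularly shifting the audio.

    Returns the shifted frames and the signed shift (positive = audio delayed).
    Assumes len(audio_frames) is larger than the maximum shift.
    """
    max_shift = curriculum_max_shift(step, total_steps)
    shift = rng.randint(1, max_shift) * rng.choice((-1, 1))
    k = shift % len(audio_frames)
    return audio_frames[-k:] + audio_frames[:-k], shift


# Usage: the original clip is the preferred sample, the shifted one is rejected.
audio = list(range(32))  # stand-in for 32 audio feature frames
negative, shift = make_negative(audio, step=0, total_steps=100, rng=random.Random(0))
```

Because the negative is derived from the ground-truth pair itself, a scheme like this needs no extra sampling or ranking pass, which is one plausible reading of "on-the-fly" construction.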
Findings
SyncDPO outperforms existing methods in temporal alignment accuracy.
The approach generalizes well to out-of-distribution benchmarks.
Extensive experiments validate the effectiveness of the proposed framework.
Abstract
Recent advances in video-audio (V-A) joint generation have achieved remarkable success in semantic correspondence. However, precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains challenging. Post-training for joint generation is largely dominated by Supervised Fine-Tuning (SFT), but the commonly used Mean Squared Error (MSE) loss penalizes subtle temporal misalignments too weakly. Direct Preference Optimization (DPO) offers an alternative: it introduces explicitly misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in…
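For reference, the standard DPO objective that the abstract builds on can be sketched as follows. This is the generic pairwise form over log-likelihoods, not necessarily the exact loss used in SyncDPO; generative models for video and audio often substitute a surrogate for the log-likelihood, which the truncated abstract does not specify. The function name `dpo_loss` and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred (aligned)
    and rejected (misaligned) samples.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 only when the policy favors the aligned sample over the misaligned one more strongly than the reference model does, which is how the explicit negative supplies a signal that a plain MSE objective lacks.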
