Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

Lin Zhang; Zefan Cai; Yufan Zhou; Shentong Mo; Jinhong Lin; Cheng-En Wu; Yibing Wei; Yijing Zhang; Ruiyi Zhang; Wen Xiao; Tong Sun; Junjie Hu; Pedro Morgado

arXiv:2508.03955 · cs.CV · August 7, 2025

TL;DR

This paper introduces an efficient two-stage training method for audio-synchronized visual animation that leverages noisy videos for pretraining and minimal high-quality data for fine-tuning, enabling scalable and diverse animation generation.

Contribution

The authors propose a two-stage training paradigm that cuts manual curation effort by more than a factor of 10 and scales to diverse classes by pretraining on noisy videos and fine-tuning on a small amount of high-quality data.

Findings

01. Significantly reduces manual curation effort.

02. Generalizes well to diverse and open-world classes.

03. Achieves improved synchronization with multi-feature conditioning.

Abstract

Recent advances in audio-synchronized visual animation enable control of video content using audio from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we fine-tune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and…
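The two-stage schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' model or code: a scalar linear regressor stands in for the animation model, and the data sizes, noise level, learning rate, and helper names (`sgd_fit`, `mse`) are all assumptions chosen only to show the shape of "pretrain on a large noisy pool, then fine-tune on a small clean set".

```python
import random


def sgd_fit(w, data, lr, epochs):
    """Fit y ~ w * x with plain SGD on squared error; return the updated weight."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of the scalar model on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
TRUE_W = 3.0  # hypothetical ground-truth relation between input and target

# Stage-one pool (assumed): large, gathered without manual checks, noisy targets.
noisy_pool = [(x, TRUE_W * x + rng.gauss(0.0, 1.0))
              for x in (rng.uniform(-1.0, 1.0) for _ in range(2000))]

# Stage-two set (assumed): small and clean, standing in for manual curation.
clean_set = [(x, TRUE_W * x)
             for x in (rng.uniform(-1.0, 1.0) for _ in range(20))]

# Clean held-out data used only for evaluation.
held_out = [(x, TRUE_W * x)
            for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]

w_init = 0.0
w_pre = sgd_fit(w_init, noisy_pool, lr=0.05, epochs=1)   # stage one: pretrain
w_ft = sgd_fit(w_pre, clean_set, lr=0.05, epochs=20)     # stage two: fine-tune

print("held-out MSE, untrained :", mse(w_init, held_out))
print("held-out MSE, pretrained:", mse(w_pre, held_out))
print("held-out MSE, fine-tuned:", mse(w_ft, held_out))
```

In this toy setting the noisy pool moves the weight most of the way to its target, and the small clean set removes the residual error left by the noise, which mirrors the division of labor the abstract describes between the two stages.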
