TL;DR
InstructAV2AV is an end-to-end framework for instruction-guided joint editing of audio and video. It is trained on the large-scale InsAVE-80K dataset with a two-stage strategy and is reported to outperform existing methods.
Contribution
The paper introduces InstructAV2AV, a new model architecture for synchronized, instruction-guided audio-video editing, together with InsAVE-80K, the first large-scale audio-video editing dataset, built with a scalable data synthesis pipeline.
Findings
Outperforms state-of-the-art methods on 11 metrics
Constructs InsAVE-80K, the first large-scale audio-video editing dataset
Demonstrates effective instruction following and content preservation
Abstract
Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive…
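The two conditioning ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not say along which axis the source input is concatenated with the noisy latents, nor how the source-instruction gate is computed. The channel-axis concatenation, the sigmoid-gated residual attention, and the names `concat_source_context`, `gated_context_attention`, and `gate_logit` are all assumptions made for illustration.

```python
import numpy as np


def concat_source_context(noisy_latent, source_latent):
    """Anchor the source context by stacking it with the noisy latent.

    Assumption: both latents share the shape (channels, ...) and are
    joined along the channel axis; the paper may use a different axis.
    """
    if noisy_latent.shape[1:] != source_latent.shape[1:]:
        raise ValueError("latents must match on all non-channel axes")
    return np.concatenate([noisy_latent, source_latent], axis=0)


def gated_context_attention(x, context, gate_logit):
    """Hypothetical gated attention: x + sigmoid(gate_logit) * Attn(x, context).

    x:       (num_tokens, dim) hidden states of the edited stream
    context: (num_context, dim) source or instruction tokens
    A strongly negative gate_logit suppresses the context contribution,
    a strongly positive one lets it through in full.
    """
    dim = x.shape[-1]
    scores = x @ context.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return x + gate * (weights @ context)
```

In this sketch the concatenation doubles the channel count seen by the backbone, and the scalar gate trades off how strongly the context tokens influence the edited stream, which is one plausible way to balance instruction following against content preservation.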
