InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

Haojie Zheng; Yixin Yang; Siqi Yang; Shuchen Weng; Boxin Shi

arXiv:2605.18467·cs.CV·May 19, 2026

TL;DR

InstructAV2AV is an end-to-end framework for instruction-guided joint editing of audio and video. It is trained on InsAVE-80K, a large-scale synthesized dataset of source-to-target editing pairs, with a two-stage training strategy, and is reported to outperform existing methods.

Contribution

The paper introduces InstructAV2AV, described as the first end-to-end framework for instruction-guided audio-video joint editing, together with a scalable data synthesis pipeline used to build the InsAVE-80K dataset.

Findings

01 Outperforms state-of-the-art methods on 11 metrics

02 Constructed InsAVE-80K, the first large-scale audio-video editing dataset

03 Demonstrates effective instruction following and content preservation

Abstract

Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive…
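
The abstract names two mechanisms without giving their details: concatenating the source audio-video input with the noisy latent codes, and a source-instruction gated attention. The NumPy sketch below is one possible reading, not the authors' implementation. Concatenating along the token axis, the per-token sigmoid gate, and every name in it (`anchor_source`, `gated_attention`, `w_gate`, `b_gate`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Plain scaled dot-product attention: q is (Tq, d), k and v are (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def anchor_source(noisy_latents, source_latents):
    """Assumption: source tokens are prepended to the noisy tokens (token axis)."""
    return np.concatenate([source_latents, noisy_latents], axis=0)


def gated_attention(x, source_tokens, instruction_tokens, w_gate, b_gate):
    """Hypothetical gate: a per-token sigmoid blends the two attention outputs."""
    from_source = attend(x, source_tokens, source_tokens)
    from_instruction = attend(x, instruction_tokens, instruction_tokens)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # shape (T, 1), in (0, 1)
    # gate near 1 follows the instruction; gate near 0 preserves the source.
    return gate * from_instruction + (1.0 - gate) * from_source


rng = np.random.default_rng(0)
d = 8
noisy = rng.normal(size=(6, d))        # stand-in for noisy latent codes
source = rng.normal(size=(6, d))       # stand-in for encoded source audio-video
instruction = rng.normal(size=(4, d))  # stand-in for encoded instruction text
w_gate = rng.normal(size=(d, 1))
b_gate = 0.0

tokens = anchor_source(noisy, source)
out = gated_attention(tokens, source, instruction, w_gate, b_gate)
```

In this reading the gate makes the trade-off between instruction following and content preservation explicit per token; the paper's actual gating may be defined differently.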

Code & Models

Repositories

https://hjzheng.net/projects/InstructAV2AV
