TL;DR
AVI-Edit is a framework for audio-synchronized video instance editing. It combines a granularity-aware mask refiner with a self-feedback audio agent and is reported to outperform existing methods.
Contribution
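The paper introduces AVI-Edit, which pairs a granularity-aware mask refiner (turning coarse user-provided masks into precise instance-level regions) with a self-feedback audio agent (curating audio guidance for temporal control). It also contributes a new large-scale dataset with instance-centric correspondence. Together these enable fine-grained, audio-synchronized video editing at the instance level.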
Findings
AVI-Edit achieves superior visual quality compared to state-of-the-art methods.
AVI-Edit demonstrates improved audio-visual synchronization.
AVI-Edit provides fine-grained spatial and temporal control for video editing.
Abstract
Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods…
