TL;DR
This paper introduces a video object removal method that combines external knowledge transfer (distillation from vision foundation models) with internal context grounding (framewise context cross-attention) to remove objects and their side effects more coherently and completely. It reports state-of-the-art results.
Contribution
It proposes a dual-guidance framework that integrates distillation from vision foundation models with framewise context cross-attention to improve video object removal.
Findings
Achieves state-of-the-art performance on video object removal benchmarks.
Establishes the first real-world benchmark for this task.
Demonstrates effective removal of shadows, reflections, and illumination effects.
Abstract
Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in…
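The framewise context cross-attention named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract is truncated before it says what each denoising block is grounded in, so `context_tokens` is a hypothetical per-frame tensor, and the function name, shapes, and projection matrices `w_q`, `w_k`, `w_v` are assumptions made for illustration. The only property the sketch shows is the "framewise" one: tokens of frame t attend only to the context of frame t.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def framewise_cross_attention(video_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention applied independently per frame.

    video_tokens:   (T, N, D) latent tokens, N per frame (assumed layout)
    context_tokens: (T, M, D) hypothetical per-frame context, M per frame
    w_q, w_k, w_v:  (D, D) projection matrices (assumed, untrained here)
    """
    q = video_tokens @ w_q        # (T, N, D)
    k = context_tokens @ w_k      # (T, M, D)
    v = context_tokens @ w_v      # (T, M, D)
    # Batched matmul over the frame axis: frame t only sees frame t's context.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, N, M)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights   # (T, N, D), (T, N, M)

rng = np.random.default_rng(0)
T, N, M, D = 4, 6, 3, 8
video_tokens = rng.normal(size=(T, N, D))
context_tokens = rng.normal(size=(T, M, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

out, weights = framewise_cross_attention(
    video_tokens, context_tokens, w_q, w_k, w_v
)
```

Because the frame axis is treated as a batch dimension, changing the context of one frame leaves the outputs of all other frames untouched. Any temporal mixing would have to come from other layers of the model.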
