MiniMax-Remover: Taming Bad Noise Helps Video Object Removal
Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, Kam-Fai Wong

TL;DR
MiniMax-Remover introduces a two-stage, efficient video object removal method that eliminates the need for textual guidance and reduces sampling steps, achieving state-of-the-art results with faster inference.
Contribution
The paper presents a lightweight, two-stage video object removal approach that drops both textual input and classifier-free guidance, which speeds up inference while improving removal quality.
Findings
Achieves state-of-the-art removal results with as few as 6 sampling steps
Does not rely on classifier-free guidance, enhancing inference speed
Demonstrates superior performance over existing methods through extensive experiments
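To make the efficiency claim concrete, the following minimal sketch counts denoiser forward passes per sampling run. It assumes the common way of implementing classifier-free guidance, where the network is evaluated twice per step (once conditional, once unconditional). The function name `forward_passes` and the 50-step baseline are hypothetical choices for illustration and are not taken from the paper.

```python
def forward_passes(sampling_steps: int, uses_cfg: bool) -> int:
    """Count denoiser forward passes for one sampling run.

    Assumption: with classifier-free guidance (CFG) the network runs twice
    per step (conditional + unconditional); without CFG it runs once.
    """
    if sampling_steps < 1:
        raise ValueError("sampling_steps must be at least 1")
    passes_per_step = 2 if uses_cfg else 1
    return sampling_steps * passes_per_step


# Hypothetical baseline for comparison: 50 steps with CFG (not from the paper).
baseline = forward_passes(50, uses_cfg=True)

# Setting reported in the findings above: 6 steps, no CFG.
remover = forward_passes(6, uses_cfg=False)

print(f"baseline passes: {baseline}, remover passes: {remover}")
```

Under these assumptions the forward-pass count, and not the wall-clock time, is what is being compared; actual speedups also depend on model size and hardware.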
Abstract
Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that text conditioning is not well suited to this task, we simplify the pretrained video generation model by removing textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture in the first stage. In the second stage, we distill our remover on successful videos produced by the stage-1 model and curated by…
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Digital Media Forensic Detection · Physical Unclonable Functions (PUFs) and Hardware Security
Methods: Diffusion
