EraserDiT: Fast Video Inpainting with Diffusion Transformer Model
Jie Liu, Zheng Hui

TL;DR
EraserDiT introduces a diffusion transformer-based method for fast, high-quality video inpainting that maintains long-term temporal consistency and handles large masked regions effectively.
Contribution
The paper presents a novel diffusion transformer model with a Circular Position-Shift strategy for improved long-term temporal consistency in video inpainting.
Findings
Inpaints a 97-frame video at high quality in 65 seconds on a single GPU.
Demonstrates superior content fidelity and texture restoration.
Maintains long-term temporal consistency effectively.
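To put the reported speed in per-frame terms, the figure of 97 frames in 65 seconds can be converted to an average throughput. This is only a back-of-the-envelope sketch: the helper name `throughput` is hypothetical, and it assumes the 65 seconds covers the whole clip end to end, which the summary does not spell out.

```python
def throughput(frames: int, seconds: float) -> tuple[float, float]:
    """Return (frames per second, seconds per frame) for a processed clip."""
    if frames <= 0 or seconds <= 0:
        raise ValueError("frames and seconds must be positive")
    return frames / seconds, seconds / frames


# Figures quoted in the findings: 97 frames in 65 seconds on a single GPU.
fps, sec_per_frame = throughput(97, 65.0)
print(f"{fps:.2f} frames/s, {sec_per_frame:.2f} s/frame")
# prints "1.49 frames/s, 0.67 s/frame"
```

These averages say nothing about resolution or GPU model, neither of which appears in the summary above.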
Abstract
Video object removal and inpainting are critical tasks in the fields of computer vision and multimedia processing, aimed at restoring missing or corrupted regions in video sequences. Traditional methods predominantly rely on flow-based propagation and spatio-temporal Transformers, but these approaches face limitations in effectively leveraging long-term temporal features and ensuring temporal consistency in the completion results, particularly when dealing with large masks. Consequently, performance on extensive masked areas remains suboptimal. To address these challenges, this paper introduces a novel video inpainting approach leveraging the Diffusion Transformer (DiT). DiT synergistically combines the advantages of diffusion models and transformer architectures to maintain long-term temporal consistency while ensuring high-quality inpainting results. We propose a Circular…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Digital Media Forensic Detection
