EraserDiT: Fast Video Inpainting with Diffusion Transformer Model

Jie Liu; Zheng Hui

arXiv:2506.12853 · cs.CV · August 19, 2025

TL;DR

EraserDiT introduces a diffusion transformer-based method for fast, high-quality video inpainting that maintains long-term temporal consistency and handles large masked regions effectively.

Contribution

The paper presents a novel diffusion transformer model with a Circular Position-Shift strategy for improved long-term temporal consistency in video inpainting.

Findings

01. Inpaints a 97-frame video in 65 seconds on a single GPU while keeping output quality high.

02. Demonstrates superior content fidelity and texture restoration.

03. Maintains long-term temporal consistency effectively.
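
For a rough sense of scale, the runtime figure in the first finding can be converted into per-frame cost and throughput. The snippet below is only a back-of-the-envelope sketch: the helper name `throughput` is hypothetical, and because this summary states neither the GPU model nor the video resolution, the result is comparable only with numbers measured under the same setup.

```python
def throughput(frames: int, seconds: float) -> tuple[float, float]:
    """Return (seconds per frame, frames per second) for one inpainting run.

    Hypothetical helper for illustration; it is not part of EraserDiT.
    """
    if frames <= 0 or seconds <= 0:
        raise ValueError("frames and seconds must be positive")
    return seconds / frames, frames / seconds


# Figures taken from the first finding: 97 frames in 65 seconds.
sec_per_frame, fps = throughput(97, 65.0)
print(f"{sec_per_frame:.2f} s/frame, {fps:.2f} frames/s")
# prints "0.67 s/frame, 1.49 frames/s"
```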

Abstract

Video object removal and inpainting are critical tasks in the fields of computer vision and multimedia processing, aimed at restoring missing or corrupted regions in video sequences. Traditional methods predominantly rely on flow-based propagation and spatio-temporal Transformers, but these approaches face limitations in effectively leveraging long-term temporal features and ensuring temporal consistency in the completion results, particularly when dealing with large masks. Consequently, performance on extensive masked areas remains suboptimal. To address these challenges, this paper introduces a novel video inpainting approach leveraging the Diffusion Transformer (DiT). DiT synergistically combines the advantages of diffusion models and transformer architectures to maintain long-term temporal consistency while ensuring high-quality inpainting results. We propose a Circular…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Digital Media Forensic Detection