Deficiency-Aware Masked Transformer for Video Inpainting
Yongsheng Yu, Heng Fan, Libo Zhang

TL;DR
This paper introduces a deficiency-aware masked transformer framework for video inpainting that effectively handles cases lacking cross-frame guidance by leveraging a dual-model approach, attention mechanisms, and contextual modules.
Contribution
The paper proposes a novel dual-modality-compatible inpainting framework with pretraining, selective self-attention, and a contextualizer to improve video inpainting, especially in deficiency scenarios.
Findings
DMT_vid outperforms previous methods on YouTube-VOS and DAVIS datasets.
Pretraining with DMT_img enhances hallucination in deficiency cases.
Selective attention accelerates inference and reduces noise.
Abstract
Recent video inpainting methods have made remarkable progress by utilizing explicit guidance, such as optical flow, to propagate cross-frame pixels. However, there are cases where cross-frame recurrence of the masked video is not available, resulting in a deficiency. In such situations, instead of borrowing pixels from other frames, the focus of the model shifts towards addressing the inverse problem. In this paper, we introduce a dual-modality-compatible inpainting framework called Deficiency-aware Masked Transformer (DMT), which offers three key advantages. Firstly, we pretrain an image inpainting model DMT_img to serve as a prior for distilling the video model DMT_vid, thereby benefiting the hallucination of deficiency cases. Secondly, the self-attention module selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals. Thirdly, a simple yet effective…
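The second point in the abstract, self-attention that selectively incorporates spatiotemporal tokens, can be illustrated with a rough sketch. This is not the authors' implementation: the selection criterion (`keep_scores`), the top-k rule, and the absence of learned query/key/value projections are all assumptions made for illustration. It only shows why attending to a subset of tokens lowers cost: the attention matrix shrinks from N x N to N x k.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_self_attention(tokens, keep_scores, k):
    """Every token queries only the k highest-scoring tokens.

    tokens:      (N, D) flattened spatiotemporal tokens
    keep_scores: (N,) hypothetical per-token relevance score (an assumption,
                 not the paper's selection rule)
    k:           number of tokens kept as keys/values
    """
    n, d = tokens.shape
    keep = np.argsort(-keep_scores)[:k]        # indices of the k best tokens
    kv = tokens[keep]                          # (k, D) selected keys/values
    attn = softmax(tokens @ kv.T / np.sqrt(d)) # (N, k) instead of (N, N)
    return attn @ kv, attn, keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))    # e.g. 3 frames x 4 patches, 8-dim features
keep_scores = rng.uniform(size=12)   # placeholder scores
out, attn, keep = selective_self_attention(tokens, keep_scores, k=5)
print(out.shape, attn.shape)
```

Dropping low-scoring tokens from the key/value set is also one plausible way such a module could suppress noisy signals, since those tokens can no longer contribute to any output.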
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization
