Decoupled Spatial-Temporal Transformer for Video Inpainting
Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, Jifeng Dai, Hongsheng Li

TL;DR
This paper introduces a Decoupled Spatial-Temporal Transformer (DSTT) that improves video inpainting by separately attending to spatial textures and temporal object movements, achieving higher quality results with greater efficiency.
Contribution
The paper proposes a novel DSTT architecture that disentangles spatial and temporal attention, enhancing inpainting quality and computational efficiency over existing Transformer-based methods.
Findings
Outperforms state-of-the-art video inpainting methods
Achieves higher efficiency with reduced computational cost
Produces more plausible and temporally-coherent inpainted videos
Abstract
Video inpainting aims to fill given spatiotemporal holes with realistic appearance, but it remains a challenging task even with flourishing deep learning approaches. Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance. However, these methods still suffer from synthesizing blurry textures as well as from huge computational cost. To this end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency. Our proposed DSTT disentangles the task of learning spatial-temporal attention into two sub-tasks: one attends to temporal object movements across different frames at the same spatial locations, which is achieved by a temporally-decoupled Transformer block, and the other attends to similar background textures across all spatial positions of the same frame, which is achieved by…
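The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes a video already encoded as a grid of feature vectors `x[t][n]` (frame `t`, spatial location `n`) and uses plain single-head self-attention with no learned projections, multi-head splitting, or feed-forward layers. The function names (`attend`, `temporal_attention`, `spatial_attention`, `pairwise_scores`) are hypothetical and are not taken from the authors' code; the sketch only shows which tokens attend to which, not the DSTT architecture itself.

```python
import math


def attend(tokens):
    """Single-head self-attention over a list of vectors (no learned projections)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the tokens.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) / total
                    for i in range(dim)])
    return out


def temporal_attention(x):
    """For each spatial location, attend across frames only."""
    T, N = len(x), len(x[0])
    out = [[None] * N for _ in range(T)]
    for n in range(N):
        column = attend([x[t][n] for t in range(T)])
        for t in range(T):
            out[t][n] = column[t]
    return out


def spatial_attention(x):
    """For each frame, attend across spatial locations only."""
    return [attend(frame) for frame in x]


def pairwise_scores(T, N):
    """Count attention-score entries: joint attention vs. the two decoupled steps."""
    joint = (T * N) ** 2                # every token attends to every token
    decoupled = N * T ** 2 + T * N ** 2  # N temporal groups plus T spatial groups
    return joint, decoupled
```

Under these assumptions, `pairwise_scores(5, 100)` returns `(250000, 52500)`: attending jointly over all 500 tokens needs 250,000 score entries, while one temporal pass plus one spatial pass needs 52,500. This counts score entries only and is not a measurement of the paper's reported efficiency.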
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Inpainting · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Dropout · Adam
