Spatial-Temporal Transformer based Video Compression Framework
Yanbo Gao, Wenjia Huang, Shuai Li, Hui Yuan, Mao Ye, Siwei Ma

TL;DR
This paper introduces a novel spatial-temporal transformer framework for video compression that improves motion estimation and residual coding, achieving significant bitrate savings over existing methods.
Contribution
It proposes a unified transformer-based framework with specialized modules for motion estimation, multi-reference prediction, and residual compression, addressing stability and efficiency issues in learned video compression.
Findings
Achieves 13.5% BD-Rate saving over VTM
Demonstrates stable motion estimation with RDT
Enhances residual compression efficiency
Abstract
Learned video compression (LVC) has witnessed remarkable advancements in recent years. Similar as the traditional video coding, LVC inherits motion estimation/compensation, residual coding and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and its training mechanism using gradient backpropagation, most existing works often struggle to consistently generate stable motion information, which is in the form of geometric features, from the input color features. Moreover, the modules such as the inter-prediction and residual coding are independent from each other, making it inefficient to fully reduce the spatial-temporal redundancy. To address the above problems, in this paper, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Vision and Imaging · Advanced Image Processing Techniques · Advanced Data Compression Techniques
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Byte Pair Encoding · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam
