Context Guided Transformer Entropy Modeling for Video Compression
Junlong Tong, Wei Zhang, Yaohui Jin, Xiaoyu Shen

TL;DR
The paper introduces the Context Guided Transformer (CGT) entropy model for video compression, which leverages spatio-temporal context to reduce redundancy while lowering computational cost and improving compression performance.
Contribution
The CGT entropy model explicitly models the order of spatial dependencies and cuts entropy modeling time by approximately 65%, while achieving an 11% BD-Rate reduction over the prior state of the art.
Findings
Reduces entropy modeling time by approximately 65%.
Achieves an 11% BD-Rate reduction compared to the prior state of the art.
Effectively leverages temporal and spatial context for improved video compression.
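BD-Rate, the metric behind the 11% figure, is the average bitrate difference between two codecs at equal quality. The sketch below shows how such a number can be computed. It assumes piecewise-linear interpolation of log-bitrate over PSNR, a simplification of the cubic fit used in the classic Bjøntegaard calculation. The function names and the rate-distortion points are hypothetical and are not results from the paper.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside the interpolation range")

def bd_rate(anchor, test, steps=1000):
    """Average bitrate change (%) of `test` relative to `anchor` at equal PSNR.

    Each curve is a list of (bitrate, psnr) points with distinct PSNR values.
    A negative result means `test` needs fewer bits for the same quality.
    """
    def prepare(curve):
        pts = sorted(curve, key=lambda p: p[1])          # sort by quality
        return [p[1] for p in pts], [math.log(p[0]) for p in pts]

    q_a, r_a = prepare(anchor)
    q_t, r_t = prepare(test)

    # Only the quality range covered by both curves is comparable.
    lo, hi = max(q_a[0], q_t[0]), min(q_a[-1], q_t[-1])
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")

    # Midpoint rule: mean difference of log-bitrate over the shared range.
    total = 0.0
    for i in range(steps):
        q = lo + (i + 0.5) * (hi - lo) / steps
        total += _interp(q, q_t, r_t) - _interp(q, q_a, r_a)
    mean_log_diff = total / steps

    return (math.exp(mean_log_diff) - 1.0) * 100.0

# Made-up rate-distortion points: (bitrate in kbps, PSNR in dB).
anchor_curve = [(1000, 32.0), (2000, 35.0), (4000, 38.0), (8000, 40.5)]
# A hypothetical codec that spends 11% fewer bits at every quality level.
test_curve = [(rate * 0.89, psnr) for rate, psnr in anchor_curve]

result = bd_rate(anchor_curve, test_curve)
print(round(result, 2))  # about -11.0 for these made-up points
```

Averaging in the log-bitrate domain gives equal weight to relative savings at low and high bitrates, which is why the result is reported as a percentage.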
Abstract
Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models do not explicitly model the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as dependency-weighted spatial context…
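The idea of resampling temporal context with latent queries can be illustrated with a single cross-attention step, in which a small, fixed set of queries attends over a longer sequence of context tokens. This is a simplified sketch and not the paper's architecture. It assumes single-head attention with no learned projections or feed-forward layers, and random vectors stand in for the learned queries and the temporal features. All names and sizes are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def resample(latent_queries, context_tokens):
    """Cross-attention: each latent query summarizes all context tokens.

    latent_queries: M vectors of length d (learned in a real model).
    context_tokens: N vectors of length d (temporal context features).
    Returns M vectors of length d, regardless of N.
    """
    d = len(latent_queries[0])
    resampled = []
    for q in latent_queries:
        # Scaled dot-product score between this query and every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context_tokens]
        weights = softmax(scores)
        # Weighted average of the tokens (keys double as values here).
        resampled.append([
            sum(w * k[c] for w, k in zip(weights, context_tokens))
            for c in range(d)
        ])
    return resampled

random.seed(0)
d, num_tokens, num_queries = 4, 64, 8   # hypothetical sizes
context = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_tokens)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]

summary = resample(queries, context)
print(len(context), "->", len(summary))  # 64 -> 8
```

Because the output length is set by the number of queries rather than the number of context tokens, any later stage that attends over the resampled context works on a shorter sequence, which is one way the downstream cost reduction described in the abstract could arise.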
Taxonomy
Topics: Video Coding and Compression Technologies · Advanced Data Compression Techniques · Image and Video Quality Assessment
