# Patched-Based Swin Transformer Hyperprior for Learned Image Compression

**Authors:** Sibusiso B. Buthelezi, Jules R. Tapamo

PMC · DOI: 10.3390/jimaging12010012 · Journal of Imaging · 2025-12-26

## TL;DR

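This paper introduces a learned image compression method that pairs a CNN-based variational autoencoder with a patch-based Swin Transformer hyperprior, capturing both local and global image structure to deliver higher visual quality at lower bitrates.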

## Contribution

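The main contribution is a hybrid framework whose patch-based Swin Transformer hyperprior uses shifted window self-attention to model both local and global dependencies in the latent space while remaining computationally efficient.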

## Key findings

- The proposed method achieves higher visual quality at lower bitrates than methods that rely on simpler CNN-based entropy priors.
- By learning a more accurate probability distribution of the latent representation, the model improves bitrate estimation and produces a more compact latent representation.
- Experiments on the Kodak, JPEG AI, and CLIC datasets show superior rate-distortion performance relative to those CNN-based priors.
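To illustrate why a more accurate latent probability model translates into a lower bitrate, here is a minimal sketch of the rate estimate commonly used in hyperprior-style codecs. It assumes a discretized Gaussian entropy model, which this summary does not confirm is the paper's exact parameterization. The function names and the toy latent values are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def estimated_bits(latents, mean: float, scale: float) -> float:
    """Estimated code length, in bits, of integer-quantized latents.

    Assumption: each symbol y is assigned the Gaussian probability mass on
    [y - 0.5, y + 0.5]; an ideal entropy coder spends about -log2(p) bits.
    """
    total = 0.0
    for y in latents:
        upper = normal_cdf((y + 0.5 - mean) / scale)
        lower = normal_cdf((y - 0.5 - mean) / scale)
        p = max(upper - lower, 1e-9)  # floor avoids log2(0) for far outliers
        total += -math.log2(p)
    return total


# Hypothetical quantized latents clustered around zero.
latents = [0, 1, -1, 0, 2, 0, -1, 0]

bits_matched = estimated_bits(latents, mean=0.0, scale=1.0)  # scale fits the data
bits_broad = estimated_bits(latents, mean=0.0, scale=8.0)    # scale far too wide

print(f"matched model: {bits_matched:.2f} bits")
print(f"broad model:   {bits_broad:.2f} bits")
```

The same symbols cost fewer bits under the model whose scale matches their spread. In a hyperprior codec the mean and scale would be predicted per latent element from side information rather than fixed by hand as they are here.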

## Abstract

We present a hybrid end-to-end learned image compression framework that combines a CNN-based variational autoencoder (VAE) with an efficient hierarchical Swin Transformer to address the limitations of existing entropy models in capturing global dependencies under computational constraints. Traditional VAE-based codecs typically rely on CNN-based priors with localized receptive fields, which are insufficient for modelling the complex, high-dimensional dependencies of the latent space, thereby limiting compression efficiency. While fully global transformer-based models can capture long-range dependencies, their high computational complexity makes them impractical for high-resolution image compression. To overcome this trade-off, our approach couples a CNN-based VAE with a patch-based hierarchical Swin Transformer hyperprior that employs shifted window self-attention to effectively model both local and global contextual information while maintaining computational efficiency. The proposed framework tightly integrates this expressive entropy model with an end-to-end differentiable quantization module, enabling joint optimization of the complete rate-distortion objective. By learning a more accurate probability distribution of the latent representation, the model achieves improved bitrate estimation and a more compact latent representation, resulting in enhanced compression performance. We validate our approach on the widely used Kodak, JPEG AI, and CLIC datasets, demonstrating that the proposed hybrid architecture achieves superior rate-distortion performance, delivering higher visual quality at lower bitrates compared to methods relying on simpler CNN-based entropy priors. This work demonstrates the effectiveness of integrating efficient transformer architectures into learned image compression and highlights their potential for advancing entropy modelling beyond conventional CNN-based designs.
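The shifted window mechanism mentioned in the abstract can be sketched on a toy token grid. The following is a minimal, dependency-free illustration of window partitioning and the cyclic shift used in Swin-style attention, not the authors' implementation. All function names, the 4x4 grid, and the window size of 2 are assumptions chosen for readability.

```python
def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, window):
    """Split the grid into non-overlapping window x window token groups."""
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window size")
    return [
        [grid[r][c] for r in range(r0, r0 + window) for c in range(c0, c0 + window)]
        for r0 in range(0, h, window)
        for c0 in range(0, w, window)
    ]


def attention_pairs(height, width, window=None):
    """Count query-key pairs: global attention if window is None, else windowed."""
    tokens = height * width
    if window is None:
        return tokens ** 2
    num_windows = (height // window) * (width // window)
    return num_windows * (window * window) ** 2


# Toy 4x4 grid of token ids 0..15 (hypothetical latent positions).
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular = window_partition(grid, window=2)
shifted = window_partition(cyclic_shift(grid, shift=1), window=2)

print("regular windows:", regular)
print("shifted windows:", shifted)
print("global pairs:", attention_pairs(4, 4))
print("windowed pairs:", attention_pairs(4, 4, window=2))
```

In the regular layout each token attends only within its own 2x2 window. After the shift, the first window groups tokens 5, 6, 9, and 10, which came from four different regular windows, so alternating the two layouts lets information cross window boundaries while the attention cost stays well below that of global attention. Full Swin implementations also mask attention between tokens that become neighbours only through the wrap-around; that step is omitted here.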

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12842631/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12842631/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12842631/full.md

---
Source: https://tomesphere.com/paper/PMC12842631