Generative Image Compression by Estimating Gradients of the Rate-variable Feature Distribution
Minghao Han, Weiyi You, Jinhua Zhang, Leheng Zhang, Ce Zhu, Shuhang Gu

TL;DR
This paper introduces a diffusion-based generative image compression method that models the compression process as a stochastic differential equation, enabling photo-realistic reconstructions with smooth rate control and minimal sampling steps.
Contribution
It presents a novel diffusion framework for generative image compression that directly models the compression process as an SDE, improving reconstruction quality and rate flexibility.
Findings
Outperforms existing GIC methods on benchmark datasets
Achieves high perceptual and statistical quality with few sampling steps
Enables smooth rate adjustment in image compression
Abstract
While learned image compression (LIC) focuses on efficient data transmission, generative image compression (GIC) extends this framework by integrating generative modeling to produce photo-realistic reconstructed images. In this paper, we propose a novel diffusion-based generative modeling framework tailored for generative image compression. Unlike prior diffusion-based approaches that indirectly exploit diffusion modeling, we reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs). A reverse neural network is trained to reconstruct images by reversing the compression process directly, without requiring Gaussian noise initialization. This approach achieves smooth rate adjustment and photo-realistic reconstructions with only a minimal number of sampling steps. Extensive experiments on benchmark datasets demonstrate that…
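As a loose illustration of the idea that compression at different rates can play the role of a forward path, the sketch below quantizes a toy latent at increasingly coarse step sizes and compares the quantization error variance with step^2/12, the variance of uniform noise of that width. It is not the paper's SDE or codec: the Gaussian toy latent, the step sizes, and the helper names `quantize` and `forward_path` are all assumptions made for illustration.

```python
import random
import statistics


def quantize(value, step):
    """Uniform scalar quantization: snap value to the nearest multiple of step."""
    return step * round(value / step)


def forward_path(latent, steps):
    """Quantize the same latent at each step size (hypothetical helper).

    Returns one quantized copy of the latent per step size, ordered as in steps.
    """
    return [[quantize(v, s) for v in latent] for s in steps]


rng = random.Random(0)
# Toy stand-in for a latent representation (assumption: i.i.d. Gaussian values).
latent = [rng.gauss(0.0, 4.0) for _ in range(20000)]
# Coarser step sizes correspond to lower bitrates (illustrative values).
steps = [0.25, 0.5, 1.0, 2.0]

error_variances = []
for s, y_hat in zip(steps, forward_path(latent, steps)):
    errors = [q - v for q, v in zip(y_hat, latent)]
    var = statistics.pvariance(errors)
    error_variances.append(var)
    # step^2 / 12 is the variance of uniform noise on [-step/2, step/2].
    print(f"step={s:<5} error variance={var:.4f}  step^2/12={s * s / 12:.4f}")
```

In this toy setting the step size behaves like a noise-level parameter: lowering the rate moves the latent further from its original value, which is the kind of degradation a reverse network would be trained to undo.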
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The central idea of this paper is a neat one: replace the standard forward process of a DDPM with a lossy compression process, then train a network to reverse that process. I think this is an interesting and promising direction. They achieve good results against common benchmark methods for generative image compression.
I have two broad concerns: (1) this paper does not engage with previous work on the topic, and (2) I suspect that some of its methods, although they achieve good practical results, are not mathematically sound.
1. I don't think this is the first paper to attempt to undo quantization error in VAE latents using diffusion models. I believe that [1], [2], and [3] are basically already doing this. There are probably more papers too; those are just the first three I encountered. I find it concerning that n
The manuscript is well-prepared and organized. The proposed method is novel.
1. The paper lacks comparisons with recent generative image compression methods. Examples include TACO [1], ICISP [2], and DiffEIC [3].
[1] Lee H, Kim M, Kim J H, et al. Neural image compression with text-guided encoding for both pixel-level and perceptual fidelity[J]. arXiv preprint arXiv:2403.02944, 2024.
[2] Wei H, Zhou Y, Jia Y, et al. A Lightweight Model for Perceptual Image Compression via Implicit Priors[J]. arXiv preprint arXiv:2502.13988, 2025.
[3] Li Z, Zhou Y, Wei H, et al. Towards ext
1. Recasting rate-variable quantization as a diffusion forward process is elegant and interesting.
2. Two-step decoding yields substantially lower diffusion time than CDC and related methods; Table C/3 reports much faster decoding than several diffusion baselines.
3. On DIV2K and CLIC, the method is consistently competitive or better on MUSIQ and CLIPIQA, and sometimes on FID and KID.
4. The training design is clear and reproducible. Detailed hyperparameter settings and ablations are provided.
5. Results includ
1. The derivation of the reverse process (Section 3), to my understanding, assumes Gaussianity and applies the Tweedie-Miyasawa relation (Eq. 2) directly to uniform noise. A formal justification (or approximation argument) is missing. Also, please add a citation for Eq. 2 (Tweedie-Miyasawa).
2. The paper claims improvements "across a range of metrics," yet PSNR and LPIPS are often worse than baselines. Improvements are mostly in perceptual quality only. Please temper the claims and discuss t
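For reference, the Tweedie-Miyasawa relation mentioned in this review is usually stated for additive Gaussian noise. The block below is a sketch in generic notation; the symbols $x$, $y$, $\sigma$, and $p_\sigma$ are assumptions chosen here and are not necessarily those of the paper's Eq. 2.

```latex
y = x + \sigma\,\varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I)
\;\;\Longrightarrow\;\;
\mathbb{E}[x \mid y] = y + \sigma^{2}\,\nabla_{y} \log p_{\sigma}(y),
\qquad
p_{\sigma}(y) = \int p(x)\,\mathcal{N}(y;\, x,\, \sigma^{2} I)\,dx .
```

The standard derivation relies on the Gaussian kernel identity $\nabla_y \mathcal{N}(y; x, \sigma^2 I) = -\frac{y - x}{\sigma^2}\,\mathcal{N}(y; x, \sigma^2 I)$, which has no direct counterpart for a uniform noise kernel. That gap is presumably why the reviewer asks for a formal justification or an approximation argument.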
Taxonomy
Topics: Advanced Data Compression Techniques
Methods: Diffusion
