Eliminating VAE for Fast and High-Resolution Generative Detail Restoration
Yan Wang, Shijie Zhao, Junlin Li, Li Zhang

TL;DR
This paper introduces GenDR-Pix, a method that eliminates the VAE in diffusion models for super-resolution, achieving faster, high-resolution image restoration with reduced memory and computational costs.
Contribution
It proposes removing the VAE using pixel-shuffle operations and multi-stage adversarial distillation, enabling efficient 4K image super-resolution in one second.
Findings
2.8x acceleration over previous methods
60% memory reduction
Restores 4K images in 1 second
Abstract
Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit with slow inference and high demands on devices. To accelerate inference, recent works like GenDR adopt step distillation to reduce the number of steps to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. By profiling the pipeline, we pinpoint the variational auto-encoder (VAE) as the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, converting the latent-based GenDR to the pixel-space GenDR-Pix. However, upscaling with ×8 pixel-shuffle may induce repeated-pattern artifacts. To alleviate the distortion, we propose multi-stage adversarial distillation to progressively remove the encoder and decoder.…
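To make the pixel-(un)shuffle idea in the abstract concrete, here is a minimal NumPy sketch of the two operations at the ×8 factor the abstract mentions. It is an illustration only, not the authors' implementation: the function names `pixel_unshuffle` and `pixel_shuffle`, the (C, H, W) array layout, and the channel ordering (chosen to match the convention of PyTorch's `PixelUnshuffle`/`PixelShuffle`) are assumptions. The point it shows is that both operations are parameter-free rearrangements, so an 8× spatial reduction costs only a reshape, and the round trip is lossless.

```python
import numpy as np


def pixel_unshuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Fold each r x r spatial block into channels: (C, H, W) -> (C*r*r, H//r, W//r)."""
    c, h, w = x.shape
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    x = x.reshape(c, h // r, r, w // r, r)   # split H and W into (block index, offset)
    x = x.transpose(0, 2, 4, 1, 3)           # move the two offsets next to the channel axis
    return x.reshape(c * r * r, h // r, w // r)


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Inverse rearrangement: (C*r*r, H, W) -> (C, H*r, W*r)."""
    crr, h, w = x.shape
    if crr % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)             # recover the row and column offsets
    x = x.transpose(0, 3, 1, 4, 2)           # interleave offsets back into H and W
    return x.reshape(c, h * r, w * r)


# Hypothetical 3-channel 64x64 input, used only for illustration.
image = np.arange(3 * 64 * 64, dtype=np.float32).reshape(3, 64, 64)
packed = pixel_unshuffle(image, 8)           # stands in for the removed encoder's downsampling
restored = pixel_shuffle(packed, 8)          # stands in for the removed decoder's upsampling

print(packed.shape)                          # (192, 8, 8)
print(np.array_equal(image, restored))       # True
```

Because each output pixel in an 8×8 block comes from a separate channel of a single low-resolution position, any imbalance between those channels repeats on an 8-pixel grid, which is one plausible reading of the repeated-pattern artifacts the abstract describes.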
Peer Reviews
Decision: ICLR 2026 Poster
- Clear motivation and simplification of architecture: The paper explores removing the VAE encoder and decoder in one-step super-resolution, which reduces system complexity and avoids reliance on latent-space operations.
- Practical efficiency improvements: By operating entirely in pixel space, the method achieves notable gains in inference speed and memory efficiency, as supported by quantitative results.
- Targeted technical solutions: The paper introduces specific techniques (e.g., MFS l
none
Overall, this paper focuses on a meaningful topic in improving the efficiency of the diffusion-based restoration process. The whole idea is easy to understand and seems to be effective. The authors identify the VAE as the primary bottleneck in both latency and memory usage of the diffusion-based SR models. The authors aim to eliminate the VAE to improve efficiency without significantly compromising visual quality. The experimental results show reductions in memory and time costs, making the prop
I appreciate the authors’ efforts to design a more effective diffusion quantization method. Here, I summarize my major concerns and questions in three parts.
1. The authors propose to replace the VAE with pixel-unshuffle and shuffle operations, so the diffusion procedure is from latent space to traditional pixel space. However, it is unclear about the effectiveness of this strategy. The authors could give a deeper theoretical justification for why latent-space diffusion can be replaced by pixe
1. The idea of achieving efficient high-resolution image restoration by removing the VAE is interesting and promising.
2. The paper is well-written.

1. Baselines are VAE latent-space-based models, and forcing the removal of the VAE to operate in pixel space introduces certain risks. As shown in Tables 2 and 3, the proposed method does not achieve the best results. The VAE removal strategy requires more thorough analysis and theoretical justification, which is currently lacking in the paper. The authors do not explain the underlying motivation beyond efficiency, nor do they clarify why this approach works in practice.
2. If the goal is to red
Taxonomy
Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment
