TL;DR
GramSR is a one-step diffusion-based super-resolution (SR) framework that replaces text conditioning with dense visual features extracted from the low-resolution input. The goal is more faithful restoration and better texture preservation in single-image SR.
Contribution
A one-step diffusion SR method conditioned on dense visual features from a pre-trained DINOv3 encoder, combined with a three-stage LoRA architecture (pixel-level, semantic-level, and texture-level) for improved detail and texture recovery.
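The summary does not say how the three LoRA modules are combined. As an illustration only, the sketch below assumes that each level adds its own low-rank update B·A to a frozen base weight and that each update can be scaled independently at inference. This is one common way to expose flexible inference-time control. The class `LoRALinear`, the `stage_scales` argument, and the adapter names are hypothetical and do not come from the paper's code. The paper's "stages" may also refer to sequential training phases rather than adapters that are active together.

```python
from dataclasses import dataclass, field

Matrix = list[list[float]]  # row-major
Vector = list[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass
class LoRALinear:
    """Frozen linear layer plus several named low-rank adapters (hypothetical layout).

    Output: y = W x + sum_k s_k * (alpha / r_k) * B_k (A_k x)
    """
    weight: Matrix  # frozen base weight W, shape (out, in)
    # name -> (A, B) with A: (r, in) and B: (out, r)
    adapters: dict[str, tuple[Matrix, Matrix]] = field(default_factory=dict)
    alpha: float = 1.0  # LoRA scaling numerator

    def forward(self, x: Vector, stage_scales: dict[str, float]) -> Vector:
        y = matvec(self.weight, x)
        for name, (a, b) in self.adapters.items():
            s = stage_scales.get(name, 0.0)  # adapters missing from stage_scales stay off
            if s == 0.0:
                continue
            rank = len(a)
            delta = matvec(b, matvec(a, x))  # B (A x)
            y = [yi + s * (self.alpha / rank) * di for yi, di in zip(y, delta)]
        return y


# Toy 2-d example with one rank-1 adapter per level (values chosen for easy arithmetic).
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    adapters={
        "pixel": ([[1.0, 0.0]], [[1.0], [0.0]]),
        "semantic": ([[0.0, 1.0]], [[0.0], [1.0]]),
        "texture": ([[1.0, 1.0]], [[1.0], [1.0]]),
    },
)
x = [2.0, 3.0]
print(layer.forward(x, {"pixel": 1.0, "semantic": 1.0, "texture": 1.0}))  # [9.0, 11.0]
print(layer.forward(x, {"pixel": 1.0, "semantic": 1.0, "texture": 0.0}))  # [4.0, 6.0]
```

Setting a level's scale to 0 removes its contribution, and intermediate values interpolate between settings. This makes per-adapter scaling a convenient control knob. The text above does not state whether GramSR exposes exactly this knob.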
Findings
Outperforms existing diffusion-based SR methods on standard benchmarks.
Achieves better structural fidelity and texture realism.
Provides flexible control over different restoration aspects during inference.
Abstract
Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA…
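The sketch below makes the contrast between text and visual conditioning concrete. It shows single-head cross-attention where the conditioning tokens are a flattened grid of per-patch image features rather than caption embeddings, so each latent token can attend to spatially specific content. It is simplified under stated assumptions:

- The feature grid is a hypothetical stand-in for DINOv3 outputs; no encoder is called.
- Learned query/key/value projections are omitted, so queries must share the feature dimension.
- The helper names are illustrative, not the paper's API.

```python
import math

Vector = list[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_tokens(feature_map: list[list[Vector]]) -> list[Vector]:
    """Flatten an H x W grid of per-patch features into H*W tokens (row-major order)."""
    return [feat for row in feature_map for feat in row]


def cross_attention(queries: list[Vector], context: list[Vector]) -> list[Vector]:
    """Single-head scaled dot-product cross-attention.

    queries: one vector per latent token of the denoiser (same dimension as context here).
    context: conditioning tokens, used directly as keys and values
             (learned Q/K/V projections are omitted in this sketch).
    """
    d = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Each output is a convex combination of the context vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return outputs


# Hypothetical 2 x 2 grid of 2-d patch features standing in for encoder outputs on the LR image.
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, 0.5]],
]
context = patch_tokens(feature_map)  # 4 spatially indexed tokens
latents = [[4.0, 0.0], [0.0, 4.0]]   # two toy latent queries
for out in cross_attention(latents, context):
    print([round(v, 3) for v in out])
```

A text encoder typically yields a fixed-length sequence describing the whole image. The patch grid instead keeps one token per spatial location of the input, which is the spatial alignment the abstract argues captions lack.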
