VOSR: A Vision-Only Generative Model for Image Super-Resolution

Rongyuan Wu; Lingchen Sun; Zhengqiang Zhang; Xiangtao Kong; Jixin Zhao; Shihao Wang; Lei Zhang

arXiv:2604.03225 · cs.CV · April 6, 2026

TL;DR

VOSR is a vision-only generative model for image super-resolution that rivals methods built on text-to-image diffusion models, achieving high quality at lower training cost and without multimodal pretraining.

Contribution

The paper presents VOSR, a super-resolution model trained purely on visual data that matches or exceeds T2I-based methods in perceptual quality at a lower training cost, and introduces a restoration-oriented guidance strategy.

Findings

01. VOSR achieves perceptual quality competitive with or better than that of T2I-based methods.

02. VOSR requires less than one-tenth of the training cost of comparable T2I-based methods.

03. VOSR produces more faithful structures with fewer hallucinations.

Abstract

Most recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, even though SR is fundamentally an image restoration task conditioned on a low-resolution (LR) input. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder and use them as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance…
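For readers unfamiliar with the mechanism the abstract revisits, the following is a minimal sketch of the standard classifier-free guidance combination, in which a conditional prediction is extrapolated away from a reference prediction. It is not the authors' implementation. The function name `guided_prediction` and the toy values are hypothetical, and the reference branch is left as a generic argument because this excerpt does not describe what VOSR substitutes for the unconditional branch.

```python
def guided_prediction(cond_pred, ref_pred, scale):
    """Combine two model outputs with the standard classifier-free guidance rule.

    cond_pred: model output given the full conditioning (for SR, the LR input).
    ref_pred:  model output from the reference branch. In standard
               classifier-free guidance this is the unconditional branch.
               What VOSR uses instead is NOT specified in this excerpt.
    scale:     guidance scale; 1.0 returns cond_pred, 0.0 returns ref_pred.
    """
    # result = ref + scale * (cond - ref), applied element-wise
    return [r + scale * (c - r) for c, r in zip(cond_pred, ref_pred)]


# Toy values, purely illustrative (not taken from the paper).
cond = [1.0, 2.0]
ref = [0.0, 1.0]
print(guided_prediction(cond, ref, 1.0))  # same as cond
print(guided_prediction(cond, ref, 2.0))  # pushed further from ref
```

A scale above 1.0 amplifies whatever distinguishes the conditional branch from the reference branch, so the choice of reference branch determines what the guidance emphasizes.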

Code & Models

Repositories

cswry/VOSR (GitHub)
