RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

Junbo Qiao; Miaomiao Cai; Wei Li; Xudong Huang; Jie Hu; Xinghao Chen; Shaohui Lin; Hongkai Xiong

arXiv:2506.16796·cs.CV·April 14, 2026

TL;DR

This paper introduces RealSR-R1, a novel approach for real-world image super-resolution that combines vision-language reasoning and reinforcement learning to improve detail restoration and content understanding.

Contribution

It proposes the VLCoT framework with Group Relative Policy Optimization, integrating reasoning and reward mechanisms for enhanced super-resolution performance.
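As a rough illustration of the "group relative" idea in Group Relative Policy Optimization, the sketch below scores several candidate outputs for the same input and normalizes each reward against the group's mean and standard deviation. It covers only the generic advantage step, not the paper's reward design or training loop. The function name, the `eps` constant, the choice of population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar per candidate sampled for the same input.
    `eps` (an assumed constant) avoids division by zero when all
    candidates receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate restorations of one degraded image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Candidates above the group mean get positive advantages, those below get
# negative ones, so no separate learned value function is needed as a baseline.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the baseline is the group's own mean, a candidate is only rewarded for being better than its siblings, which is why identical rewards across a group yield zero advantage for every candidate.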

Findings

01 RealSR-R1 generates more realistic details in degraded images.

02 The method improves understanding of complex and semantically rich scenes.

03 Experimental results show superior performance over existing approaches.

Abstract

Real-World Image Super-Resolution (RealSR) is one of the most challenging tasks in image restoration. However, existing methods struggle to understand degraded image content accurately, leading to reconstructed results that are both low-fidelity and unnatural. In this work we present RealSR-R1, which equips RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative…
