Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
Xudong Li, Jiaxi Tan, Ziyin Zhou, Yan Zhong, Zihao Huang, Jingyuan Zheng, Yan Zhang, Xiawu Zheng, Rongrong Ji

TL;DR
Q-DeepSight introduces a multimodal, human-like reasoning framework for image quality assessment that provides localized feedback and guides image refinement, outperforming existing methods.
Contribution
The paper proposes a think-with-image framework that interleaves reasoning with evidence acquisition and is trained via reinforcement learning, aiming at more reliable and actionable image quality assessment (IQA).
Findings
Achieves state-of-the-art results on multiple benchmarks.
Provides effective guidance for image refinement through diagnosis.
Demonstrates practical value with a training-free image enhancement framework.
Abstract
Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient…
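The abstract names crop-and-zoom as an example of tool-augmented evidence acquisition. The sketch below illustrates what such a tool could look like in its simplest form. It is not the authors' implementation: the function name `crop_and_zoom`, the `(left, top, right, bottom)` box convention, the nested-list grayscale image type, and nearest-neighbor upscaling are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical image type: a grayscale image as rows of pixel intensities.
Image = List[List[int]]


def crop_and_zoom(image: Image, box: Tuple[int, int, int, int], scale: int = 2) -> Image:
    """Crop a region and enlarge it by an integer factor (nearest neighbor).

    box is (left, top, right, bottom) with right/bottom exclusive; this
    convention is an assumption of the sketch, not taken from the paper.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")

    height = len(image)
    width = len(image[0]) if height else 0

    # Clamp the requested box to the image bounds.
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("empty crop region")

    # Crop: keep only the rows and columns inside the box.
    cropped = [row[left:right] for row in image[top:bottom]]

    # Zoom: repeat every pixel `scale` times horizontally and vertically.
    zoomed: Image = []
    for row in cropped:
        wide_row = [pixel for pixel in row for _ in range(scale)]
        zoomed.extend(list(wide_row) for _ in range(scale))
    return zoomed


# Usage: inspect the central 2x2 region of a 4x4 image at 2x magnification.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = crop_and_zoom(image, (1, 1, 3, 3), scale=2)
```

In a think-with-image loop, a model would presumably choose the box itself and receive the enlarged patch back as a new visual input before continuing its reasoning; how Q-DeepSight selects regions and consumes the result is not specified in the text above.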
