QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding

Linhan Cao; Wei Sun; Weixia Zhang; Xiangyang Zhu; Kaiwei Zhang; Jun Jia; Dandan Zhu; Guangtao Zhai; Xiongkuo Min

arXiv:2601.18195 · cs.CV · January 27, 2026

Open Access

TL;DR

QualiRAG is a training-free retrieval-augmented generation framework that uses large multimodal models for interpretable visual quality understanding.

Contribution

It proposes a novel, training-free RAG approach that dynamically generates auxiliary knowledge for visual quality assessment using large multimodal models.

Findings

01. Significant improvements over baseline models in visual quality understanding.

02. Competitive performance on visual quality comparison tasks.

03. Robust quality assessment without any task-specific training.

Abstract

Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding, a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests…
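The general idea of generating auxiliary knowledge on demand, rather than retrieving it from a static corpus, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Request` schema, the function names, the aspect labels, and the prompt wording are all hypothetical, and the LMM is represented by a plain text-in, text-out callable so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: the LMM is modeled as a text -> text callable. A real system
# would also pass the image or video; that part is omitted here.
LMM = Callable[[str], str]


@dataclass
class Request:
    """One structured request derived from the question (hypothetical schema)."""
    aspect: str
    prompt: str


def decompose(question: str, aspects: List[str]) -> List[Request]:
    """Turn one question into one request per aspect (illustrative only)."""
    return [
        Request(aspect=a, prompt=f"Describe the {a} relevant to: {question}")
        for a in aspects
    ]


def generate_knowledge(requests: List[Request], lmm: LMM) -> List[str]:
    """Ask the model to produce auxiliary knowledge for each request."""
    return [f"[{r.aspect}] {lmm(r.prompt)}" for r in requests]


def answer(question: str, aspects: List[str], lmm: LMM) -> str:
    """Generate auxiliary knowledge, then answer with it as context."""
    knowledge = generate_knowledge(decompose(question, aspects), lmm)
    context = "\n".join(knowledge)
    final_prompt = f"Auxiliary knowledge:\n{context}\n\nQuestion: {question}"
    return lmm(final_prompt)


if __name__ == "__main__":
    # Stub model for demonstration; replace with a real LMM client.
    def stub_lmm(prompt: str) -> str:
        return f"(model reply to a {len(prompt)}-character prompt)"

    # The aspect names below are placeholders, not taken from the paper.
    print(answer("Is the video blurry?", ["sharpness", "motion"], stub_lmm))
```

In this sketch the model is called once per request and once more for the final answer, so no training step or external corpus is involved; how QualiRAG actually structures its requests is not specified in the truncated abstract above.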

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Image and Video Quality Assessment