R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Zhuangzi Li; Jian Jin; Shilv Cai; Weisi Lin

arXiv:2603.10578 · cs.CV · March 12, 2026

Open Access

TL;DR

This paper introduces R4-CGQA, a retrieval-augmented vision language model framework that improves computer graphics image quality assessment. It builds on a new dataset and question-answer benchmarks to address two gaps in existing work: CG datasets lack systematic quality descriptions, and existing assessment methods cannot explain their judgments in text.

Contribution

The paper constructs a new dataset with quality descriptions for CG images and develops a retrieval-based framework to enhance VLMs' ability to assess CG quality accurately.

Findings

01. Current VLMs struggle with fine-grained CG quality judgment.

02. Retrieval-augmented generation significantly improves assessment performance.

03. Descriptions of similar images boost VLM understanding of CG quality.
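
The retrieval idea behind the findings above can be illustrated with a minimal sketch. It is not the paper's implementation: the embedding vectors, the `Entry` type, the cosine-similarity ranking, the prompt wording, and the toy descriptions are all assumptions made for illustration. The sketch only shows the general pattern of looking up quality descriptions of similar images and placing them in the context given to a VLM.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """One database item (hypothetical layout, not from the paper)."""
    embedding: list[float]  # image feature vector from some encoder (assumed)
    description: str        # quality description attached to that image


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: list[float], database: list[Entry], k: int = 2) -> list[Entry]:
    """Return the k entries whose embeddings are most similar to the query."""
    ranked = sorted(database, key=lambda e: cosine(query, e.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question: str, neighbors: list[Entry]) -> str:
    """Prepend retrieved descriptions to the question as textual context."""
    lines = ["Quality descriptions of similar CG images:"]
    for i, entry in enumerate(neighbors, start=1):
        lines.append(f"{i}. {entry.description}")
    lines.append(f"Question about the query image: {question}")
    return "\n".join(lines)


# Toy data, invented purely for illustration.
database = [
    Entry([1.0, 0.0], "Sharp textures, consistent lighting."),
    Entry([0.0, 1.0], "Visible aliasing on object edges."),
    Entry([0.9, 0.1], "Soft shadows, slightly blurred textures."),
]
query_embedding = [1.0, 0.0]

neighbors = retrieve(query_embedding, database, k=2)
prompt = build_prompt("How would you rate the texture quality?", neighbors)
print(prompt)
```

In this sketch the resulting prompt would be passed to a VLM together with the query image; which encoder produces the embeddings and how the model is called are left open because the listing does not specify them.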

Abstract

Immersive Computer Graphics (CG) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in…

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Image and Video Quality Assessment