Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment
Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang

TL;DR
This paper introduces grounding-IQA, a new paradigm for image quality assessment that combines detailed descriptions and local region analysis using multimodal grounding, supported by a large dataset and benchmark.
Contribution
It proposes the grounding-IQA framework, creates the GIQA-160K dataset, and develops GIQA-Bench for fine-grained image quality evaluation using multimodal grounding techniques.
Findings
Grounding-IQA improves fine-grained quality assessment accuracy.
The GIQA-160K dataset enables robust training and evaluation.
Experiments show enhanced performance over existing IQA methods.
Abstract
The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, **grounding-IQA**. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception, thereby extending existing IQA. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we…
Peer Reviews
Decision: ICLR 2026 Poster
The manuscript is clearly organized, with a strong logical flow and well-designed figures and tables that effectively illustrate the framework and experimental setup. The writing quality is generally high, facilitating easy comprehension of complex ideas. Furthermore, the benchmark construction is comprehensive and well-motivated — evaluating models from multiple dimensions (descriptive, interrogative, and spatial grounding) offers a holistic view of multimodal IQA performance.
While the contribution of a large-scale dataset and benchmark is substantial, the main novelty lies in dataset creation and annotation procedures rather than in methodological innovation. The proposed framework mainly adapts existing MLLM capabilities and fine-tuning strategies to the IQA domain. As such, the work may align more closely with a dataset or benchmark track than with the ICLR main track, where new learning methodologies or theoretical insights are typically emphasized.
1. The paper integrates grounding with IQA to localize the regions causing quality defects, making it highly actionable for editing and restoration.
2. Its end-to-end labeling/training pipeline (detection, IQA filtering, box merging, and coordinate–text fusion) is scalable and reproducible.
3. A large training corpus and a targeted benchmark with multi-faceted metrics (generation, QA, localization) enable comprehensive evaluation.
4. Consistent gains across multiple MLLM backbones, with ablations i
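The box-merging step named in this review can be illustrated with a minimal sketch. The excerpt does not describe the paper's actual merging rule, so everything here is an assumption: boxes are `(x1, y1, x2, y2)` tuples in pixels, two boxes are merged when their intersection-over-union (IoU) reaches a hypothetical threshold of 0.5, and a merged box is the smallest box enclosing both.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes, threshold=0.5):
    """Greedily merge overlapping boxes into their enclosing box.

    The IoU criterion and the 0.5 threshold are illustrative assumptions,
    not the rule used by the paper.
    """
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:
            changed = False
            for i, other in enumerate(merged):
                if iou(box, other) >= threshold:
                    # Replace the pair with the smallest box covering both,
                    # then re-check against the remaining merged boxes.
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    merged.pop(i)
                    changed = True
                    break
        merged.append(box)
    return merged
```

For example, `merge_boxes([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)])` combines the first two boxes into `(0, 0, 11, 11)` and leaves the third untouched.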
1. Fix typographical errors in the references.
2. Recent IQA methods such as VisualQuality-R1 should be included in the comparison.
3. Conduct user studies to evaluate how well the output descriptions align with the corresponding images.
1. Introducing multimodal grounding into IQA is a reasonable idea that aligns with human assessment logic. It also improves interpretability.
2. The dataset GIQA-160K and the benchmark GIQA-Bench are well-designed contributions to the community. The automated pipeline is well-structured, which supports further research.
3. Both quantitative and qualitative results validate the effectiveness of the proposed Grounding-IQA. Comparisons with various MLLMs (general, grounding, IQA) further confirm i
1. Although implementation details of the pipeline are provided, some settings lack explanation or analysis. For example, why are the discrete coordinates set to 20×20, and why does the number of tasks differ in GIQA-Bench?
2. The work uses Llama-3 for evaluation. Given that more advanced models such as Llama-4 exist, using them could provide more accurate evaluation results.
3. The impact of dataset scale on performance is not investigated. It is not certain whether 160K data is nece
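To make the 20×20 discrete-coordinate question concrete, the following minimal sketch shows one way a pixel bounding box could be mapped onto a 20×20 grid and back. Only the grid size comes from the review; the floor-and-clamp mapping, the function names `quantize_box` and `dequantize_box`, and the `(x1, y1, x2, y2)` box layout are assumptions for illustration, and the paper's actual encoding may differ.

```python
GRID = 20  # grid size cited in the review; the mapping below is an assumption


def quantize_box(box, width, height, grid=GRID):
    """Map a pixel box (x1, y1, x2, y2) to integer cell indices in [0, grid - 1]."""
    def to_cell(value, extent):
        cell = int(value / extent * grid)
        return min(max(cell, 0), grid - 1)  # clamp so the image edge stays in range

    x1, y1, x2, y2 = box
    return (to_cell(x1, width), to_cell(y1, height),
            to_cell(x2, width), to_cell(y2, height))


def dequantize_box(cells, width, height, grid=GRID):
    """Return the pixel box covering the given cells.

    The box spans from the left/top edge of the start cell to the
    right/bottom edge of the end cell, so it is never smaller than the
    region the cells represent.
    """
    cx1, cy1, cx2, cy2 = cells
    return (cx1 * width / grid, cy1 * height / grid,
            (cx2 + 1) * width / grid, (cy2 + 1) * height / grid)
```

On a 640×480 image each cell is 32×24 pixels, so localization error from this kind of quantization is bounded by one cell per side. A coarser grid shortens the coordinate vocabulary the model must learn, while a finer grid tightens the boxes, which is the trade-off the reviewer asks the authors to justify.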
Taxonomy
Topics: Computational and Text Analysis Methods
