Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset
Zhiyuan You, Jinjin Gu, Xin Cai, Zheyuan Li, Kaiwen Zhu, Chao Dong, Tianfan Xue

TL;DR
This paper introduces DepictQA-Wild, a multi-functional model for descriptive image quality assessment, together with the large-scale multi-modal dataset DQ-495K used to train it. The combination of a broader dataset and a multi-task framework is intended to improve performance across diverse real-world scenarios.
Contribution
The paper presents a new large-scale dataset, DQ-495K, and a multi-functional IQA model, DepictQA-Wild. Together they target two limitations of prior VLM-based methods: a narrow focus on specific sub-tasks or settings, and limited dataset coverage, scale, and quality.
Findings
DepictQA-Wild outperforms traditional and prior VLM-based IQA models.
The dataset enables better handling of resolution and quality issues.
The model is effective in real-world image assessment tasks.
Abstract
With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce the enhanced Depicted image Quality Assessment model (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data…
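The task paradigm named in the abstract varies along three axes: assessment versus comparison, brief versus detailed responses, and full-reference versus non-reference scenarios. The following is a minimal sketch of how those axes could be enumerated. All class and function names are hypothetical and do not come from the paper's code. Taking the full cross-product is an illustration only, since the abstract does not say whether every combination is used, and the image-count rule assumes the usual meaning of full-reference IQA (one extra reference image).

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Task(Enum):
    ASSESSMENT = "assessment"   # judge a single image
    COMPARISON = "comparison"   # judge a pair of images


class ResponseStyle(Enum):
    BRIEF = "brief"
    DETAILED = "detailed"


class ReferenceMode(Enum):
    FULL_REFERENCE = "full-reference"
    NON_REFERENCE = "non-reference"


@dataclass(frozen=True)
class TaskConfig:
    """One point in the three-axis task space (hypothetical structure)."""
    task: Task
    style: ResponseStyle
    reference: ReferenceMode

    def num_input_images(self) -> int:
        # Assumption: assessment takes one image, comparison takes two,
        # and the full-reference setting adds one reference image.
        n = 1 if self.task is Task.ASSESSMENT else 2
        if self.reference is ReferenceMode.FULL_REFERENCE:
            n += 1
        return n


# Cross-product of the three axes, for illustration only.
ALL_CONFIGS = [
    TaskConfig(t, s, r)
    for t, s, r in product(Task, ResponseStyle, ReferenceMode)
]

if __name__ == "__main__":
    for cfg in ALL_CONFIGS:
        print(cfg.task.value, cfg.style.value, cfg.reference.value,
              cfg.num_input_images())
```

A single model serving all of these settings would have to accept between one and three input images per query under this assumption, which is one way to read the "multi-functional" claim.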
Peer Reviews
Decision · Submitted to ICLR 2025
The introduction of the DQ-495K dataset represents a significant contribution by providing a large-scale, high-quality dataset for Image Quality Assessment (IQA). The paper discusses various IQA tasks within both full-reference and no-reference frameworks, including assessments of single images and comparisons between paired images. Additionally, the incorporation of confidence estimation for responses is an interesting enhancement that could potentially improve the reliability of the assessment
Although this paper contributes a comprehensive dataset and a multi-functional model for descriptive image quality assessment, the major weakness is the limited novelty and contribution, with some consideration that it is somewhat a data extension of the existing dataset and pipeline. Specifically, the pair-wise quality assessment and reasoning issue has been well addressed in the previous work Co-Instruct, while the performance gain of the proposed model basically comes from the augmented instr
Strengths
1. The use of VLMs for IQA is new and aligns with current trends in AI research.
2. The authors propose a comprehensive and large-scale dataset, EDQA-495K, which could be a valuable resource for the research community.
3. The paper presents a clear and well-organized structure.
1. More detailed information is needed regarding the testing of the OOD settings presented in Table 3. Additionally, some non-reference IQA models, such as LIQE, are capable of identifying distortion types as well. It is recommended that these methods be included in the comparison.
2. Table 4 should incorporate IQA datasets that feature realistic distortions, such as KonIQ-10k and SPAQ. Including these datasets will allow for a more robust evaluation of the model's generalization capabilities.
The authors have achieved multi-functional quality assessment through a general method, including quantitative indicators and qualitative language descriptions. The proposed method can adapt to a variety of practical application scenarios. The authors have achieved the purpose of multifunctional quality assessment based on VLM by reasonably designing data tags and prompts. The proposed multi-functional IQA task paradigm can provide support for subsequent work. The proposed model has achieved goo
As shown in the experimental results, the proposed method shows good multi-functional IQA performance. However, the proposed model is much larger than traditional IQA models, and training it requires more computing resources. In particular, for the classic quantitative quality assessment task, the cross-modal training adopted by the proposed model does not seem to achieve a significant improvement over the SOTA single-modal IQA model.
Taxonomy
Topics: Advanced Image Fusion Techniques
Methods: Focus · ALIGN
