A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang

TL;DR
This paper systematically evaluates multimodal large language models for image quality assessment, revealing GPT-4V's limited ability to discriminate fine details and compare multiple images, despite its overall reasonable performance.
Contribution
It provides a comprehensive analysis of prompting strategies for MLLMs in IQA and introduces a difficult-sample selection procedure, based on sample diversity and uncertainty, to further challenge their capabilities.
Findings
GPT-4V aligns with human perception but struggles with fine-grained differences.
MLLMs are less effective in multi-image comparison tasks.
Prompting strategies significantly influence MLLMs' IQA performance.
Abstract
While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting…
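The nine prompting systems described in the abstract are the cross product of three psychophysical testing procedures and three prompting strategies. The following is a minimal sketch that only enumerates that grid. The procedure and strategy names come from the abstract; the function name `prompting_systems` and the label format are illustrative assumptions, not the authors' code.

```python
from itertools import product

# Names taken from the abstract; everything else here is a hypothetical illustration.
PROCEDURES = ["single-stimulus", "double-stimulus", "multiple-stimulus"]
STRATEGIES = ["standard", "in-context", "chain-of-thought"]


def prompting_systems():
    """Return one label per (procedure, strategy) pair: 3 x 3 = 9 systems."""
    return [f"{procedure} + {strategy}"
            for procedure, strategy in product(PROCEDURES, STRATEGIES)]


systems = prompting_systems()
for label in systems:
    print(label)
```

Each label stands for one experimental condition, for example a double-stimulus comparison posed with chain-of-thought prompting. How each condition is turned into an actual prompt is not specified in this summary.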
Taxonomy
Topics: Advanced Image Fusion Techniques
