Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment
Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida

TL;DR
This paper identifies issues in vision-language models for blind image quality assessment, such as prediction instability and weak grounding, and proposes a two-stage tuning method to improve stability and human-like reasoning.
Contribution
The paper introduces a two-stage tuning approach that separates visual perception from quality inference in VLMs, enhancing stability and interpretability in BIQA tasks.
Findings
Reduces prediction instability from 22.00% to 12.39%.
Achieves significant SRCC/PLCC improvements across multiple datasets.
Enhances the reliability and human-likeness of model reasoning.
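For readers unfamiliar with the metrics named above, the following is a minimal sketch of how SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) are commonly computed between predicted scores and human mean opinion scores, along with the relative size of the reported instability reduction. The score lists `mos` and `pred` are hypothetical illustration data, not results from the paper, and the helper names are assumptions made for this example; the paper's own evaluation code may differ.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data for illustration only (not from the paper).
mos = [1.2, 2.5, 3.1, 4.0, 4.8]    # human mean opinion scores
pred = [1.0, 2.9, 2.7, 4.2, 4.5]   # model-predicted quality scores

plcc = pearson(pred, mos)
srcc = spearman(pred, mos)

# Relative reduction implied by the reported instability figures.
relative_reduction = (22.00 - 12.39) / 22.00

print(f"PLCC = {plcc:.3f}, SRCC = {srcc:.3f}")
print(f"Relative instability reduction = {relative_reduction:.1%}")
```

The reported drop from 22.00% to 12.39% is 9.61 percentage points, which corresponds to a relative reduction of roughly 44%.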
Abstract
Recent progress in blind image quality assessment (BIQA) has been driven by vision-language models (VLMs), whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference, behaviors that are not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Image and Video Quality Assessment · Visual Attention and Saliency Detection
