Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment
Runze Hu, Zihao Huang, Xudong Li, Bohan Fu, Yan Zhang, Sicheng Zhao

TL;DR
This paper introduces CSFIQA, a novel multi-scale blind image quality assessment framework that uses selective attention and contrastive learning to better mimic human perception and improve accuracy across diverse datasets.
Contribution
The paper proposes a new BIQA method combining selective attention and scale contrastive learning, effectively filtering redundant information and distinguishing quality variations across scales.
Findings
Achieves up to 8.8% SRCC improvement on real-world distortions (reported on LIVEFB).
Outperforms state-of-the-art methods on seven datasets.
Demonstrates better alignment with human visual perception.
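For readers unfamiliar with the metric in the findings above: SRCC (Spearman rank correlation coefficient) measures how well the ordering of predicted quality scores matches the ordering of human mean opinion scores (MOS). The following is a minimal NumPy sketch of the standard definition, not code from the paper; the function names `average_ranks` and `srcc` are illustrative. The summary does not say whether the 8.8% figure is an absolute or a relative gain.

```python
import numpy as np


def average_ranks(x):
    """Return 1-based ranks of x, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    # Replace the ranks of tied values with their mean.
    for value in np.unique(x):
        tie = x == value
        ranks[tie] = ranks[tie].mean()
    return ranks


def srcc(pred, mos):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(average_ranks(pred), average_ranks(mos))[0, 1])


# Illustrative values only (not results from the paper).
pred = [0.31, 0.58, 0.44, 0.90]
mos = [35.0, 62.0, 40.0, 88.0]
print(srcc(pred, mos))
```

Because the predicted scores and the MOS values in this toy example are ordered identically, the printed value is close to 1.0. SRCC depends only on rank order, so any monotonic rescaling of the predictions leaves it unchanged.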
Abstract
Human visual perception naturally evaluates image quality across multiple scales, a hierarchical process that existing blind image quality assessment (BIQA) algorithms struggle to replicate effectively. This limitation stems from a fundamental misunderstanding: current multi-scale approaches fail to recognize that quality perception varies dramatically between scales; what appears degraded when viewed closely may look acceptable from a distance. This inconsistency not only creates misleading "visual illusions" during feature fusion but also introduces substantial redundant information that dilutes quality-critical features and leads to imprecise assessments. Our CSFIQA framework advances multi-scale BIQA via two key innovations: (1) a selective focus attention mechanism that mimics human visual attention by filtering out redundant cross-scale information that would otherwise mask…
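One way to picture attention that discards redundant cross-scale information is top-k sparse attention, where each query keeps only its k highest-scoring keys and every other key receives zero weight. The sketch below is that generic technique in NumPy, offered as an illustration only; the abstract is truncated before it describes the mechanism, so nothing here should be read as CSFIQA's actual design. The function name `topk_selective_attention` and the parameter `k` are hypothetical.

```python
import numpy as np


def topk_selective_attention(query, keys, values, k):
    """Scaled dot-product attention restricted to the top-k keys per query.

    query: (n_q, d), keys: (n_k, d), values: (n_k, d_v).
    Returns the attended output (n_q, d_v) and the weights (n_q, n_k).
    """
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    # The k-th largest score in each row acts as the selection threshold.
    # If several scores tie at the threshold, all of them are kept.
    kth = np.sort(scores, axis=1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving scores; exp(-inf) evaluates to 0.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


# Toy example: one query token attending over four tokens from other scales.
q = np.array([[1.0, 0.0]])
keys = np.array([[3.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
vals = np.eye(4)
out, w = topk_selective_attention(q, keys, vals, k=2)
print(w)
```

With `k=2`, only the two keys most similar to the query receive non-zero weight; the other two are dropped entirely rather than merely down-weighted, which is the difference from plain softmax attention.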
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Strengths
- The paper clearly identifies specific challenges of "visual illusions" and "information dilution" that plague traditional multi-scale BIQA approaches.
- The Scale Contrastive Learning (SCL) framework, utilizing MOS similarity to select positive and negative pairs, and the Noise Sample Matching (NSM) mechanism, which targets regional quality variations, represent a novel and effective strategy for mitigating scale-dependent perceptual distortions.
- The proposed CSFIQA achieves SOTA r
Weaknesses
- One of the main motivations, "visual illusions" (the phenomenon where quality perception changes with scale), is not directly reflected in the classification component of SCL. Instead, SCL merely uses MOS to classify positive and negative pairs. Although NSM is designed to address this issue, the SCL component itself appears to adopt a method with the same limitations (regarding visual illusions) that the paper's motivation criticizes in existing approaches.
- NSM relies on the stro
+ Recognizes and formalizes the long-neglected problem of scale-dependent perceptual variation, offering a new perspective for BIQA.
+ Provides GradCAM visualizations illustrating attention improvements over baselines.
+ Covers eight datasets (synthetic and authentic), with ablations on all modules and hyperparameters.

- The conceptual definition of "visual illusion" and its mathematical mapping to the contrastive loss is vague. It is unclear how the “illusion” manifests quantitatively and why contrastive learning inherently solves it.
- The SCL formulation (Eq.1–3) lacks theoretical justification. Why should MOS-based pairwise distances define positive/negative relations? Are these thresholds robust across datasets?
- Whether the cross-dataset results (Tab. 2) are trained on synthetic → authentic or vice vers
1. The performance is good.
2. The figures are well drawn.

1. The paper should better illustrate how the attention mechanism identifies redundant vs. informative cross-scale features.
2. The contrastive learning process needs clearer formulation. How to define positive or negative pairs.
3. Multi-scale models can be heavy. The paper should report: parameters, FLOPs, inference speed, relative to baselines. This is essential for real-time/embedded use cases.
1) The paper evaluates across seven standard datasets (synthetic and authentic distortions), performs cross-dataset tests, and includes ablations on hyperparameters (λ, [α, β], τ). This extensive coverage indicates careful empirical effort.
2) The reported +8.8% SRCC on LIVEFB and solid results on KonIQ-10k and LIVEC suggest that the proposed modules capture some useful cross-scale cues, validating the importance of scale-aware modeling.
3) The inclusion of detailed module-wise ablations (SCL

1) The combination of contrastive learning, multi-scale encoding, and selective attention follows directly from existing BIQA pipelines (CONTRIQUE, Re-IQA, MUSIQ, LoDa). The work lacks a novel theoretical formulation or perceptual model explaining why the specific design choices improve human alignment.
2) Terms like “visual illusion” and “information dilution” are used as rhetorical devices rather than measured phenomena. There is no statistical evidence that cross-scale fusion actually causes
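Several of the reviews above ask how positive and negative pairs are defined from MOS. As a rough illustration of the general idea, and not a reconstruction of the paper's Eq. 1–3, the sketch below treats two images as a positive pair when their MOS values are close and as a negative pair when they are far apart, then applies an InfoNCE-style loss. The names `mos_pair_masks` and `mos_contrastive_loss`, the thresholds `pos_thresh` and `neg_thresh`, and the use of `tau` as a softmax temperature are all assumptions made for this example.

```python
import numpy as np


def mos_pair_masks(mos, pos_thresh, neg_thresh):
    """Boolean masks of positive and negative pairs from MOS differences.

    Assumes pos_thresh < neg_thresh; pairs in between are ignored.
    """
    mos = np.asarray(mos, dtype=float)
    diff = np.abs(mos[:, None] - mos[None, :])
    not_self = ~np.eye(len(mos), dtype=bool)
    pos = (diff <= pos_thresh) & not_self   # similar quality
    neg = diff >= neg_thresh                # clearly different quality
    return pos, neg


def mos_contrastive_loss(features, mos, pos_thresh, neg_thresh, tau=0.1):
    """InfoNCE-style loss: pull positives together, push negatives apart."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / tau
    # Subtracting a shared constant leaves the ratio below unchanged
    # and keeps the exponentials numerically stable.
    exp_sim = np.exp(sim - sim.max())
    pos, neg = mos_pair_masks(mos, pos_thresh, neg_thresh)
    losses = []
    for i in range(len(z)):
        # Skip anchors that lack either a positive or a negative partner.
        if pos[i].any() and neg[i].any():
            p = exp_sim[i, pos[i]].sum()
            n = exp_sim[i, neg[i]].sum()
            losses.append(-np.log(p / (p + n)))
    return float(np.mean(losses)) if losses else 0.0


# Toy batch: two low-quality and two high-quality images (made-up MOS values).
mos = [10.0, 12.0, 80.0, 82.0]
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(mos_contrastive_loss(aligned, mos, 5.0, 30.0))
print(mos_contrastive_loss(mixed, mos, 5.0, 30.0))
```

Features that cluster by quality (`aligned`) produce a much lower loss than features that mix quality levels (`mixed`). The reviewers' question about threshold robustness applies directly to a scheme like this one: MOS ranges differ across datasets, so fixed thresholds would need rescaling or per-dataset tuning.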
Taxonomy
Topics: Image Processing Techniques and Applications
Methods: Softmax · Attention Is All You Need · Contrastive Learning · Focus · ALIGN
