ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
Feng Ding, Haisheng Fu, Jie Liang, Qihan Xu, Siyu Zhu, Jingning Han

TL;DR
This paper introduces ML-CLIPSim, a new differentiable image quality metric designed for machine-centric evaluation, which outperforms traditional metrics in aligning with machine preferences and enhances downstream task performance.
Contribution
The paper proposes ML-CLIPSim, a novel multi-layer CLIP-based similarity metric for machine-oriented image quality assessment, and constructs PCMP, a dataset of PSNR-matched distortion pairs labeled by predictive-consistency votes from multiple pretrained models.
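As the abstract describes it, PCMP labels pairs of distorted images by consistency votes from several pretrained models. The sketch below shows one way such a vote could be tallied. It is an illustration, not the authors' labeling procedure: the names `consistency_vote` and `label_pair`, the use of top-1 label agreement as the consistency test, and the abstain-on-tie rule are all assumptions.

```python
from collections import Counter

def consistency_vote(ref_pred, pred_a, pred_b):
    """One model's vote between distorted images A and B (hypothetical rule).

    A model prefers the image whose prediction matches its prediction on the
    reference image. If both match or both differ, the model abstains.
    """
    a_ok = pred_a == ref_pred
    b_ok = pred_b == ref_pred
    if a_ok == b_ok:
        return None  # no preference from this model
    return "A" if a_ok else "B"

def label_pair(model_preds):
    """Fraction of non-abstaining models that prefer A, or None if all abstain.

    model_preds: list of (ref_pred, pred_a, pred_b) tuples, one per model.
    """
    raw_votes = (consistency_vote(*preds) for preds in model_preds)
    votes = Counter(v for v in raw_votes if v is not None)
    total = votes["A"] + votes["B"]
    if total == 0:
        return None
    return votes["A"] / total

# Four hypothetical models: two prefer A, one prefers B, one abstains.
preds = [
    ("cat", "cat", "dog"),
    ("cat", "cat", "cat"),
    ("cat", "dog", "cat"),
    ("cat", "cat", "dog"),
]
preference_for_a = label_pair(preds)
```

A soft label like this keeps the strength of agreement among models. Thresholding it at 0.5 would give a hard pairwise preference instead.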
Findings
ML-CLIPSim aligns better with machine preferences than traditional metrics.
Using ML-CLIPSim as a loss term in learned image compression improves rate-task trade-offs.
ML-CLIPSim remains competitive for human quality prediction.
Abstract
We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality…
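The abstract describes the metric as aggregating intermediate patch-token similarities together with a global image-embedding similarity. A minimal sketch of that aggregation follows, under stated assumptions: features are taken as already extracted from a frozen encoder and passed in as plain Python lists, cosine similarity is used for both terms, layers are averaged uniformly, and `global_weight` is a hypothetical mixing parameter. None of these choices are confirmed by the source, and this pure-Python version is not differentiable. An actual implementation would run in an autodiff framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def layer_patch_similarity(ref_tokens, dist_tokens):
    """Mean cosine similarity over spatially aligned patch tokens of one layer."""
    sims = [cosine(r, d) for r, d in zip(ref_tokens, dist_tokens)]
    return sum(sims) / len(sims)

def multi_layer_similarity(ref_layers, dist_layers, ref_global, dist_global,
                           global_weight=0.5):
    """Blend multi-layer patch similarity with global embedding similarity.

    ref_layers / dist_layers: one entry per chosen layer, each a list of
    patch-token vectors. Uniform layer averaging and global_weight are
    assumptions made for this sketch.
    """
    per_layer = [layer_patch_similarity(r, d)
                 for r, d in zip(ref_layers, dist_layers)]
    patch_term = sum(per_layer) / len(per_layer)
    global_term = cosine(ref_global, dist_global)
    return (1.0 - global_weight) * patch_term + global_weight * global_term

# Toy features: one layer, two patch tokens of dimension 2.
ref_layers = [[[1.0, 0.0], [0.0, 1.0]]]
swapped_layers = [[[0.0, 1.0], [1.0, 0.0]]]  # every patch orthogonal to its reference
global_vec = [1.0, 2.0]

same_score = multi_layer_similarity(ref_layers, ref_layers, global_vec, global_vec)
mixed_score = multi_layer_similarity(ref_layers, swapped_layers, global_vec, global_vec)
```

In the toy example, identical features score close to 1, while patch tokens orthogonal to the reference pull the score down even when the global embedding is unchanged. That is why the sketch keeps both terms.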
