Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce

TL;DR
This paper introduces UniSim-Bench, a comprehensive benchmark for multi-modal perceptual similarity, and explores fine-tuning vision-language models to develop a unified, generalizable perceptual metric that better aligns with human perception.
Contribution
The paper presents UniSim-Bench, a new benchmark for multi-modal perceptual similarity, and demonstrates that fine-tuning models on multiple tasks improves average performance but still faces challenges in generalization.
Findings
General-purpose models perform reasonably well but lag on specific tasks.
Task-specific models do not generalize well to unseen tasks.
Fine-tuned models achieve higher average performance and sometimes surpass task-specific models.
Abstract
Human perception of similarity across uni- and multimodal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related,…
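As a rough illustration of what using an embedding model as a zero-shot perceptual metric can look like, the sketch below scores a reference item against two candidates by cosine similarity and measures how often the choice agrees with a human label. The forced-choice triplet setup, the function names, and the toy vectors are all assumptions made for illustration; the summary above does not specify the benchmark's evaluation protocol, and in practice the embeddings would come from an encoder such as CLIP rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_closer(ref, cand_a, cand_b):
    """Return 0 if cand_a is judged closer to ref, otherwise 1."""
    sim_a = cosine_similarity(ref, cand_a)
    sim_b = cosine_similarity(ref, cand_b)
    return 0 if sim_a >= sim_b else 1


def agreement_with_humans(triplets, human_choices):
    """Fraction of triplets where the metric picks the same candidate as the human."""
    predictions = [pick_closer(ref, a, b) for ref, a, b in triplets]
    matches = sum(p == h for p, h in zip(predictions, human_choices))
    return matches / len(predictions)


# Toy embeddings (hypothetical): (reference, candidate A, candidate B).
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # candidate A is closer
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # candidate B is closer
]
human_choices = [0, 0]  # hypothetical labels; the second disagrees with the metric

score = agreement_with_humans(triplets, human_choices)
print(f"agreement with human choices: {score:.2f}")
```

A per-dataset agreement score of this kind can then be averaged across tasks, which is one simple way to compare general-purpose and task-specific models on equal footing.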
Taxonomy
Topics: Color Science and Applications · Industrial Vision Systems and Defect Detection · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training · ALIGN
