Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

Sara Ghazanfari; Siddharth Garg; Nicolas Flammarion; Prashanth Krishnamurthy; Farshad Khorrami; Francesco Croce

arXiv:2412.10594 · cs.CV · December 17, 2024

TL;DR

This paper introduces UniSim-Bench, a comprehensive benchmark for multi-modal perceptual similarity, and explores fine-tuning vision-language models to develop a unified, generalizable perceptual metric that better aligns with human perception.

Contribution

The paper presents UniSim-Bench, a new benchmark for multi-modal perceptual similarity, and shows that fine-tuning models on multiple tasks improves average performance, although the resulting models still struggle to generalize.

Findings

01. General-purpose models perform reasonably well on average but often lag behind specialized models on individual tasks.

02. Task-specific models do not generalize well to unseen tasks.

03. Models fine-tuned on multiple tasks achieve higher average performance and sometimes surpass task-specific models.

Abstract

Human perception of similarity across uni- and multimodal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General-purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related,…
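To make the idea of a zero-shot perceptual metric concrete, here is a minimal sketch of one common way an embedding model can be scored against human judgments: given a reference and two candidates, pick the candidate whose embedding has the higher cosine similarity to the reference, then measure how often that pick matches the human choice. This is an illustration under stated assumptions, not the benchmark's actual protocol: the two-alternative setup, the function names (`cosine_similarity`, `pick_closer`, `agreement_with_humans`), and the toy vectors are all hypothetical. In practice the embeddings would come from an encoder such as CLIP.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def pick_closer(reference: Vector, cand_a: Vector, cand_b: Vector) -> int:
    """Return 0 if cand_a is at least as similar to the reference as cand_b, else 1."""
    sim_a = cosine_similarity(reference, cand_a)
    sim_b = cosine_similarity(reference, cand_b)
    return 0 if sim_a >= sim_b else 1


def agreement_with_humans(
    triplets: List[Tuple[Vector, Vector, Vector]],
    human_choices: List[int],
) -> float:
    """Fraction of triplets where the metric's pick equals the human pick (0 or 1)."""
    if len(triplets) != len(human_choices) or not triplets:
        raise ValueError("need one human choice per triplet, and at least one triplet")
    hits = sum(
        pick_closer(ref, a, b) == choice
        for (ref, a, b), choice in zip(triplets, human_choices)
    )
    return hits / len(triplets)


# Toy embeddings (assumed values, standing in for encoder outputs).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # candidate A is closer
    ([0.0, 1.0], [1.0, 0.0], [0.2, 0.8]),  # candidate B is closer
]
toy_human_choices = [0, 1]
score = agreement_with_humans(toy_triplets, toy_human_choices)
```

The agreement score is simply an accuracy, so it can be averaged across datasets or tasks to compare a general-purpose model with a specialized one on the same footing.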

Code & Models

Repositories

saraghazanfari/unisim
PyTorch · Official

Datasets

saraghznfri/unisim_data
dataset · 31 dl

Taxonomy

Topics: Color Science and Applications · Industrial Vision Systems and Defect Detection · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training · ALIGN