TL;DR
This paper introduces MLLM-CompBench, a comprehensive benchmark for evaluating the comparative reasoning abilities of multimodal large language models across diverse visual domains and dimensions.
Contribution
The paper presents a new benchmark dataset of around 40K image pairs, each with an accompanying question, to assess MLLMs' comparative reasoning. It highlights current limitations and points to areas for future improvement.
Findings
Recent MLLMs show significant shortcomings in comparative reasoning.
The benchmark covers eight dimensions of comparison across diverse visual domains.
Evaluation results identify specific areas for model enhancement.
Abstract
The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce MLLM-CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). MLLM-CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and…
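As a rough illustration of what "pairing images through metadata" could look like, the sketch below builds quantity-comparison items from per-image object counts. It is not the authors' pipeline: the `ComparisonItem` record, the `pair_by_count` helper, the file names, and the question template are all hypothetical, and only the eight dimension names come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import combinations


class Dimension(Enum):
    # The eight dimensions of relative comparison named in the abstract.
    VISUAL_ATTRIBUTE = "visual attribute"
    EXISTENCE = "existence"
    STATE = "state"
    EMOTION = "emotion"
    TEMPORALITY = "temporality"
    SPATIALITY = "spatiality"
    QUANTITY = "quantity"
    QUALITY = "quality"


@dataclass(frozen=True)
class ComparisonItem:
    # Hypothetical record layout for one benchmark item (an assumption).
    left_image: str
    right_image: str
    dimension: Dimension
    question: str
    answer: str  # "left" or "right"


def pair_by_count(counts, object_name):
    """Pair images whose object counts differ (hypothetical helper).

    `counts` maps an image name to the number of `object_name` instances
    recorded in that image's metadata.
    """
    items = []
    for (img_a, n_a), (img_b, n_b) in combinations(sorted(counts.items()), 2):
        if n_a == n_b:
            continue  # equal counts give no well-defined answer, so skip
        items.append(
            ComparisonItem(
                left_image=img_a,
                right_image=img_b,
                dimension=Dimension.QUANTITY,
                question=f"Which image contains more {object_name}?",
                answer="left" if n_a > n_b else "right",
            )
        )
    return items


if __name__ == "__main__":
    # Made-up metadata for illustration only.
    demo = pair_by_count({"a.jpg": 3, "b.jpg": 5, "c.jpg": 3}, "apples")
    for item in demo:
        print(item.left_image, item.right_image, item.answer)
```

Skipping pairs with equal counts keeps every generated question answerable with a single choice. A real pipeline would presumably need similar filtering for the other dimensions.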
Taxonomy
Methods: Contrastive Language-Image Pre-training
