MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Jihyung Kil; Zheda Mai; Justin Lee; Zihe Wang; Kerrie Cheng; Lemeng Wang; Ye Liu; Arpita Chowdhury; Wei-Lun Chao

arXiv:2407.16837 · cs.CV · January 14, 2025

TL;DR

This paper introduces MLLM-CompBench, a comprehensive benchmark for evaluating the comparative reasoning abilities of multimodal large language models across diverse visual domains and dimensions.

Contribution

The paper presents a new benchmark of around 40K image pairs with accompanying questions to assess MLLMs' comparative reasoning. Its evaluation highlights current limitations and points to areas for future improvement.

Findings

01. Recent MLLMs show significant shortcomings in comparative reasoning.

02. The benchmark covers eight dimensions of comparison across diverse visual domains.

03. Evaluation results identify specific areas for model enhancement.

Abstract

The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce MLLM-CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). MLLM-CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and…
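As a rough illustration of how results on a benchmark organized this way might be broken down, the sketch below scores predictions separately for each of the eight dimensions listed in the abstract. The `PairQuestion` record, its field names, the "left"/"right" answer format, and the `accuracy_by_dimension` helper are all hypothetical and chosen only for this example; they are not the benchmark's actual schema or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The eight comparison dimensions named in the abstract.
DIMENSIONS = (
    "visual attribute", "existence", "state", "emotion",
    "temporality", "spatiality", "quantity", "quality",
)


@dataclass(frozen=True)
class PairQuestion:
    """Hypothetical record layout, not the benchmark's real schema."""
    left_image: str
    right_image: str
    question: str
    dimension: str
    answer: str  # assumed answer format, e.g. "left" or "right"


def accuracy_by_dimension(items, predictions):
    """Return {dimension: fraction correct} for dimensions that have items."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        if item.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {item.dimension!r}")
        total[item.dimension] += 1
        # Case-insensitive exact match; a real evaluation may parse answers differently.
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy usage with made-up items (file names and questions are placeholders).
items = [
    PairQuestion("a.jpg", "b.jpg", "Which apple looks fresher?", "quality", "left"),
    PairQuestion("c.jpg", "d.jpg", "Which image has more cars?", "quantity", "right"),
    PairQuestion("e.jpg", "f.jpg", "Which image has more birds?", "quantity", "left"),
]
predictions = ["Left", "left", "left"]
scores = accuracy_by_dimension(items, predictions)
```

Reporting accuracy per dimension rather than as a single number makes it easier to see which kinds of comparison a model handles poorly.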

Code & Models

Repositories

raptormai/compbench
Official

Videos

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs · SlidesLive

Taxonomy

Methods: Contrastive Language-Image Pre-training