CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs
Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin

TL;DR
CARV is a new diagnostic benchmark of 5,500 samples for evaluating how well multimodal LLMs perform compositional analogical reasoning. It reveals significant performance gaps and consistent failure modes.
Contribution
This paper presents CARV, the first task and dataset specifically designed to assess compositional analogical reasoning in multimodal LLMs, and uses it to highlight the limitations of current models.
Findings
State-of-the-art MLLMs perform poorly on CARV: even Gemini-2.5 Pro reaches only 40.4% accuracy, far below human-level performance of 100%.
Current models struggle to decompose visual changes into symbolic rules.
Models lack robustness under diverse and complex settings.
Abstract
Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation of state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below the human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual…
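
To make the multi-pair setup concrete, here is a minimal Python sketch of the idea described in the abstract: extract a symbolic rule from each image pair, compose the rules, and apply the result to a query. Everything in it is an illustrative assumption and not the CARV data format or the authors' method: scenes are modeled as hypothetical attribute dictionaries, a rule is the set of attributes that changed, and composition is a simple merge.

```python
from typing import Dict

# Hypothetical symbolic stand-ins for images: attribute name -> value.
Scene = Dict[str, str]
# A rule maps each changed attribute to its new value.
Rule = Dict[str, str]


def extract_rule(before: Scene, after: Scene) -> Rule:
    """Return the attributes whose values differ between the two scenes."""
    return {k: v for k, v in after.items() if before.get(k) != v}


def compose(*rules: Rule) -> Rule:
    """Merge rules into one; on conflict the later rule wins (an assumption)."""
    combined: Rule = {}
    for rule in rules:
        combined.update(rule)
    return combined


def apply_rule(scene: Scene, rule: Rule) -> Scene:
    """Apply a rule to a scene without mutating the input."""
    return {**scene, **rule}


# Pair A changes only the color; pair B changes only the count.
pair_a = ({"color": "red", "count": "1"}, {"color": "blue", "count": "1"})
pair_b = ({"color": "green", "count": "1"}, {"color": "green", "count": "2"})
query = {"color": "yellow", "count": "1"}

rule = compose(extract_rule(*pair_a), extract_rule(*pair_b))
answer = apply_rule(query, rule)
print(answer)  # {'color': 'blue', 'count': '2'}
```

In this toy version the rules are already symbolic. The first failure mode named in the abstract concerns the step this sketch skips: recovering such rules from the visual changes themselves.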
