What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes

Candace Ross; Florian Bordes; Adina Williams; Polina Kirichenko; Mark Ibrahim

arXiv:2511.03768·cs.LG·November 7, 2025

TL;DR

Despite strong single-image perception, multimodal models often hallucinate when reasoning across scenes, which points to a significant gap in real-world scene understanding and reasoning.

Contribution

We introduce Common-O, a new benchmark of more than 10.5k in-the-wild scene examples for evaluating reasoning across scenes. It exposes the limitations of current models and points to potential improvements.
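
To make the "what's in common?" task concrete, here is a minimal sketch of how an answer could be scored against two scenes. It is an illustration under stated assumptions, not the benchmark's actual metric: the function name `score_common_objects`, the use of plain object-name lists, and the lowercase string matching are all hypothetical choices made for this example.

```python
def _normalize(items):
    """Lowercase and trim object names so 'Plant' and 'plant ' match."""
    return {item.strip().lower() for item in items}


def score_common_objects(scene_a, scene_b, predicted):
    """Compare a model's 'what's in common?' answer to the true shared objects.

    Hypothetical helper: assumes each scene is given as a list of object
    names and that the ground truth is the intersection of the two lists.
    """
    truth = _normalize(scene_a) & _normalize(scene_b)
    pred = _normalize(predicted)
    return {
        "correct": sorted(pred & truth),        # shared objects the model named
        "missed": sorted(truth - pred),         # shared objects the model omitted
        "hallucinated": sorted(pred - truth),   # named objects that are not shared
        "exact_match": pred == truth,
    }


result = score_common_objects(
    scene_a=["cup", "laptop", "plant"],
    scene_b=["Plant", "cup", "phone"],
    predicted=["cup", "phone"],
)
print(result)
```

In this toy case the model names "phone", which appears in only one scene, so it counts as a hallucination under this sketch's definition even though the object was perceived correctly in a single image.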

Findings

01

Models perceive objects well in single images.

02

Reasoning across multiple scenes remains highly challenging.

03

Multi-image training improves model performance significantly.

Abstract

Multimodal language models possess a remarkable ability to handle an open-vocabulary's worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking "what's in common?". We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for…

Code & Models

Datasets

facebook/Common-O
dataset · 315 downloads

Videos

What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)