What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim

TL;DR
Multimodal models perceive objects in single images well but often hallucinate when reasoning across scenes, which points to a significant gap in real-world scene understanding and reasoning.
Contribution
We introduce Common-O, a new benchmark with over 10.5k in-the-wild scene examples to evaluate reasoning across scenes and reveal current models' limitations and potential improvements.
Findings
Models perceive objects well in single images.
Reasoning across multiple scenes remains highly challenging.
Multi-image training improves model performance significantly.
Abstract
Multimodal language models possess a remarkable ability to handle an open-vocabulary's worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking "what's in common?". We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for…
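To make the "what's in common?" task concrete, here is a minimal sketch of how one answer might be scored against two annotated scenes. It is an illustration under stated assumptions, not the benchmark's actual metric: the function name `score_common_objects`, the object-list inputs, and the precision/recall summary are all hypothetical, and a predicted object counts as "not shared" whenever it is missing from at least one scene.

```python
def normalize(labels):
    """Lowercase and strip object labels so comparisons ignore casing/spacing."""
    return {label.strip().lower() for label in labels}


def score_common_objects(scene_a, scene_b, predicted):
    """Score a 'what's in common?' answer (hypothetical scoring, not the paper's).

    scene_a, scene_b: iterables of object labels annotated in each scene.
    predicted: iterable of object labels the model claims are in both scenes.
    """
    truth = normalize(scene_a) & normalize(scene_b)  # objects truly shared
    pred = normalize(predicted)

    correct = pred & truth       # predicted and actually shared
    not_shared = pred - truth    # predicted but absent from at least one scene

    precision = len(correct) / len(pred) if pred else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return {
        "correct": correct,
        "not_shared": not_shared,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    result = score_common_objects(
        scene_a=["cup", "laptop", "plant"],
        scene_b=["laptop", "plant", "phone"],
        predicted=["Laptop", "chair"],
    )
    print(result)
```

In this toy example, "chair" appears in neither scene, so it lands in `not_shared`, which is the kind of error the summary above describes as a hallucination.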
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
