MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

TL;DR
MentisOculi is a procedural, stratified suite of multi-step reasoning problems that tests whether multimodal models can use intermediate visualizations as a reasoning aid. Current models generally fail to benefit from such mental imagery.
Contribution
This work develops a novel evaluation framework, MentisOculi, to systematically analyze the use of visualizations in model reasoning, highlighting key limitations in current models.
Findings
Visual strategies, ranging from latent tokens to explicitly generated imagery, generally do not improve performance.
UMMs suffer from errors and do not leverage ground-truth visuals.
Visual thoughts currently do not benefit model reasoning.
Abstract
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Topic Modeling
