PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan

TL;DR
PRISM-Bench is a new benchmark for evaluating multimodal large language models on puzzle-based visual reasoning tasks, focusing on their ability to detect errors in step-by-step reasoning chains to improve trustworthiness.
Contribution
It introduces a diagnostic benchmark with error detection in chain-of-thought reasoning, enabling detailed assessment of logical consistency in multimodal models.
Findings
State-of-the-art models often generate plausible but flawed reasoning chains.
Models struggle to identify the first incorrect step in reasoning chains.
PRISM-Bench reveals a gap between fluent answer generation and faithful reasoning.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications
