PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

Yusu Qian; Cheng Wan; Chao Jia; Yinfei Yang; Qingyu Zhao; Zhe Gan

arXiv:2510.23594·cs.CV·December 2, 2025

Open Access

TL;DR

PRISM-Bench is a benchmark that evaluates multimodal large language models on puzzle-based visual reasoning tasks, focusing on whether they can detect errors in step-by-step reasoning chains as a measure of how trustworthy their reasoning is.

Contribution

It introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought containing exactly one error, a model must identify the first incorrect step. This enables fine-grained assessment of logical consistency in multimodal models.

Findings

01

State-of-the-art models often generate plausible but flawed reasoning chains.

02

Models struggle to identify the first incorrect step in reasoning chains.

03

PRISM-Bench reveals a gap between fluent answer generation and faithful reasoning.
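
The two quantities contrasted in these findings, final-answer accuracy and first-error detection accuracy, can be sketched as follows. This is a minimal illustration only: the record layout, the field names, and the toy values are assumptions made for the example and are not the benchmark's data format or reported results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PuzzleResult:
    """One evaluated puzzle (hypothetical layout, not the benchmark's schema)."""
    gold_answer: str
    predicted_answer: str
    gold_first_error_step: int       # 1-based index of the single injected error
    predicted_first_error_step: int  # step the model flagged as first incorrect


def answer_accuracy(results: List[PuzzleResult]) -> float:
    """Fraction of puzzles whose final answer matches the gold answer."""
    if not results:
        return 0.0
    correct = sum(r.predicted_answer == r.gold_answer for r in results)
    return correct / len(results)


def first_error_accuracy(results: List[PuzzleResult]) -> float:
    """Fraction of puzzles where the flagged step is exactly the first wrong step."""
    if not results:
        return 0.0
    correct = sum(
        r.predicted_first_error_step == r.gold_first_error_step for r in results
    )
    return correct / len(results)


# Toy records invented for illustration; they are not PRISM-Bench data.
toy_results = [
    PuzzleResult("B", "B", 3, 3),
    PuzzleResult("A", "A", 2, 4),
    PuzzleResult("D", "D", 5, 5),
    PuzzleResult("C", "A", 1, 2),
]

gap = answer_accuracy(toy_results) - first_error_accuracy(toy_results)
print(f"answer accuracy:      {answer_accuracy(toy_results):.2f}")
print(f"first-error accuracy: {first_error_accuracy(toy_results):.2f}")
print(f"gap:                  {gap:.2f}")
```

Exact-match scoring on the step index is the strictest reading of "identify the first incorrect step"; a real evaluation might define the match differently.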

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes are sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications