Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
Wayner Barrios, SouYoung Jin

TL;DR
CRYSTAL is a new benchmark for evaluating multimodal reasoning that emphasizes verifiable intermediate steps, revealing systematic failures in current models and proposing a training method that improves reasoning accuracy.
Contribution
The paper introduces CRYSTAL, a diagnostic benchmark with novel metrics and a training curriculum that enhances multimodal reasoning capabilities.
Findings
Evaluating 20 MLLMs reveals systematic reasoning failures that answer accuracy alone does not expose.
CPR-Curriculum improves reasoning accuracy by 32%.
Models cherry-pick reasoning steps (precision far exceeds recall) and produce disordered reasoning chains.
Abstract
We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning…
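The two metrics described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token-overlap `jaccard` similarity, the 0.5 threshold, the greedy one-to-one matching, and the use of a longest increasing subsequence to penalize out-of-order steps are all assumptions made here for illustration, and every function name is hypothetical. The paper's matching relies on semantic similarity, which would replace `jaccard` in practice.

```python
from typing import Callable, List, Tuple

Similarity = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for a semantic similarity model)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_steps(pred: List[str], ref: List[str],
                sim: Similarity = jaccard,
                threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching; returns (pred_idx, ref_idx) pairs sorted by pred_idx."""
    candidates = sorted(
        ((sim(p, r), i, j) for i, p in enumerate(pred) for j, r in enumerate(ref)),
        key=lambda t: -t[0],
    )
    used_pred, used_ref, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:  # assumed cutoff; remaining candidates score lower
            break
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        pairs.append((i, j))
    return sorted(pairs)


def f1(n_match: int, n_pred: int, n_ref: int) -> float:
    """F1 from step-level precision (n_match / n_pred) and recall (n_match / n_ref)."""
    if n_match == 0 or n_pred == 0 or n_ref == 0:
        return 0.0
    precision, recall = n_match / n_pred, n_match / n_ref
    return 2 * precision * recall / (precision + recall)


def longest_increasing_len(seq: List[int]) -> int:
    """Length of the longest strictly increasing subsequence (O(n^2) dynamic program)."""
    best: List[int] = []
    for i, x in enumerate(seq):
        best.append(1 + max((best[k] for k in range(i) if seq[k] < x), default=0))
    return max(best, default=0)


def match_f1(pred: List[str], ref: List[str],
             sim: Similarity = jaccard, threshold: float = 0.5) -> float:
    """Order-agnostic score: every matched step counts."""
    return f1(len(match_steps(pred, ref, sim, threshold)), len(pred), len(ref))


def ordered_match_f1(pred: List[str], ref: List[str],
                     sim: Similarity = jaccard, threshold: float = 0.5) -> float:
    """Order-aware score: only matches that follow the reference order count (assumed rule)."""
    pairs = match_steps(pred, ref, sim, threshold)
    ref_order = [j for _, j in pairs]  # reference indices in predicted order
    return f1(longest_increasing_len(ref_order), len(pred), len(ref))


if __name__ == "__main__":
    ref = ["identify the red car", "count the wheels", "compare with the blue car"]
    pred = ["count the wheels", "identify the red car"]
    print(match_f1(pred, ref), ordered_match_f1(pred, ref))
```

In this toy case the prediction reproduces two of the three reference steps in swapped order. Precision is 1.0 and recall is 2/3, so the unordered score is about 0.8, while the ordered variant credits only one in-order match and drops to about 0.4. The gap between precision and recall is the pattern the abstract calls cherry-picking.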
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · AI-based Problem Solving and Planning
