ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Qiang Xu; Shengyuan Bai; Yu Wang; He Cao; Leqing Chen; Yuanyuan Liu; Bin Feng; Zijing Liu; Yu Li

arXiv:2604.15994 · cs.AI · April 24, 2026

TL;DR

ReactBench is a new benchmark designed to evaluate and expose the limitations of Multimodal Large Language Models in understanding complex topological structures in chemical reaction diagrams, highlighting a significant reasoning gap.

Contribution

The paper introduces ReactBench, a benchmark of 1,618 expert-annotated QA pairs that assesses structural reasoning in MLLMs on chemical reaction diagrams. It reveals a large performance gap between structural and semantic tasks and points to where future models need to improve.

Findings

01. MLLMs perform significantly worse on structural reasoning tasks than on semantic tasks.

02. A performance gap exceeding 30% was observed between different task types.

03. Controlled ablations show the bottleneck is in reasoning, not perception.

Abstract

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical…
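To make the "counting endpoints" task concrete, here is a minimal sketch that treats a diagram as a list of directed arrows and counts the nodes with no outgoing arrow. This is an illustration only: the edge-list representation, the `count_endpoints` helper, and the definition of an endpoint as a node with out-degree zero are assumptions, not the benchmark's own data format or scoring code.

```python
from collections import defaultdict


def count_endpoints(edges):
    """Count nodes that have no outgoing arrow (assumed definition of 'endpoint')."""
    nodes = set()
    out_degree = defaultdict(int)
    for src, dst in edges:
        nodes.update((src, dst))
        out_degree[src] += 1
    return sum(1 for node in nodes if out_degree[node] == 0)


# Hypothetical toy diagrams, one per topology named in the abstract.
linear = [("A", "B"), ("B", "C")]                           # A -> B -> C
branching = [("A", "B"), ("A", "C"), ("C", "D")]            # A splits into two paths
converging = [("A", "C"), ("B", "C")]                       # two flows merge at C
cyclic = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D")]   # loop with one exit to D

for name, diagram in [
    ("linear", linear),
    ("branching", branching),
    ("converging", converging),
    ("cyclic", cyclic),
]:
    print(name, count_endpoints(diagram))
```

Once the graph is written down explicitly the count is trivial, which is consistent with the finding that the difficulty lies in reasoning over the structure in the image rather than in the arithmetic itself.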

Code & Models

Datasets

IDEA-AI4S/ReactBench
dataset · 355 downloads