RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li; Xi Fang; Yixuan Li; Chaozheng Huang; Junjie Wang; Xi Wang; Hongzhe Bai; Bojun Hao; Shenyu Lin; Huiqi Liang; Linfeng Zhang; Guolin Ke

arXiv:2512.23565 · cs.CV · January 29, 2026

Open Access · 3 Datasets

TL;DR

RxnBench is a comprehensive benchmark designed to evaluate multimodal large language models' ability to understand chemical reactions from scientific literature, highlighting current limitations and guiding future improvements.

Contribution

The paper introduces RxnBench, a multi-tiered benchmark for assessing MLLMs on chemical reaction understanding from PDFs, including new tasks and evaluation protocols.

Findings

01. Models excel at extracting explicit text but struggle with chemical logic.

02. Inference-time reasoning improves performance but is still insufficient.

03. No model surpasses 50% accuracy on complex document-level tasks.

Abstract

The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic…
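The accuracy figures quoted above are reported separately for the two tasks. As a minimal sketch of how such per-task accuracy could be tallied, the snippet below assumes each answer is scored by case-insensitive exact match against a gold label. That scoring rule, the function name `accuracy_by_task`, and the toy records are all illustrative assumptions, not the paper's evaluation protocol or data.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: fraction correct} for (task, predicted, gold) tuples.

    Hypothetical helper: exact-match scoring is an assumption made for
    this sketch, not the protocol described in the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        # Compare answers after trimming whitespace and ignoring case.
        if predicted.strip().upper() == gold.strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records invented for illustration only (not RxnBench data).
toy_records = [
    ("SF-QA", "A", "A"),
    ("SF-QA", "b", "B"),
    ("SF-QA", "C", "D"),
    ("FD-QA", "A", "C"),
    ("FD-QA", "D", "D"),
]

scores = accuracy_by_task(toy_records)
for task, score in sorted(scores.items()):
    print(f"{task}: {score:.1%}")
```

Keeping the tally per task, rather than pooling all questions, mirrors the way the findings contrast single-figure and document-level performance.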

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Biomedical Text Mining and Ontologies