Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

Jiaxi Zhuang; Kangning Li; Jue Hou; Mingjun Xu; Zhifeng Gao; Hengxing Cai

arXiv:2506.21625 · cs.CL · October 14, 2025


TL;DR

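Doc2SAR is a framework that combines domain-specific tools with fine-tuned multimodal large language models (MLLMs) to extract structure-activity relationships (SARs) from scientific documents. It targets two weaknesses of earlier approaches: rule-based methods that fail to generalize across document layouts, and general-purpose MLLMs that are not accurate enough for specialized steps such as layout detection and optical chemical structure recognition (OCSR).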

Contribution

The paper introduces Doc2SAR, a synergistic approach that integrates specialized tools with enhanced MLLMs, and provides DocSAR-200, an annotated benchmark of 200 scientific documents for evaluating SAR extraction methods.

Findings

01. Achieves 80.78% Table Recall on DocSAR-200

02. Outperforms GPT-4o by 51.48% in Table Recall

03. Demonstrates practical usability with efficient inference and a web app
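
This summary does not define Table Recall, so the following is only a minimal sketch of how a recall-style score over extracted SAR rows could be computed. It assumes each row is a (molecule, activity) pair and that a predicted row counts only on an exact match. The function name `table_recall`, the row format, and the example values are all hypothetical and may differ from the paper's actual metric.

```python
def table_recall(gold_rows, predicted_rows):
    """Fraction of gold rows that also appear in the prediction.

    Assumption: rows are hashable (molecule, activity) tuples and are
    compared by exact equality. The paper's matching rule may differ.
    """
    gold = set(gold_rows)
    if not gold:
        return 0.0  # no gold rows: define recall as 0 to avoid dividing by zero
    return len(gold & set(predicted_rows)) / len(gold)


# Hypothetical gold annotation and model output for one document.
gold = [
    ("CCO", "IC50=12 nM"),
    ("c1ccccc1", "IC50=340 nM"),
    ("CC(=O)O", "IC50=5 nM"),
]
pred = [
    ("CCO", "IC50=12 nM"),
    ("CC(=O)O", "IC50=5 nM"),
    ("CCN", "IC50=1 nM"),  # spurious row: lowers precision, not recall
]

score = table_recall(gold, pred)
print(f"Table recall: {score:.2%}")
```

Under this definition, a row that the model invents does not change the score, while every gold row it misses does, which is why recall is usually reported alongside a precision-style measure.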

Abstract

Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via…
