Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents
Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai

TL;DR
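Doc2SAR is a framework that combines domain-specific tools with fine-tuned multimodal large language models (MLLMs) to extract structure-activity relationships (SARs) from scientific documents. It targets two weaknesses of earlier methods: rule-based templates that fail to generalize across document layouts, and general-purpose MLLMs that are not accurate enough for specialized tasks.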
Contribution
The paper introduces Doc2SAR, a synergistic framework that integrates specialized tools with enhanced multimodal language models, and DocSAR-200, a rigorously annotated benchmark of 200 scientific documents for evaluating SAR extraction methods.
Findings
Achieves 80.78% Table Recall on DocSAR-200
Outperforms GPT-4o by 51.48% in Table Recall
Demonstrates practical usability with efficient inference and a web app
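This summary does not define Table Recall. As a rough illustration only, assuming it measures the fraction of ground-truth SAR table rows that an extraction recovers exactly, a minimal sketch might look like the following. The row schema (SMILES string, assay label, activity value), the function name `table_recall`, and the example data are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple

# Hypothetical row schema: (SMILES, assay label, activity value).
Row = Tuple[str, str, float]


def table_recall(predicted: Iterable[Row], ground_truth: Iterable[Row]) -> float:
    """Fraction of ground-truth rows that appear in the prediction (exact match).

    This is an assumed, simplified definition for illustration; the paper's
    actual metric may match rows differently.
    """
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(truth & set(predicted)) / len(truth)


# Made-up example data: the second predicted row has a wrong activity value.
truth_rows = [("CCO", "IC50_nM", 12.0), ("c1ccccc1", "IC50_nM", 340.0)]
pred_rows = [("CCO", "IC50_nM", 12.0), ("c1ccccc1", "IC50_nM", 34.0)]

print(table_recall(pred_rows, truth_rows))  # 0.5
```

Under this assumed definition, a row counts only if the molecule and its activity value are both correct, which is why a single misread digit in the example halves the score.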
Abstract
Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via…
