MolMole: Molecule Mining from Scientific Literature
LG AI Research, Sehyun Chun, Jiye Kim, Ahra Jo, Yeonsik Jo, Seungyul Oh, Seungjun Lee, Kwangrok Ryoo, Jongmin Lee, Seung Hwan Kim, Byung Jun Kang, Soonyoung Lee, Jun Ha Park, Chanwoo Moon, Jiwon Ham, Haein Lee, Heejae Han, Jaeseung Byun, Soojong Do, Minju Ha, Dongyun Kim

TL;DR
MolMole is a comprehensive vision-based deep learning framework that automates the extraction of molecular and reaction data from scientific literature, improving accuracy and efficiency over existing tools.
Contribution
It introduces a unified pipeline combining molecule detection, reaction parsing, and chemical structure recognition, along with a new benchmark dataset and evaluation metric.
Findings
MolMole outperforms existing toolkits on benchmark and public datasets.
A new annotated dataset of 550 pages is provided for evaluation.
The framework demonstrates high accuracy in extracting chemical data from complex documents.
Abstract
The extraction of molecular structures and reaction data from scientific documents is challenging due to their varied, unstructured chemical formats and complex document layouts. To address this, we introduce MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline for automating the extraction of chemical data directly from page-level documents. Recognizing the lack of a standard page-level benchmark and evaluation metric, we also present a testset of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, along with a novel evaluation metric. Experimental results demonstrate that MolMole outperforms existing toolkits on both our benchmark and public datasets. The benchmark testset will be publicly available, and the MolMole toolkit will be…
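The abstract mentions page-level molecule bounding boxes and a novel evaluation metric, but the metric itself is not described in this summary. As a rough illustration only, the sketch below scores predicted boxes against annotated ones with standard intersection-over-union (IoU) matching. This is an assumption swapped in for illustration, not the paper's metric; the names `iou` and `match_boxes` and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in page pixel coordinates (assumed layout).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(
    pred: List[Box], gt: List[Box], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching of predictions to ground truth.

    Returns (precision, recall, f1). Standard IoU matching, NOT the
    evaluation metric proposed in the paper.
    """
    unmatched = set(range(len(gt)))
    true_positives = 0
    for p in pred:
        best_index, best_score = None, threshold
        for j in unmatched:
            score = iou(p, gt[j])
            if score >= best_score:
                best_index, best_score = j, score
        if best_index is not None:
            # Each ground-truth box can be claimed by only one prediction.
            unmatched.remove(best_index)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```

With one annotated box and two predictions, only one of which overlaps it, this yields a precision of 0.5 and a recall of 1.0. Greedy matching is the simplest choice here; an optimal assignment (for example, Hungarian matching) could give slightly different counts on crowded pages.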
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Materials Science · Computational Drug Discovery Methods
