ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature
Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni

TL;DR
ComProScanner is a multi-agent framework that automates the extraction, validation, and visualization of complex chemical composition and property data from scientific literature, enhancing dataset creation for machine learning applications.
Contribution
It introduces a novel multi-agent platform that streamlines the extraction and validation of structured chemical data from scientific articles, integrating synthesis data for comprehensive database development.
Findings
DeepSeek-V3-0324 achieved an accuracy of 0.82 on the extraction task, in an evaluation of 10 LLMs on 100 journal articles.
The framework outperforms existing models in extracting complex chemical data.
It provides a user-friendly tool for building datasets from the literature.
Abstract
Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with…
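To make "machine-readable chemical compositions and properties" concrete, here is a minimal sketch of what one extracted record and a simple exact-match accuracy check could look like. The field names (`composition`, `property_name`, `value`, `unit`, `synthesis_method`), the placeholder values, and the scoring rule are illustrative assumptions. None of them is taken from ComProScanner's actual schema or evaluation metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompositionRecord:
    """Hypothetical record for one composition-property pair (not ComProScanner's schema)."""
    composition: str                        # chemical formula as written in the article
    property_name: str                      # name of the reported property
    value: float                            # numeric property value
    unit: str                               # unit of the value
    synthesis_method: Optional[str] = None  # synthesis route, if the article reports one


def exact_match_accuracy(extracted, reference):
    """Return the fraction of reference records found verbatim among the extracted ones."""
    if not reference:
        raise ValueError("reference set must not be empty")
    extracted_set = set(extracted)  # frozen dataclasses are hashable
    hits = sum(1 for record in reference if record in extracted_set)
    return hits / len(reference)


# Placeholder values for illustration only; these are not data from the paper.
reference = [
    CompositionRecord("A0.5B0.5O3", "example_property", 100.0, "a.u.", "solid-state"),
    CompositionRecord("A0.9B0.1O3", "example_property", 250.0, "a.u.", "sol-gel"),
]
extracted = [
    CompositionRecord("A0.5B0.5O3", "example_property", 100.0, "a.u.", "solid-state"),
    CompositionRecord("A0.9B0.1O3", "example_property", 205.0, "a.u.", "sol-gel"),  # wrong value
]

print(exact_match_accuracy(extracted, reference))  # 0.5: one of two reference records matched
```

Exact matching is the strictest possible choice. A real evaluation of composition data would likely need to tolerate equivalent formula notations and rounding differences, which this sketch deliberately leaves out.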
Taxonomy
Topics: Machine Learning in Materials Science · Inorganic Chemistry and Materials · Artificial Intelligence in Healthcare and Education
