Large language model-enabled automated data extraction for concrete materials informatics
Zhanzhao Li, Kengran Yang, Qiyao He, Kai Gong

TL;DR
This paper presents a large language model-based pipeline that automates the extraction of detailed concrete materials data from scientific literature, significantly expanding accessible datasets for materials informatics.
Contribution
It introduces a generalizable LLM-powered method for extracting structured materials data, achieving high accuracy and creating the largest open database for concrete materials.
Findings
Achieved an F1 score of up to 0.97 for data extraction.
Extracted nearly 9,000 records from over 27,000 publications in one hour.
Demonstrated the pipeline's adaptability to other materials domains.
Abstract
The promise of data-driven materials discovery remains constrained by the scarcity of large, high-quality, and accessible experimental datasets. Here, we introduce a generalizable large language model (LLM)-powered pipeline for automated extraction and structuring of materials data from unstructured scientific literature, using concrete materials as a representative and particularly challenging example. The pipeline exhibits robust performance across a broad range of LLMs and achieves an F1 score of up to 0.97 for diverse composition-process-property attributes. Within one hour, it extracts nearly 9,000 high-quality records with over 100 attributes screened from more than 27,000 publications, enabling the construction of the largest open laboratory database for blended cement concrete. Machine learning analyses underscore the importance of large, diverse, and information-rich…
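For readers unfamiliar with the metric quoted above, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of how it could be computed for one extracted record. It assumes exact matching on attribute-value pairs, which may differ from the paper's actual evaluation protocol. The attribute names and values are hypothetical and do not come from the paper's database.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no correct extractions, so F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score_record(extracted: dict, reference: dict) -> float:
    """Score one record by exact match on (attribute, value) pairs.

    Assumption: a pair counts as correct only if both the attribute
    name and its value match the reference exactly.
    """
    extracted_pairs = set(extracted.items())
    reference_pairs = set(reference.items())
    tp = len(extracted_pairs & reference_pairs)  # correct pairs
    fp = len(extracted_pairs - reference_pairs)  # wrong or spurious pairs
    fn = len(reference_pairs - extracted_pairs)  # missed or mismatched pairs
    return f1_score(tp, fp, fn)


# Hypothetical ground-truth record and pipeline output, for illustration only.
reference = {
    "cement_kg_m3": 350,
    "water_kg_m3": 175,
    "curing_days": 28,
    "strength_mpa": 42.5,
}
extracted = {
    "cement_kg_m3": 350,
    "water_kg_m3": 175,
    "curing_days": 7,      # wrong value
    "fly_ash_kg_m3": 50,   # attribute absent from the reference
}

record_f1 = score_record(extracted, reference)
print(f"F1 = {record_f1:.2f}")
```

In this toy case two of four extracted pairs are correct and two of four reference pairs are recovered, so precision and recall are both 0.5. A wrong value is penalized twice under this scheme, once as a false positive and once as a false negative.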
