Optimizing Data Extraction from Materials Science Literature: A Study of Tools Using Large Language Models
Wenkai Ning, Musen Li, Jeffrey R. Reimers, and Rika Kobayashi

TL;DR
This study evaluates five AI tools based on large language models for extracting specific data from materials science literature, highlighting their strengths and limitations to improve future scientific data extraction methods.
Contribution
It provides a comparative analysis of AI tools for data extraction in materials science, offering insights and guidance for enhancing extraction accuracy and efficiency.
Findings
Tools achieved encouraging precision, but the integrity of the extracted data fell short of expectations.
AI tools effectively filtered irrelevant papers.
Insights suggest pathways for improving extraction methodologies.
Abstract
Large Language Models (LLMs) are increasingly used for large-scale extraction and organization of unstructured data owing to their exceptional Natural Language Processing (NLP) capabilities. Vast amounts of data from experiments and simulations that could empower materials design are scattered across numerous scientific publications, but high-quality experimental databases are scarce. This study considers the effectiveness and practicality of five representative AI tools (ChemDataExtractor, BERT-PSIE, ChatExtract, LangChain, and Kimi) for extracting bandgaps from 200 randomly selected Materials Science publications in two presentations (arXiv and publisher versions), comparing the results to those obtained by human processing. Although the integrity of data extraction has not met expectations, encouraging results have been achieved in terms of precision and the ability to eliminate irrelevant…
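The comparison against human processing described in the abstract can be illustrated with a minimal sketch of scoring extracted bandgap records against a hand-annotated reference set. Everything here is an assumption for illustration: the `Record` fields, the 0.05 eV matching tolerance, and the use of standard precision and recall are not taken from the paper, whose exact scoring definitions are not given in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One extracted or human-annotated bandgap value (hypothetical schema)."""
    paper_id: str
    material: str
    bandgap_ev: float


def matches(a: Record, b: Record, tol: float = 0.05) -> bool:
    """Two records match if paper and material agree and the values differ by at most tol (eV)."""
    return (
        a.paper_id == b.paper_id
        and a.material.lower() == b.material.lower()
        and abs(a.bandgap_ev - b.bandgap_ev) <= tol
    )


def precision_recall(extracted, reference, tol: float = 0.05):
    """Return (precision, recall), matching each reference record at most once."""
    unmatched = list(reference)
    true_positives = 0
    for rec in extracted:
        for ref in unmatched:
            if matches(rec, ref, tol):
                unmatched.remove(ref)  # safe: we stop iterating right after
                true_positives += 1
                break
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    reference = [
        Record("p1", "GaN", 3.4),
        Record("p1", "ZnO", 3.3),
        Record("p2", "Si", 1.12),
    ]
    extracted = [
        Record("p1", "GaN", 3.39),  # within tolerance: counted as correct
        Record("p2", "Si", 1.5),    # wrong value: counted as a false positive
    ]
    p, r = precision_recall(extracted, reference)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case one of the two extracted records is correct and one of the three reference records is recovered, so a tool can look precise while still missing most of the data that a human reader would collect.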
Taxonomy
Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management
