Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI
Junhyeong Lee, Jong Min Yuk, and Chan-Woo Lee

TL;DR
This paper introduces a hybrid text-mining framework with symbolic entity markers that significantly improves scientific data extraction from unstructured materials science literature using generative AI.
Contribution
It presents a novel hybrid approach combining multi-step and direct methods with entity marker-based annotations to enhance entity recognition and data structuring in scientific texts.
Findings
Outperforms previous entity recognition methods on benchmark datasets
Achieves up to 58% improvement in entity-level F1 score
Achieves up to 83% improvement in relation-level F1 score
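For readers unfamiliar with the two metrics, the sketch below shows one common way to compute an F1 score over sets of predicted and gold items. It is an illustration only: the function name `f1_score`, the exact-match criterion, and the toy entities are assumptions, not the paper's evaluation protocol, and this summary does not say whether the reported improvements are relative or absolute.

```python
def f1_score(predicted, gold):
    """Exact-match F1 between a set of predicted items and a set of gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Entity level: each item is a (text, label) pair (toy data, hypothetical labels).
gold_entities = {("TiO2", "MATERIAL"), ("band gap", "PROPERTY"), ("3.2 eV", "VALUE")}
pred_entities = {("TiO2", "MATERIAL"), ("3.2 eV", "VALUE"), ("eV", "PROPERTY")}
entity_f1 = f1_score(pred_entities, gold_entities)

# Relation level: each item is a full (material, property, value) triple,
# so one wrong element makes the whole relation count as a miss.
gold_relations = {("TiO2", "band gap", "3.2 eV")}
pred_relations = {("TiO2", "band gap", "3.2 eV")}
relation_f1 = f1_score(pred_relations, gold_relations)
```

Relation-level scoring is stricter than entity-level scoring under this setup, because every element of a triple has to be right at once.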
Abstract
The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent advances in natural language processing (NLP) have facilitated automatic extraction of structured data from unstructured scientific literature. While existing approaches (multi-step and direct methods) offer valuable capabilities, they also come with limitations when applied independently. Here, we propose a novel hybrid text-mining framework that integrates the advantages of both methods to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and subsequently into structured form. Furthermore, beyond the overall data structuring framework, we also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations to highlight…
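The two-stage flow described in the abstract (raw text, then entity-recognized text, then structured form) can be pictured with a small sketch. Everything here is assumed for illustration: the marker symbols in `MARKERS`, the label names, and the helper functions `mark_entities` and `extract_marked` are hypothetical, and the entity spans are supplied by hand, whereas the paper relies on generative AI for these steps. The summary does not specify which symbols the authors use.

```python
import re

# Hypothetical marker symbols per entity type (not taken from the paper).
MARKERS = {
    "MATERIAL": ("@", "@"),
    "PROPERTY": ("#", "#"),
    "VALUE": ("$", "$"),
}


def mark_entities(text, spans):
    """Stage 1: wrap each (start, end, label) span in its marker symbols."""
    marked = text
    # Insert from right to left so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        open_sym, close_sym = MARKERS[label]
        marked = marked[:start] + open_sym + marked[start:end] + close_sym + marked[end:]
    return marked


def extract_marked(marked):
    """Stage 2: read the entity-recognized text back into structured records."""
    records = []
    for label, (open_sym, close_sym) in MARKERS.items():
        pattern = re.escape(open_sym) + r"(.+?)" + re.escape(close_sym)
        for match in re.finditer(pattern, marked):
            records.append({"label": label, "text": match.group(1)})
    return records


raw = "TiO2 has a band gap of 3.2 eV."
spans = [(0, 4, "MATERIAL"), (11, 19, "PROPERTY"), (23, 29, "VALUE")]

marked = mark_entities(raw, spans)
records = extract_marked(marked)
```

With these assumed symbols, `marked` is `@TiO2@ has a #band gap# of $3.2 eV$.`, and `records` lists one entry per marked entity. The intermediate marked text is what lets the second stage work from highlighted entities rather than from raw prose.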
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Graph Neural Networks · Topic Modeling
