1.5 million materials narratives generated by chatbots
Yang Jeong Park, Sung Eun Jerng, Jin-Sung Park, Choah Kwon, Chia-Wei Hsu, Zhichu Ren, Sungroh Yoon, and Ju Li

TL;DR
This paper presents a dataset of nearly 1.5 million AI-generated materials narratives, scored by human experts and ChatGPT-4, intended to support materials discovery that is less biased toward materials over-represented in the literature.
Contribution
It introduces a vast, balanced dataset of natural language descriptions of materials, generated from multiple databases and assessed for quality, advancing AI-driven materials exploration.
Findings
Generated narratives are comparable in quality to human-written content.
The dataset covers a more diverse range of elements than traditional literature.
Human and AI evaluations show similar scoring patterns.
Abstract
The advent of artificial intelligence (AI) has enabled a comprehensive exploration of materials for various applications. However, AI models often prioritize frequently encountered materials in the scientific literature, limiting the selection of suitable candidates based on inherent physical and chemical properties. To address this imbalance, we have generated a dataset of 1,494,017 natural language-material paragraphs based on combined OQMD, Materials Project, JARVIS, COD and AFLOW2 databases, which are dominated by ab initio calculations and tend to be much more evenly distributed on the periodic table. The generated text narratives were then polled and scored by both human experts and ChatGPT-4, based on three rubrics: technical accuracy, language and structure, and relevance and depth of content, showing similar scores but with human-scored depth of content being the most lagging.…
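The abstract's claim that the source databases are "much more evenly distributed on the periodic table" can be made concrete with a simple balance metric. The following is a minimal sketch, not the authors' method: it assumes plain formula strings, counts how many formulas each element appears in, and reports Shannon entropy normalized by its maximum (1.0 means perfectly even coverage). The function names and the two toy formula lists are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Matches an element symbol: one uppercase letter plus an optional lowercase letter.
# Assumption: inputs are simple formula strings such as "Si3N4"; digits are ignored.
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def element_counts(formulas):
    """Count how many formulas each element appears in (presence, not stoichiometry)."""
    counts = Counter()
    for formula in formulas:
        counts.update(set(ELEMENT_PATTERN.findall(formula)))
    return counts


def normalized_entropy(counts):
    """Shannon entropy of the element distribution divided by its maximum, log(k)."""
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical toy inputs, not taken from the paper or its databases.
skewed = ["SiO2", "SiC", "Si3N4", "SiGe", "TiO2"]    # dominated by Si
balanced = ["NaCl", "KBr", "MgO", "CaS", "LiF"]      # every element appears once

skewed_score = normalized_entropy(element_counts(skewed))
balanced_score = normalized_entropy(element_counts(balanced))
print(f"skewed: {skewed_score:.3f}, balanced: {balanced_score:.3f}")
```

A corpus in which a few elements dominate scores below 1.0, while one that spreads evenly over its elements scores close to 1.0. Comparing this score between a literature-derived corpus and a database-derived one is one possible way to quantify the imbalance the abstract describes.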
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling
