Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions

Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim, and Jason Hattrick-Simpers

arXiv:2409.14572 · cs.CL · August 15, 2025 · 3 cites

TL;DR

This paper evaluates the performance and robustness of large language models in materials science tasks, revealing their strengths and vulnerabilities across question answering and property prediction under various conditions.

Contribution

It provides a comprehensive assessment of LLMs in materials science, highlighting their behaviors, limitations, and potential for improvement in domain-specific applications.

Findings

01. LLMs perform variably across different materials science tasks.

02. Robustness of LLMs is challenged by adversarial and noisy inputs.

03. Unique phenomena like mode collapse and performance recovery are observed.

Abstract

Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The…
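The three prompting strategies named in the abstract can be illustrated with a minimal sketch. The templates below are assumptions for illustration only: the function names, the persona sentence, the "Let's think step by step." trigger, and the Input/Output layout are hypothetical and are not taken from the paper, whose exact prompts are not shown here.

```python
from typing import Sequence, Tuple


def format_mcq(question: str, choices: Sequence[str]) -> str:
    """Render a multiple-choice question with lettered options (A, B, C, ...)."""
    lines = [question]
    for i, choice in enumerate(choices):
        lines.append(f"{chr(ord('A') + i)}. {choice}")
    return "\n".join(lines)


def zero_shot_cot_prompt(question: str, choices: Sequence[str]) -> str:
    """Zero-shot chain-of-thought: no examples, only a reasoning trigger.

    The trigger phrase is an assumed, commonly used wording.
    """
    return format_mcq(question, choices) + "\nLet's think step by step."


def expert_prompt(question: str, choices: Sequence[str]) -> str:
    """Expert prompting: prepend a domain persona (hypothetical wording)."""
    persona = "You are an expert in materials science."
    return persona + "\n" + format_mcq(question, choices) + "\nAnswer:"


def few_shot_prompt(examples: Sequence[Tuple[str, str]], query: str) -> str:
    """Few-shot in-context learning: labeled (input, output) pairs, then the query.

    For property prediction, an input could be a composition or a structure
    description and the output the property value, both passed as text.
    """
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Placeholder strings only; these are not items from the paper's datasets.
    print(zero_shot_cot_prompt("<question text>", ["<option 1>", "<option 2>"]))
    print()
    print(few_shot_prompt([("<composition 1>", "<yield strength 1>")],
                          "<composition 2>"))
```

Keeping each strategy as a separate pure function makes it straightforward to run the same question through every template and compare model responses under otherwise identical conditions.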

Taxonomy

Topics: Big Data and Business Intelligence · Data Quality and Management

Methods: Sparse Evolutionary Training