MatTools: Benchmarking Large Language Models for Materials Science Tools
Siyu Liu, Bo Hu, Beilin Ye, Jiamin Xu, David J. Srolovitz, Tongqi Wen

TL;DR
MatTools introduces a comprehensive benchmark to evaluate large language models' ability to understand and generate code for materials science applications, combining QA and real-world tool usage assessments.
Contribution
This work presents a novel benchmark framework, including a large QA dataset and real-world code generation tasks, for evaluating LLMs in materials science contexts.
Findings
Generalist LLMs outperform specialists
AI models are aware of other AI models
Simpler models perform better in this domain
Abstract
Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery and alloy design. At the same time, a wide range of physics-based computational approaches have been developed with which materials properties can be calculated. Here, we propose a benchmark application to evaluate the proficiency of LLMs in answering materials science questions through the generation and safe execution of code based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase…
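The abstract mentions "generation and safe execution" of model-written code but this page does not say how execution is isolated. As a minimal sketch only, and not the authors' harness, one common approach is to run each generated script in a separate interpreter process with a timeout. The function name `run_generated_code`, the returned dictionary keys, and the 10-second default are all assumptions made for illustration. A subprocess with a timeout guards against hangs and crashes but is not a full security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 10.0) -> dict:
    """Run model-generated Python in a separate process (hypothetical helper).

    Returns a dict with 'ok', 'stdout', and 'error' keys. This isolates
    crashes and infinite loops from the caller; it does NOT restrict file
    or network access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep any files the script writes in a throwaway dir
                capture_output=True,  # collect stdout/stderr instead of printing them
                text=True,
                timeout=timeout_s,    # kill the child if it runs too long
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "error": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout.strip(),
        "error": proc.stderr.strip(),
    }


if __name__ == "__main__":
    # Stand-in for an LLM answer; a real task would call a materials package.
    result = run_generated_code("print(1 + 1)")
    print(result["ok"], result["stdout"])
```

A benchmark loop could then compare `stdout` against a reference answer and count a timeout or non-zero exit as a failed attempt; how MatTools actually scores outputs is not stated here.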
Taxonomy
Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Inorganic Chemistry and Materials
