MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models

Michiko Yoshitake (1), Yuta Suzuki (2), Ryo Igarashi (1), Yoshitaka Ushiku (1), Keisuke Nagato (3) ((1) OMRON SINIC X, (2) Osaka Univ., (3) Univ. Tokyo)

arXiv:2409.03161 · cs.CL · December 2, 2024

TL;DR

MaterialBENCH is a new benchmark dataset based on university textbooks designed to evaluate large language models' problem-solving abilities in materials science, covering free-response and multiple-choice questions to assess reasoning skills.

Contribution

The paper introduces MaterialBENCH, a comprehensive dataset for assessing LLMs in materials science problem-solving, and analyzes model performance differences across question types and input methods.

Findings

01. LLMs show varied performance on free-response vs. multiple-choice questions.

02. Using system messages influences LLM performance on multiple-choice problems.

03. MaterialBENCH highlights areas for improving LLM reasoning in materials science.

Abstract

A college-level benchmark dataset for large language models (LLMs) in the materials science field, MaterialBENCH, is constructed. This dataset consists of problem-answer pairs, based on university textbooks. There are two types of problems: one is the free-response answer type, and the other is the multiple-choice type. Multiple-choice problems are constructed by adding three incorrect answers as choices to a correct answer, so that LLMs can choose one of the four as a response. Most of the problems for free-response answer and multiple-choice types overlap except for the format of the answers. We also conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with the OpenAI API. The differences and similarities in the performance of LLMs measured by the MaterialBENCH are analyzed and discussed.…
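The multiple-choice construction described in the abstract (one correct answer plus three incorrect answers, giving four choices) can be sketched as follows. This is a minimal illustration and not the authors' code: the names `MultipleChoiceItem`, `build_multiple_choice`, and `format_prompt`, the seeded shuffle, the A-D labelling, and the sample question are all assumptions made for the example and are not taken from the dataset.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """Hypothetical container for one four-choice problem."""
    question: str
    choices: list[str]
    answer_index: int  # position of the correct answer within `choices`


def build_multiple_choice(question, correct, distractors, seed=0):
    """Combine one correct answer with three incorrect ones and shuffle them.

    The shuffle and the fixed seed are assumptions of this sketch; the
    abstract only states that three incorrect answers are added.
    """
    if len(distractors) != 3:
        raise ValueError("expected exactly three incorrect answers")
    if correct in distractors:
        raise ValueError("the correct answer must not appear among the distractors")
    choices = [correct, *distractors]
    random.Random(seed).shuffle(choices)  # reproducible ordering
    return MultipleChoiceItem(question, choices, choices.index(correct))


def format_prompt(item):
    """Render the problem as text with choices labelled A to D."""
    lines = [item.question]
    lines += [f"{label}. {choice}" for label, choice in zip("ABCD", item.choices)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative question written for this sketch, not drawn from MaterialBENCH.
    item = build_multiple_choice(
        "How many atoms does the conventional unit cell of an FCC crystal contain?",
        correct="4",
        distractors=["1", "2", "8"],
    )
    print(format_prompt(item))
    print("Correct choice:", "ABCD"[item.answer_index])
```

Keeping the free-response answer as the `correct` field mirrors the abstract's remark that most problems overlap between the two types and differ only in answer format.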

Taxonomy

Topics: Machine Learning in Materials Science

Methods: Cosine Annealing · Absolute Position Encodings · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Linear Warmup With Cosine Annealing · Transformer · Byte Pair Encoding