Expert Evaluation of LLM World Models: A High-$T_c$ Superconductivity Case Study

Haoyu Guo; Maria Tikhanovskaya; Paul Raccuglia; Alexey Vlaskin; Chris Co; Daniel J. Liebling; Scott Ellsworth; Matthew Abraham; Elizabeth Dorfman; N. P. Armitage; Chunhan Feng; Antoine Georges; Olivier Gingras; Dominik Kiese; Steven A. Kivelson; Vadim Oganesyan; B. J. Ramshaw; Subir Sachdev; T. Senthil; J. M. Tranquada; Michael P. Brenner; Subhashini Venugopalan; Eun-Ah Kim

arXiv:2511.03782 · cond-mat.supr-con · March 12, 2026

TL;DR

This study assesses how well six LLM-based systems answer complex, expert-level questions about the high-temperature cuprate superconductivity literature, and highlights the strengths and limitations of current systems.

Contribution

The paper introduces a benchmark of 67 expert-formulated questions over an expert-curated database of 1,726 papers, together with an evaluation rubric for assessing LLMs' scientific understanding in a specialized domain.

Findings

1. RAG-based LLM systems outperform closed models at giving comprehensive answers.
2. Expert evaluation highlights the strengths and weaknesses of current LLMs.
3. Benchmark tools facilitate the assessment of LLM reasoning about scientific literature.

Abstract

Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high-temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert-curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert-formulated questions that probe deep understanding of the literature. We then evaluate six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the…
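For readers unfamiliar with retrieval-augmented generation, the retrieval step mentioned in the abstract can be illustrated with a minimal, text-only sketch. This is not the authors' system: the bag-of-words cosine scoring, the function names (`retrieve`, `build_prompt`), and the three placeholder passages in `TOY_CORPUS` are all assumptions made for illustration, and image retrieval is not covered.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder passages, NOT taken from the paper's curated database.
TOY_CORPUS = {
    "doc_a": "pseudogap observed in underdoped cuprates by photoemission",
    "doc_b": "charge stripe order in lanthanum based cuprates seen with neutron scattering",
    "doc_c": "notes on dilution refrigerator maintenance",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, corpus, k=2):
    """Return the ids of the k passages most similar to the question."""
    query = Counter(tokenize(question))
    scored = [
        (cosine(query, Counter(tokenize(text))), doc_id)
        for doc_id, text in corpus.items()
    ]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(question, corpus, k=2):
    """Assemble the text that would be passed to a language model."""
    ids = retrieve(question, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("What is the evidence for stripe order in cuprates?", TOY_CORPUS))
```

A production system would typically replace the bag-of-words scoring with learned embeddings and a vector index; the sketch only shows the idea that the model's answer is conditioned on passages retrieved from a curated corpus rather than on its parametric memory alone.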
