An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Anna Martin-Boyle; William Humphreys; Martha Brown; Cara Leckey; Harmanpreet Kaur

arXiv:2602.21059·cs.HC·February 25, 2026

Open Access

TL;DR

This paper introduces a structured schema for evaluating errors made by large language models in scholarly question-answering systems. The schema reflects how domain experts assess LLM outputs in practice, with the aim of making error detection more reliable.

Contribution

The paper presents an error evaluation schema, developed with domain experts and validated with practicing scientists, for assessing LLM outputs in scholarly contexts.

Findings

01 Identified 20 error patterns across 7 categories through expert analysis.

02 Validated the schema with 10 scientists, showing improved error detection.

03 Schema supports personalized and context-aware evaluation tools.

Abstract

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them…

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Computational and Text Analysis Methods