Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K-12 Science Instructional Materials

Peng He; Zhaohui Li; Zeyuan Wang; Jinjun Xiong; Tingting Li

arXiv:2602.13243·cs.CY·February 17, 2026

TL;DR

This study examines what human experts notice when reviewing LLM-generated evaluations of K-12 science instructional materials, with the aim of informing a GenAI tool for designing high-quality, standards-aligned educational content.

Contribution

It analyzes expert review patterns of LLM evaluations to identify strengths, gaps, and nuances, guiding the creation of a domain-specific GenAI instructional design agent.

Findings

01. Experts identify key reasoning strengths and gaps in LLM evaluations.

02. Patterns of agreement and disagreement reveal contextual nuances in AI judgments.

03. Insights will inform the development of a specialized GenAI tool for instructional design.

Abstract

Designing high-quality, standards-aligned instructional materials for K-12 science is time-consuming and expertise-intensive. This study examines what human experts notice when reviewing AI-generated evaluations of such materials, aiming to translate their insights into design principles for a future GenAI-based instructional material design agent. We intentionally selected 12 high-quality curriculum units across life, physical, and earth sciences from validated programs such as OpenSciEd and Multiple Literacies in Project-based Learning. Using the EQuIP rubric with 9 evaluation items, we prompted GPT-4o, Claude, and Gemini to produce numerical ratings and written rationales for each unit, generating 648 evaluation outputs. Two science education experts independently reviewed all outputs, marking agreement (1) or disagreement (0) for both scores and rationales, and offering qualitative…
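
The counts in the abstract can be sanity-checked with a short sketch. It assumes that the 648 outputs arise from counting each numerical rating and each written rationale as a separate output (12 units × 9 rubric items × 3 models × 2 output kinds); the abstract does not spell out this breakdown. The function names `enumerate_outputs` and `agreement_rate` are hypothetical, and the example marks are made up for illustration, not taken from the study.

```python
from itertools import product

N_UNITS = 12                               # curriculum units (from the abstract)
N_ITEMS = 9                                # EQuIP rubric items (from the abstract)
MODELS = ("GPT-4o", "Claude", "Gemini")    # models named in the abstract
OUTPUT_KINDS = ("score", "rationale")      # assumption: each counted separately


def enumerate_outputs():
    """List every (unit, item, model, kind) combination under the assumption above."""
    return list(product(range(1, N_UNITS + 1),
                        range(1, N_ITEMS + 1),
                        MODELS,
                        OUTPUT_KINDS))


def agreement_rate(marks):
    """Share of outputs an expert marked 1 (agree) among binary 0/1 marks."""
    marks = list(marks)
    if not marks:
        raise ValueError("marks must not be empty")
    if any(m not in (0, 1) for m in marks):
        raise ValueError("marks must be 0 (disagree) or 1 (agree)")
    return sum(marks) / len(marks)


if __name__ == "__main__":
    print(len(enumerate_outputs()))        # 648 under the stated assumption
    hypothetical_marks = [1, 1, 0, 1]      # illustrative only, not study data
    print(agreement_rate(hypothetical_marks))  # 0.75
```

Under this reading, an expert's agreement with scores and with rationales could be summarized separately by filtering the enumerated outputs on the kind field before calling `agreement_rate`.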

Taxonomy

Topics: Teaching and Learning Programming · Science Education and Pedagogy · Intelligent Tutoring Systems and Adaptive Learning