SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
Zhaohui Li, Peng He, Zhiyuan Chen, Honglu Liu, Zeyuan Wang, Tingting Li, Jinjun Xiong

TL;DR
This paper introduces SciEval, a benchmark dataset for evaluating AI models' ability to automatically assess K-12 science instructional materials, highlighting the need for domain-specific fine-tuning.
Contribution
Creates the first Automatic Instructional Materials Evaluation (AIME) dataset with expert annotations and evaluates LLMs on it, demonstrating the benefits of domain-specific fine-tuning for educational assessment.
Findings
Mainstream LLMs perform poorly on SciEval without fine-tuning.
Fine-tuning Qwen3 improves performance by up to 11%.
Expert annotations show high inter-rater reliability.
Abstract
The need to evaluate instructional materials for K-12 science education has become increasingly important, as more educators use generative AI to create instructional materials. However, the review of instructional materials is time-consuming, expertise-intensive, and difficult to scale, motivating interest in automated evaluation approaches. While large language models (LLMs) have shown strong performance on general evaluation tasks, their performance and reliability on instructional materials remain unclear. To address this gap, we formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task that predicts scores and evidence using the rubric designed by the educator. We create a benchmark dataset and develop baseline models for AIME. First, we curate the first AIME dataset, SciEval, consisting of instructional materials annotated with pedagogy-aligned…
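To make the task formulation concrete, the following is a minimal sketch of what an AIME input and output might look like: a rubric of criteria, and a predicted score with supporting evidence for each criterion. All names here (`RubricCriterion`, `CriterionJudgment`, `is_well_formed`) and the score ranges are illustrative assumptions, not the schema used by SciEval.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricCriterion:
    """One rubric item (hypothetical structure, not SciEval's schema)."""
    name: str
    description: str
    min_score: int
    max_score: int


@dataclass
class CriterionJudgment:
    """A model's predicted score plus the evidence it cites for one criterion."""
    criterion: str
    score: int
    evidence: str  # excerpt quoted from the instructional material


def is_well_formed(judgment: CriterionJudgment,
                   rubric: List[RubricCriterion],
                   material: str) -> bool:
    """Check that a judgment names a rubric criterion, stays within that
    criterion's score range, and quotes evidence found in the material."""
    by_name = {c.name: c for c in rubric}
    criterion = by_name.get(judgment.criterion)
    if criterion is None:
        return False
    in_range = criterion.min_score <= judgment.score <= criterion.max_score
    grounded = judgment.evidence in material
    return in_range and grounded
```

A check like this only verifies that an output is structurally valid; whether the score agrees with expert annotations is a separate question that the benchmark itself addresses.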
