From Test-taking to Cognitive Scaffolding: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests

Luoxi Tang; Tharunya Sundar; Yuqiao Meng; Shuai Yang; Ankita Patra; Lakshmi Manohar Chippada; Jiqian Zhao; Yi Li; Weicheng Ma; Zhaohan Xi

arXiv:2505.17056·cs.CL·May 1, 2026

TL;DR

This paper introduces ESTBook, a comprehensive benchmark for evaluating LLMs on English standardized tests, emphasizing reasoning, misconceptions, and cognitive trajectories to improve educational AI tools.

Contribution

It presents a novel pedagogical diagnostic framework and a large multimodal benchmark that enriches questions with reasoning paths and misconceptions, advancing educational AI evaluation.

Findings

01 Identifying cognitive trajectories helps reduce performance gaps.

02 Enriching questions with reasoning and distractors improves pedagogical reasoning.

03 The framework demonstrates practical utility in educational contexts.

Abstract

As large language models (LLMs) are increasingly integrated into educational tools, current evaluations on standardized tests predominantly focus on binary outcome accuracy. Instead, an effective AI tutor must exhibit faithful reasoning, elucidate solution strategies, and diagnose specific human misconceptions. To bridge this gap, we introduce a pedagogical diagnostic framework that models English Standardized Test (EST) problem-solving as a traversal through a cognitive framework. Based on this framework, we present ESTBook, a multimodal benchmark encompassing 10,576 questions and 29 task types across five major exams. Unlike traditional datasets, ESTBook goes beyond data aggregation by enriching questions with formalized reasoning trajectories and distractor rationales that capture specific cognitive traps. Through extensive evaluations, we empirically demonstrate the practical…
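
The abstract's contrast between binary accuracy and misconception diagnosis can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Item` schema, its field names, the `diagnose` helper, and the sample vocabulary question are hypothetical and are not taken from ESTBook or its data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    """Hypothetical schema for one enriched question (not ESTBook's actual format)."""
    question: str
    options: Dict[str, str]      # option label -> option text
    answer: str                  # label of the correct option
    trajectory: List[str]        # ordered reasoning steps toward the answer
    # wrong option label -> the cognitive trap that option is meant to capture
    distractor_rationales: Dict[str, str] = field(default_factory=dict)


def diagnose(item: Item, chosen: str) -> Dict[str, Optional[object]]:
    """Return correctness plus the misconception tied to the chosen distractor."""
    correct = chosen == item.answer
    misconception = None
    if not correct:
        # Fall back to "unlabeled" when no rationale was annotated for the choice.
        misconception = item.distractor_rationales.get(chosen, "unlabeled")
    return {"correct": correct, "misconception": misconception}


# Made-up example item, for illustration only.
item = Item(
    question="Choose the word closest in meaning to 'scarce'.",
    options={"A": "rare", "B": "frightening", "C": "plentiful"},
    answer="A",
    trajectory=[
        "Recall that 'scarce' means in short supply.",
        "Look for an option that means uncommon.",
        "Select 'rare'.",
    ],
    distractor_rationales={
        "B": "confuses 'scarce' with 'scary' (surface similarity)",
        "C": "selects the antonym instead of the synonym",
    },
)

print(diagnose(item, "A"))
print(diagnose(item, "C"))
```

A binary-accuracy evaluation would keep only the `correct` field; the extra `misconception` field is what lets a wrong answer be traced back to a specific trap under this assumed schema.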
