Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh

TL;DR
This paper introduces Generative-Evaluative Agreement (GEA), a validity criterion for LLM-based assessment that measures whether a model's scoring recovers the skill levels the same model was instructed to generate. A first measurement shows where that agreement holds and where it breaks down.
Contribution
It proposes GEA as a validity criterion for LLM-enabled assessment and reports the first direct measurement of it on a two-stage adaptive assessment, broken down by skill type and scoring bias.
Findings
The model recovers roughly half the intended variance (r = 0.698), with systematic positive bias
GEA is strong (r > 0.7) for syntactically verifiable skills
GEA is near zero for design-level skills
Abstract
When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698), with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA, and we outline complementary mitigations.
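To make the reported quantities concrete, the sketch below computes a GEA-style summary as a Pearson correlation between intended and recovered skill levels, together with the mean signed bias. It is an illustration under stated assumptions, not the authors' implementation: the function names `pearson_r` and `gea_summary`, the 0-to-1 skill scale, and the sample numbers are all hypothetical, and the abstract does not specify how the paper computes bias. It also shows the arithmetic behind "roughly half the intended variance": squaring r = 0.698 gives about 0.487.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def gea_summary(intended, recovered):
    """Summarize agreement between intended and recovered skill levels.

    intended:  skill levels the generator was instructed to produce
    recovered: skill levels the scorer assigned to the generated responses
    (Both on the same scale; the 0-1 scale used below is an assumption.)
    """
    r = pearson_r(intended, recovered)
    # A positive mean bias means the scorer overestimates on average.
    mean_bias = sum(s - t for t, s in zip(intended, recovered)) / len(intended)
    return {"r": r, "r_squared": r * r, "mean_bias": mean_bias}


# Hypothetical data: low intended skill is overestimated more than high skill.
intended = [0.2, 0.4, 0.6, 0.8]
recovered = [0.5, 0.55, 0.7, 0.85]
summary = gea_summary(intended, recovered)

# Variance share implied by the correlation reported in the abstract.
reported_r = 0.698
reported_variance_share = reported_r ** 2  # about 0.487
```

Running the same summary separately per skill would surface the contrast the abstract describes between syntactically verifiable and design-level skills, provided enough scored responses are available for each skill.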
