Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

Grandee Lee; Yue Wang; Che Yee Lye; Luke Peh

arXiv:2605.19529·cs.AI·May 20, 2026

TL;DR

This paper introduces Generative-Evaluative Agreement (GEA), a validity criterion for LLM-based assessment that measures whether a model's scoring recovers the skill levels its own generation was instructed to produce. GEA is strong for syntactically verifiable skills and near zero for design-level skills.

Contribution

It proposes GEA as a validity criterion for LLM-enabled adaptive assessment and reports the first direct measurement of it on a two-stage adaptive assessment, broken down by skill type and scoring bias.

Findings

01

The model recovers roughly half the intended variance (r = 0.698), with systematic positive bias

02

Strong GEA (r > 0.7) for syntactically verifiable skills

03

Near-zero GEA for design-level skills

Abstract

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
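As a rough illustration of the quantities the abstract reports, the sketch below computes a Pearson correlation between intended and scored skill levels, its square (the share of variance recovered; 0.698 squared is about 0.487, consistent with "roughly half"), and the mean signed bias. The function names `pearson_r` and `gea_summary`, the 1-to-5 skill scale, and the sample data are assumptions made for this example; the paper's actual measurement pipeline is not described on this page.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def gea_summary(intended, scored):
    """Summarize agreement between intended and scored skill levels.

    intended: levels the generator was instructed to produce (hypothetical scale)
    scored:   levels the same model assigned when scoring the responses
    """
    r = pearson_r(intended, scored)
    # Positive mean bias means the scorer rates responses above the intended level.
    mean_bias = sum(s - i for i, s in zip(intended, scored)) / len(intended)
    return {"r": r, "r_squared": r * r, "mean_bias": mean_bias}


# Hypothetical data: low intended levels are overestimated, high ones are not.
intended = [1, 2, 3, 4, 5]
scored = [3, 3, 4, 4, 5]
summary = gea_summary(intended, scored)
print(summary)
```

In this made-up sample the overestimation is concentrated at the low end of the scale, which is the pattern the abstract flags as inflating scores near the routing threshold. Running the same summary separately per skill would be one way to expose the contrast between syntactically verifiable and design-level skills.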
