Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment
Jill Burstein, Ramsey Cardwell, Ping-Ling Chuang, Allison Michalowski, Steven Nydick

TL;DR
This large-scale study investigates how AI-generated practice tests influence language test scores, test-taker confidence, and score-sharing behaviors in high-stakes assessments, revealing optimal practice levels and potential washback effects.
Contribution
To the authors' knowledge, it is the first large-scale study to examine how AI-enabled practice tests relate to high-stakes language assessment outcomes and test-taker affect.
Findings
Taking 1-3 practice tests is associated with higher scores and confidence.
More than 3 practice tests may decrease performance.
Positive affect is associated with a higher likelihood of score sharing.
Abstract
Practice tests for high-stakes assessment are intended to build test familiarity and reduce construct-irrelevant variance, which can interfere with valid score interpretation. Generative AI-driven automated item generation (AIG) scales the creation of large item banks and multiple practice tests, enabling repeated practice opportunities. We conducted a large-scale observational study (N = 25,969) using the Duolingo English Test (DET), a digital, high-stakes, computer-adaptive English language proficiency test, to examine how increased access to repeated test practice relates to official DET scores, test-taker affect (e.g., confidence), and score-sharing for university admissions. To our knowledge, this is the first large-scale study exploring the use of AIG-enabled practice tests in high-stakes language assessment. Results showed that taking 1-3 practice tests was associated with…
