Beyond Accuracy: Towards a Robust Evaluation Methodology for AI Systems for Language Education
James Edgell, Wm. Matthew Kennedy, Isaac Pattis, Ben Knight, Danielle Carvalho, Elizabeth Wonnacott

TL;DR
This paper introduces L2-Bench, a comprehensive evaluation benchmark for AI systems in language education, integrating pedagogical theory and expert-curated data to improve assessment of AI pedagogical effectiveness.
Contribution
It presents a novel, holistic evaluation methodology and benchmark for AI in language education, grounded in pedagogical theory and expert-curated datasets.
Findings
Authentic task-response pairs rated highly by experts
Lower inter-annotator agreement despite internal consistency
Pilot validation shows potential for iterative improvement
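The second finding contrasts two different reliability notions: agreement between annotators and consistency within a set of scores. The summary does not say which statistics the authors used, so the following is only a minimal sketch that assumes Cohen's kappa for inter-annotator agreement and Cronbach's alpha for internal consistency. The function names and the toy scores are hypothetical and are not taken from L2-Bench.

```python
from statistics import variance


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical labels on the same items.

    Assumes the raters do not both use a single identical label for every
    item (otherwise expected agreement is 1 and kappa is undefined).
    """
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of k score lists, one per rubric item,
    each holding one score per rated response."""
    k = len(items)
    # Total score per response, summed across the k rubric items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Invented toy data: two raters who agree only at chance level.
kappa = cohen_kappa([1, 1, 2, 2], [1, 2, 1, 2])
# Invented toy data: two rubric items that move together perfectly.
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
```

Under these assumptions, a rubric can show high alpha (its criteria move together within each annotator's scores) while kappa between annotators stays low, which is one way the pattern reported above could arise.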
Abstract
The rapid adoption of large language models in AI-powered language education has created an urgent need for evaluations that assess pedagogical effectiveness, particularly in language learning, one of the most common LLM use cases (Tamkin et al. 2024, Costa-Gomes et al. 2025). The literature currently offers only narrowly defined, task-specific evaluations of AI system capabilities in second language (L2) education, so more holistic approaches are needed in AI for education. To address this gap, we introduce L2-Bench, a novel evaluation benchmark grounded in a validated "language learning experience designer" construct to assess AI capabilities across L2 education contexts. Our methodology integrates pedagogical theory and sociotechnical AI evaluation methods, and operationalizes a hierarchical taxonomy to structure an expert-curated dataset of over 1,000 authentic rubric-scored…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Intelligent Tutoring Systems and Adaptive Learning
