Beyond Accuracy: Towards a Robust Evaluation Methodology for AI Systems for Language Education

James Edgell; Wm. Matthew Kennedy; Isaac Pattis; Ben Knight; Danielle Carvalho; Elizabeth Wonnacott

arXiv:2603.20088·cs.CY·March 23, 2026

Open Access

TL;DR

This paper introduces L2-Bench, a comprehensive evaluation benchmark for AI systems in language education, integrating pedagogical theory and expert-curated data to improve assessment of AI pedagogical effectiveness.

Contribution

It presents a novel, holistic evaluation methodology and benchmark for AI in language education, grounded in pedagogical theory and expert-curated datasets.

Findings

01. Authentic task-response pairs rated highly by experts
02. Lower inter-annotator agreement despite internal consistency
03. Pilot validation shows potential for iterative improvement

Abstract

The rapid adoption of large language models in AI-powered language education has created an urgent need for evaluations that assess pedagogical effectiveness, particularly in language learning, one of the most common LLM use cases (Tamkin et al. 2024, Costa-Gomes et al. 2025). The literature currently offers only narrowly defined, task-specific evaluations of AI system capabilities in second language (L2) education, so more holistic approaches are needed in the AI-for-education space. To address this gap, we introduce L2-Bench, a novel evaluation benchmark grounded in a validated "language learning experience designer" construct to assess AI capabilities across L2 education contexts. Our methodology integrates pedagogical theory and sociotechnical AI evaluation methods, and operationalizes a hierarchical taxonomy to structure an expert-curated dataset of over 1,000 authentic rubric-scored…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Intelligent Tutoring Systems and Adaptive Learning