ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations

Brihi Joshi; Keyu He; Sahana Ramnath; Sadra Sabouri; Kaitlyn Zhou; Souti Chattopadhyay; Swabha Swayamdipta; Xiang Ren

arXiv:2506.14200 · cs.CL · June 18, 2025

TL;DR

This paper introduces ELI-Why, a benchmark for evaluating language models' ability to generate educational explanations tailored to different learning levels, revealing current models' limitations in pedagogical utility.

Contribution

The paper presents ELI-Why, a new benchmark with extensive human studies to assess and quantify the pedagogical effectiveness of language model explanations across educational levels.

Findings

01. GPT-4-generated explanations match their intended educational level only 50% of the time.

02. Lay human-curated explanations match their intended educational level 79% of the time.

03. Automated metrics cannot distinguish explanations across different educational levels.

Abstract

Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role…

Taxonomy

Topics: Natural Language Processing Techniques