Benchmarking the Pedagogical Knowledge of Large Language Models

Maxime Lelièvre; Amy Waldock; Meng Liu; Natalia Valdés Aspillaga; Alasdair Mackintosh; María José Ogando Portela; Jared Lee; Paul Atherton; Robin A. A. Ince; Oliver G. B. Garrod

arXiv:2506.18710 · cs.CL · July 2, 2025

TL;DR

This paper introduces a new benchmark dataset for evaluating large language models' understanding of pedagogy. It addresses a gap left by existing benchmarks, which focus mainly on content knowledge, by targeting teaching strategies and special educational needs.

Contribution

The paper presents The Pedagogy Benchmark, a novel dataset for assessing models' pedagogical knowledge. Its questions are sourced from professional development exams for teachers, and the paper documents the methodology and provides online leaderboards.

Findings

1. Across the 97 models evaluated, accuracy ranges from 28% to 89% on the pedagogical knowledge questions.

2. The paper analyzes the relationship between cost and accuracy and the progression of the Pareto frontier.

3. Online leaderboards enable interactive exploration of model performance.
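
To make the cost-versus-accuracy finding concrete, here is a minimal Python sketch of how a Pareto frontier over (cost, accuracy) pairs can be computed. It is an illustration under stated assumptions, not the authors' implementation: the `ModelResult` type, the `pareto_frontier` helper, the model names, and all numbers are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    cost: float      # hypothetical cost figure (e.g. USD per million tokens)
    accuracy: float  # fraction of benchmark questions answered correctly


def pareto_frontier(results):
    """Return the results that no cheaper-or-equal result matches or beats on accuracy."""
    # Sort by ascending cost; on equal cost, put the more accurate model first.
    ordered = sorted(results, key=lambda r: (r.cost, -r.accuracy))
    frontier = []
    best_accuracy = float("-inf")
    for r in ordered:
        # Keep a model only if it is strictly more accurate than
        # every model that costs the same or less.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Made-up numbers for illustration only; these are NOT results from the paper.
results = [
    ModelResult("model-a", cost=0.10, accuracy=0.55),
    ModelResult("model-b", cost=0.50, accuracy=0.52),  # dominated by model-a
    ModelResult("model-c", cost=1.00, accuracy=0.78),
    ModelResult("model-d", cost=5.00, accuracy=0.89),
    ModelResult("model-e", cost=8.00, accuracy=0.85),  # dominated by model-d
]

for r in pareto_frontier(results):
    print(f"{r.name}: cost={r.cost:.2f}, accuracy={r.accuracy:.0%}")
```

A model sits on the frontier when no other model is both at most as expensive and at least as accurate. Recomputing the frontier on the models available at successive dates is one way to chart how it progresses over time.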

Abstract

Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy, the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with…

Code & Models

Datasets

AI-for-Education/pedagogy-benchmark
dataset · 204 downloads