Benchmarking the Pedagogical Knowledge of Large Language Models
Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod

TL;DR
This paper introduces a new benchmark dataset to evaluate large language models' understanding of pedagogy, addressing a gap in existing knowledge assessments by focusing on teaching strategies and educational needs.
Contribution
It presents The Pedagogy Benchmark, a novel dataset for assessing models' pedagogical knowledge, sourced from professional exams, with comprehensive methodology and online leaderboards.
Findings
Models' accuracy ranges from 28% to 89% on pedagogical questions.
The paper analyzes cost versus accuracy across models and the progression of the Pareto frontier.
Online leaderboards enable interactive exploration of model performance.
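As an illustration of the cost-versus-accuracy finding above, here is a minimal sketch of how a Pareto frontier can be extracted from a set of model results. The model names, costs, and accuracies are invented for the example and are not taken from the paper; the `ModelResult` type and `pareto_frontier` helper are hypothetical, and the paper may define cost and the frontier differently.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    cost: float      # hypothetical cost unit (e.g. USD per million tokens)
    accuracy: float  # fraction of benchmark questions answered correctly


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Return the models that no cheaper-or-equal model matches or beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, the more accurate model comes first.
    for r in sorted(results, key=lambda r: (r.cost, -r.accuracy)):
        # A model joins the frontier only if it improves on every cheaper model's accuracy.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Invented data for illustration only (not results from the paper).
results = [
    ModelResult("model-a", 0.10, 0.55),
    ModelResult("model-b", 0.50, 0.52),
    ModelResult("model-c", 1.00, 0.74),
    ModelResult("model-d", 5.00, 0.89),
    ModelResult("model-e", 8.00, 0.80),
]

frontier = pareto_frontier(results)
print([r.name for r in frontier])  # ['model-a', 'model-c', 'model-d']
```

In this toy data, model-b and model-e fall off the frontier because a cheaper model already achieves higher accuracy.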
Abstract
Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with…
