Measuring Teaching with LLMs

Michael Hardy

arXiv:2510.22968·cs.CL·November 7, 2025

TL;DR

This paper introduces specialized LLMs built on sentence-level embeddings for scalable, reliable, and valid assessment of teaching quality. The models' correlation with expert ratings surpasses the average human-human rater correlation, and their aggregate scores align with student learning outcomes.

Contribution

It presents an architecture that uses sentence-level embeddings of classroom transcripts and reaches human-level and, in some cases, super-human agreement with expert ratings of teaching quality.

Findings

01

Models achieve human-level and super-human correlation with expert ratings.

02

Advanced models attribute scores to lesson-level features, not just isolated utterances.

03

Aggregate scores correlate with student learning measures, indicating external validity.

Abstract

Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance, reaching correlations above 0.65 with expert human ratings and surpassing the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models, those better…
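The comparison the abstract describes, model-human correlation against the average human-human rater correlation, can be illustrated with a small sketch. It assumes Pearson correlation and a simple average over rater pairs; the abstract does not state the exact statistic or aggregation used, and the function names and toy scores below are hypothetical, not the paper's data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_human_human(raters):
    """Average correlation over all distinct pairs of human raters."""
    pairs = list(combinations(raters, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def mean_model_human(model_scores, raters):
    """Average correlation between the model and each human rater."""
    return sum(pearson(model_scores, r) for r in raters) / len(raters)


if __name__ == "__main__":
    # Hypothetical scores for six lesson segments (illustration only).
    raters = [
        [3, 4, 2, 5, 3, 4],
        [2, 4, 3, 5, 2, 4],
        [3, 5, 2, 4, 3, 3],
    ]
    model_scores = [3, 4, 2, 5, 3, 4]

    human_baseline = mean_human_human(raters)
    model_agreement = mean_model_human(model_scores, raters)
    print(f"human-human: {human_baseline:.3f}")
    print(f"model-human: {model_agreement:.3f}")
    print("model exceeds human baseline:", model_agreement > human_baseline)
```

Under this reading, "super-human" means the model agrees with expert raters more closely than the experts agree with one another on average; other agreement statistics would change the numbers but not the shape of the comparison.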

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Topic Modeling