
TL;DR
This paper introduces specialized LLMs built on sentence-level embeddings for scalable, reliable, and valid assessment of teaching quality. The models match or exceed human rater agreement with expert ratings, and their aggregate scores align with student learning outcomes.
Contribution
It presents a novel architecture using sentence-level embeddings for classroom transcripts, achieving super-human performance in teaching quality measurement.
Findings
Models reach human-level and even super-human correlation with expert ratings: above 0.65, surpassing the average human-human rater correlation.
Advanced models attribute scores to lesson-level features, not just isolated utterances.
Aggregate scores correlate with student learning measures, indicating external validity.
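The "super-human" criterion in the first finding can be illustrated with a minimal sketch: compare the model's average correlation with each human rater against the average pairwise correlation among the human raters. The use of Pearson correlation, the function names, and the toy ratings below are all assumptions for illustration; they are not taken from the paper and do not reproduce its data or results.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_human_human(ratings):
    """Average correlation over all pairs of human raters."""
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def mean_model_human(model_scores, ratings):
    """Average correlation between the model and each human rater."""
    return sum(pearson(model_scores, r) for r in ratings) / len(ratings)


# Hypothetical toy ratings (one list per rater, one score per lesson).
human_ratings = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [1, 3, 2, 5, 4],
]
model_scores = [1.2, 2.1, 2.9, 4.2, 4.8]  # hypothetical model outputs

baseline = mean_human_human(human_ratings)
model = mean_model_human(model_scores, human_ratings)
print(f"human-human: {baseline:.3f}, model-human: {model:.3f}")
print("exceeds human-human baseline:", model > baseline)
```

Under this reading, a model counts as super-human when its average agreement with the raters is higher than the raters' average agreement with one another.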
Abstract
Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited to the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance, with correlations with expert human ratings above 0.65 that surpass the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models, those better…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Topic Modeling
