Analyzing Large Language Models for Classroom Discussion Assessment
Nhat Tran, Benjamin Pierce, Diane Litman, Richard Correnti, Lindsay Clare Matsumura

TL;DR
This paper evaluates how large language models can assess classroom discussions, analyzing the impact of task formulation, context length, and few-shot examples on performance, and balancing accuracy with efficiency and consistency.
Contribution
It provides an empirical analysis of factors influencing LLM-based assessment performance and offers recommendations for effective, efficient, and consistent evaluation methods.
Findings
Task formulation affects assessment accuracy.
Context length influences model performance.
Consistency correlates with predictive accuracy.
Abstract
Automatically assessing classroom discussion quality is becoming increasingly feasible with the help of new NLP advancements such as large language models (LLMs). In this work, we examine how the assessment performance of 2 LLMs interacts with 3 factors that may affect performance: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the 2 LLMs. Our results suggest that the 3 aforementioned factors do affect the performance of the tested LLMs and there is a relation between consistency and performance. We recommend an LLM-based assessment approach that has a good balance in terms of predictive performance, computational efficiency, and consistency.
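To make the relation between predictive consistency and performance concrete, here is a minimal sketch of one way the two quantities could be computed when the same prompt is run several times per discussion. The abstract does not specify the paper's metrics, so the definitions (majority-vote agreement for consistency, majority-vote accuracy for performance), the function names, and the sample scores below are all illustrative assumptions rather than the authors' method.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label among repeated runs for one item."""
    return Counter(labels).most_common(1)[0][0]


def consistency(runs_per_item):
    """Mean fraction of runs that agree with each item's majority label.

    Assumed definition for illustration; 1.0 means every run agreed.
    """
    fractions = []
    for labels in runs_per_item:
        top_count = Counter(labels).most_common(1)[0][1]
        fractions.append(top_count / len(labels))
    return sum(fractions) / len(fractions)


def accuracy(runs_per_item, gold):
    """Accuracy of the majority-vote prediction against human (gold) scores."""
    correct = sum(
        majority_label(labels) == g for labels, g in zip(runs_per_item, gold)
    )
    return correct / len(gold)


# Hypothetical data: 4 discussions, each scored 4 times by the same LLM prompt.
runs = [[3, 3, 3, 3], [2, 3, 2, 2], [1, 1, 4, 1], [2, 2, 2, 2]]
gold = [3, 2, 4, 2]

print(f"consistency={consistency(runs):.3f}")
print(f"accuracy={accuracy(runs, gold):.3f}")
```

Computing both numbers per configuration (task formulation, context length, number of few-shot examples) would allow them to be compared across settings; a chance-corrected agreement statistic could be substituted for the simple agreement fraction used here.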
Taxonomy
Topics: Educational Technology and Assessment
