TL;DR
This study compares fine-tuned BERT classifiers with few-shot-prompted GPT-4 models (GPT-4o and GPT-4-Turbo) for assessing open responses in equity-focused tutor training, and finds BERT more effective and more resource-efficient on these nuanced tasks.
Contribution
The paper shows that BERT fine-tuned on human annotations outperforms few-shot GPT-4 prompting on complex, nuanced assessment tasks in equity training.
Findings
BERT outperforms GPT-4 in accuracy for open-response assessment.
Fine-tuning BERT is more resource-efficient than prompting GPT-4.
GPT-4 models struggle with nuanced, explanation-based responses.
Abstract
Assessing learners in ill-defined domains, such as scenario-based human tutoring training, is an area of limited research. Equity training requires a nuanced understanding of context, but do contemporary large language models (LLMs) have a knowledge base that can navigate these nuances? Legacy transformer models like BERT, in contrast, have less real-world knowledge but can be more easily fine-tuned than commercial LLMs. Here, we study whether fine-tuning BERT on human annotations outperforms state-of-the-art LLMs (GPT-4o and GPT-4-Turbo) with few-shot prompting and instruction. We evaluate performance on four prediction tasks involving generating and explaining open-ended responses in advocacy-focused training lessons in a higher education student population learning to become middle school tutors. Leveraging a dataset of 243 human-annotated open responses from tutor training lessons,…
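The few-shot prompting baseline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `LabeledResponse` type, the binary 0/1 rubric, the prompt layout, and the `accuracy` helper are hypothetical and are not taken from the paper, which does not publish its prompt format in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledResponse:
    """One human-annotated open response (hypothetical structure)."""
    text: str
    label: int  # assumed binary rubric: 1 = meets criterion, 0 = does not


def build_few_shot_prompt(instruction: str,
                          examples: List[LabeledResponse],
                          new_response: str) -> str:
    """Assemble an instruction, a few scored examples, and the response to score."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts.append(f"Response: {ex.text}")
        parts.append(f"Score: {ex.label}")
        parts.append("")
    # The final item is left unscored so the model completes the score.
    parts.append(f"Response: {new_response}")
    parts.append("Score:")
    return "\n".join(parts)


def accuracy(predicted: List[int], annotated: List[int]) -> float:
    """Fraction of predictions that match the human annotations."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("at least one annotation is required")
    matches = sum(p == a for p, a in zip(predicted, annotated))
    return matches / len(annotated)
```

The same `accuracy` helper could score either approach against the human labels, which is the kind of comparison the study reports; the fine-tuned BERT side would replace the prompt with a trained classifier and is not sketched here.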
