Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading
Joel Walsh, Siddarth Mamidanna, Benjamin Nye, Mark Core, and Daniel Auerbach

TL;DR
This paper empirically compares fine-tuning and prompt engineering methods for automated short answer grading using LLMs, revealing that fine-tuning can outperform few-shot prompting in certain models and conditions.
Contribution
It evaluates the effectiveness of fine-tuning versus few-shot prompting for short answer grading, especially with open-weight models and synthetic data augmentation.
Findings
Fine-tuning has limited benefits for Llama models with small data.
Fine-tuning can outperform few-shot prompting in OpenAI's models.
Synthetic data significantly improves Llama 3.1 8B-Instruct performance.
Abstract
Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and no- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI's fine-tuning service promise results with as few as 100 examples, while methods using open weights such as quantized low-rank adaptive (QLORA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that finetuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
