Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading

Joel Walsh, Siddarth Mamidanna, Benjamin Nye, Mark Core, and Daniel Auerbach

arXiv:2508.04063 · cs.LG · August 7, 2025

TL;DR

This paper empirically compares fine-tuning and prompt engineering methods for automated short answer grading using LLMs, revealing that fine-tuning can outperform few-shot prompting in certain models and conditions.

Contribution

It evaluates the effectiveness of fine-tuning versus few-shot prompting for short answer grading, especially with open-weight models and synthetic data augmentation.

Findings

1. Fine-tuning has limited benefits for Llama models with small data.

2. Fine-tuning can outperform few-shot prompting in OpenAI's models.

3. Synthetic data significantly improves Llama 3.1 8B-Instruct performance.

Abstract

Research to improve Automated Short Answer Grading (ASAG) has recently focused on Large Language Models (LLMs), using prompt engineering and zero- or few-shot prompting to achieve the best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI's fine-tuning service promise results with as few as 100 examples, while open-weight methods such as quantized low-rank adaptation (QLoRA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for ASAG with structured (JSON) outputs. Our results show that fine-tuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot…
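To make the setup in the abstract concrete, the following is a minimal sketch of few-shot prompting for short answer grading with a structured (JSON) output. It is an illustration only, not the authors' prompt or schema: the `grade` field, the label set, the example answers, and the helper names `build_prompt` and `parse_grade` are all assumptions. No model is called; the sketch only builds the prompt text and validates a response string.

```python
import json

# Hypothetical graded examples, invented for illustration (not from the paper).
FEW_SHOT_EXAMPLES = [
    {"answer": "Plants use sunlight to make sugar from CO2 and water.", "grade": "correct"},
    {"answer": "Plants eat soil to grow.", "grade": "incorrect"},
]

# Assumed label set; the paper's actual output schema is not given in this abstract.
ALLOWED_GRADES = {"correct", "incorrect"}


def build_prompt(question, student_answer, examples):
    """Assemble a few-shot grading prompt; pass an empty list for zero-shot."""
    lines = [
        "Grade the student answer to the question.",
        'Respond only with JSON of the form {"grade": "correct" | "incorrect"}.',
        f"Question: {question}",
    ]
    for ex in examples:
        lines.append(f"Answer: {ex['answer']}")
        lines.append("Output: " + json.dumps({"grade": ex["grade"]}))
    # The final, ungraded answer is left for the model to complete.
    lines.append(f"Answer: {student_answer}")
    lines.append("Output:")
    return "\n".join(lines)


def parse_grade(raw_output):
    """Return the grade from a model response, or None if it is malformed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    grade = data.get("grade")
    if isinstance(grade, str) and grade in ALLOWED_GRADES:
        return grade
    return None
```

In a comparison like the one described, the same prompt builder could be used with and without examples, for both a base and a fine-tuned model, with `parse_grade` treating malformed JSON as a failed response rather than guessing a label.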
