TL;DR
This paper presents a method for fine-tuning a 1.7B parameter quantized Small Language Model to serve as a reliable, open-source evaluator and annotator, outperforming proprietary models in alignment and reproducibility.
Contribution
It introduces a novel fine-tuning pipeline for small, quantized LLMs that improves alignment with human annotations and offers a reproducible, privacy-preserving alternative to proprietary models.
Findings
Achieved a 0.23-point increase in Krippendorff's α over the best-performing state-of-the-art proprietary LLM.
Demonstrated the approach's effectiveness on a separate emotion classification task.
Provided an open-source implementation at https://github.com/jylee-k/slm-judge.
Abstract
As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and annotation. However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lack reproducibility, and raise data privacy concerns. Our work examines the viability of fine-tuning a quantized Small Language Model with 1.7B parameters on limited human-annotated data to serve as a highly aligned, deterministic evaluator and annotator. By implementing a custom, multi-dimensional rubric framework and simple augmentation and regularization techniques, the proposed approach achieves higher inter-annotator agreement (a 0.23-point increase in Krippendorff's α) than the best-performing state-of-the-art proprietary LLM. We…
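For readers unfamiliar with the agreement metric quoted above, here is a minimal sketch of how Krippendorff's α can be computed. It assumes nominal (unordered) labels. This summary does not say which variant of α the paper uses, and rubric scores may well call for an ordinal or interval distance instead. The function name and the toy data are hypothetical and are not taken from the paper's repository.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    annotators gave to one item. Use None for a missing rating.
    Returns NaN when alpha is undefined (no expected disagreement).
    """
    # Coincidence matrix: each ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings on that item.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    # Marginal totals per category and the overall number of pairable ratings.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (nominal distance: 1 if labels differ).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)

    if expected == 0:
        return float("nan")  # e.g. only one category ever used
    return 1.0 - observed / expected


# Toy example (made-up data): two annotators, four items, one disagreement.
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

An α of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported 0.23-point gain is a difference on this scale.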
