Automated Data Curation for Robust Language Model Fine-Tuning
Jiuhai Chen, Jonas Mueller

TL;DR
This paper presents CLEAR, an automated data curation pipeline that uses confidence estimates to filter or correct low-quality training data, improving fine-tuned language models without extra fine-tuning.
Contribution
The paper introduces CLEAR, a novel, data-centric framework for dataset curation that enhances LLM fine-tuning without requiring stronger models or additional training.
Findings
CLEAR improves model performance across multiple datasets and models
It effectively filters and corrects low-quality training data
The approach does not require access to stronger LLMs for data curation
Abstract
Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses. Supervised fine-tuning specializes an LLM by training it on a dataset of example prompts with target responses, but real-world data tends to be noisy. While many fine-tuning algorithms exist, here we consider a data-centric AI perspective on LLM fine-tuning, studying how to systematically curate the training dataset to improve the LLM produced via any fine-tuning algorithm. We introduce an automated data curation pipeline CLEAR (Confidence-based LLM Evaluation And Rectification) for instruction tuning datasets that can be used with any LLM and fine-tuning procedure. CLEAR estimates which training data is low-quality and either filters or…
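The filter-or-correct idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` fields, the `curate` function, and both thresholds are hypothetical, and the confidence scores are assumed to come from some external estimator that the excerpt above does not specify. The sketch keeps a prompt/response pair when its confidence is high enough, swaps in a candidate response when one is available and confidently scored, and drops the pair otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    prompt: str
    response: str
    confidence: float  # assumed: score in [0, 1] from some confidence estimator
    candidate: Optional[str] = None    # hypothetical alternative response
    candidate_confidence: float = 0.0  # assumed: confidence score of the candidate


def curate(
    examples: List[Example],
    filter_threshold: float = 0.5,   # hypothetical value, not from the paper
    correct_threshold: float = 0.8,  # hypothetical value, not from the paper
) -> List[Example]:
    """Keep confident examples, correct the rest when possible, drop the remainder."""
    curated = []
    for ex in examples:
        if ex.confidence >= filter_threshold:
            # The original target response looks trustworthy: keep it unchanged.
            curated.append(ex)
        elif ex.candidate is not None and ex.candidate_confidence >= correct_threshold:
            # Low-confidence target, but a confident replacement exists: correct it.
            curated.append(Example(ex.prompt, ex.candidate, ex.candidate_confidence))
        # Otherwise the example is filtered out of the training set.
    return curated


if __name__ == "__main__":
    data = [
        Example("Q1", "good answer", 0.9),
        Example("Q2", "noisy answer", 0.2, candidate="fixed answer", candidate_confidence=0.95),
        Example("Q3", "bad answer", 0.1),
    ]
    for ex in curate(data):
        print(ex.prompt, "->", ex.response)
```

The curated list can then be passed to any fine-tuning procedure, which matches the abstract's point that curation is independent of the training algorithm.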
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Softmax
