Leveraging Web-Crawled Data for High-Quality Fine-Tuning
Jing Zhou, Chenglin Jiang, Wei Shen, Xiao Zhou, Xiaonan He

TL;DR
This paper presents a method to convert web-crawled data into high-quality training data for fine-tuning language models, improving performance especially in domain-specific tasks like Chinese math problems.
Contribution
It introduces an automatic dataset alignment technique that leverages web data for effective fine-tuning, outperforming models trained on only high-quality data.
Findings
A model trained on transformed web data outperforms one trained on high-quality data alone.
Training on the transformed data raises the average score on Chinese math problems by 9.4%.
The 7B model surpasses larger open-source models and some closed-source models.
Abstract
Most large language models are fine-tuned using either expensive human-annotated data or GPT-4 generated data which cannot guarantee performance in certain domains. We argue that although the web-crawled data often has formatting errors causing semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality ones. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% in Chinese math problems. Additionally, our 7B model outperforms…
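The alignment step described in the abstract can be illustrated with a small sketch. The abstract does not say how web samples are matched to high-quality samples, so the string-similarity matching, the `build_pairs` helper, and the `threshold` value below are assumptions made purely for illustration, not the authors' method. Each resulting pair (noisy web text as input, clean text as target) is the kind of example a rewriting model could then be trained on.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()


def build_pairs(web_samples, clean_samples, threshold=0.6):
    """Pair each web sample with its closest clean sample.

    A pair is kept only when the best similarity reaches `threshold`
    (a hypothetical cutoff, not a value taken from the paper).
    """
    pairs = []
    for raw in web_samples:
        # Score this web sample against every clean sample.
        scored = [(similarity(raw, clean), clean) for clean in clean_samples]
        if not scored:
            continue
        best_score, best_clean = max(scored, key=lambda item: item[0])
        if best_score >= threshold:
            pairs.append({"input": raw, "target": best_clean})
    return pairs


# Toy data: one badly formatted web sample and one unrelated sample.
web = ["Q:  3+5= ?   A 8", "totally unrelated text about weather"]
clean = ["Q: 3 + 5 = ?\nA: 8", "Q: 12 / 4 = ?\nA: 3"]

pairs = build_pairs(web, clean)
for pair in pairs:
    print(repr(pair["input"]), "->", repr(pair["target"]))
```

In this toy run, only the badly formatted arithmetic sample finds a close enough clean counterpart; the unrelated sample is dropped. A real pipeline would likely need a more robust matching signal than raw character overlap.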
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Web Data Mining and Analysis
Methods: Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · Cosine Annealing · Weight Decay · Adam
