Leveraging Web-Crawled Data for High-Quality Fine-Tuning

Jing Zhou; Chenglin Jiang; Wei Shen; Xiao Zhou; Xiaonan He

arXiv:2408.08003·cs.CL·August 16, 2024

TL;DR

This paper presents a method for converting noisy web-crawled data into high-quality training data for supervised fine-tuning of language models, improving performance on domain-specific tasks such as Chinese math problems.

Contribution

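The paper introduces an automatic alignment technique that pairs web-crawled data with a smaller set of high-quality data. A language model trained on these pairs converts irregularly formatted web data into high-quality data, and fine-tuning on the converted data outperforms fine-tuning on the high-quality data alone.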

Findings

01

A model trained on transformed web data outperforms one trained on high-quality data only.

02

Training with model-transformed data scores 9.4% higher on average on Chinese math problems than training with high-quality data alone.

03

The 7B model surpasses larger open-source models and some closed-source models.

Abstract

Most large language models are fine-tuned using either expensive human-annotated data or GPT-4 generated data which cannot guarantee performance in certain domains. We argue that although the web-crawled data often has formatting errors causing semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality ones. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% in Chinese math problems. Additionally, our 7B model outperforms…
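
The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each web-crawled item is paired with its closest high-quality counterpart by fuzzy string matching; the abstract does not specify the matching procedure, so the function names, the 0.8 threshold, and the sample strings below are illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Drop all whitespace and lowercase, so formatting noise matters less."""
    return "".join(text.split()).lower()


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def align_pairs(web_items, clean_items, threshold=0.8):
    """Pair each web item with its most similar clean item.

    Hypothetical helper: a pair is kept only if the best match reaches
    `threshold` (an assumed value, not taken from the paper).
    """
    pairs = []
    for web in web_items:
        best = max(clean_items, key=lambda c: similarity(web, c), default=None)
        if best is not None and similarity(web, best) >= threshold:
            # The noisy text is the model input, the clean text the target.
            pairs.append({"input": web, "target": best})
    return pairs


# Made-up sample data for illustration only.
web = ["Q: 3 +  4 = ?   A: 7", "totally unrelated text about cooking"]
clean = ["Q: 3 + 4 = ? A: 7", "Q: 12 / 4 = ? A: 3"]
pairs = align_pairs(web, clean)
```

Under these assumptions, the resulting `pairs` list would serve as supervised examples for a model that rewrites irregular web text into the clean format. Web items with no sufficiently similar clean counterpart are left out of the paired set.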

Code & Models

Repositories

zhouj8553/Web_to_SFT (PyTorch, official)

Videos

Leveraging Web-Crawled Data for High-Quality Fine-Tuning · underline

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Web Data Mining and Analysis

Methods: Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · Cosine Annealing · Weight Decay · Adam