DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
Ranchi Zhao, Zhen Leng Thai, Yifan Zhang, Shengding Hu, Yunqi Ba, Jie Zhou, Jie Cai, Zhiyuan Liu, Maosong Sun

TL;DR
DecorateLM is a data engineering framework that improves large language model training by rating, tagging, and editing pretraining data, leading to enhanced model performance.
Contribution
The paper introduces DecorateLM, a novel method for refining large-scale pretraining corpora using a small language model trained for data quality enhancement.
Findings
High-quality data improves LLM performance
DecorateLM effectively enhances 100 billion tokens
Curated data boosts downstream model accuracy
Abstract
The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for…
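The three decoration steps named in the abstract (rating, tagging, editing) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `DecoratedSample` record, the `decorate` and `select` helpers, and the toy rater, tagger, and editor are all hypothetical names, and the toy functions are simple heuristics standing in for a trained decorator model. Selecting samples by a rating threshold is likewise an assumed downstream use, shown only to suggest why sample-level annotations might be useful.

```python
from dataclasses import dataclass, field


@dataclass
class DecoratedSample:
    """One corpus document plus its decorations (all field names are hypothetical)."""
    text: str
    ratings: dict = field(default_factory=dict)  # criterion name -> score
    tags: tuple = ()                             # hierarchical label path, coarse to fine
    edited_text: str = ""


def decorate(text, rater, tagger, editor):
    """Apply the three decoration steps to one document."""
    return DecoratedSample(
        text=text,
        ratings=rater(text),
        tags=tagger(text),
        edited_text=editor(text),
    )


def select(samples, criterion, threshold):
    """Keep samples whose score on `criterion` meets the threshold (assumed use)."""
    return [s for s in samples if s.ratings.get(criterion, 0.0) >= threshold]


# Toy stand-ins for a learned decorator model (assumptions, for illustration only).
def toy_rater(text):
    # Crude proxy: word count scaled into [0, 1].
    return {"toy_quality": min(len(text.split()) / 10.0, 1.0)}


def toy_tagger(text):
    # Two-level label path when a keyword is present, otherwise a single coarse label.
    return ("science", "physics") if "energy" in text.lower() else ("other",)


def toy_editor(text):
    # Normalize whitespace as a stand-in for rewriting into a more formal format.
    return " ".join(text.split())


sample = decorate("Energy   is conserved.", toy_rater, toy_tagger, toy_editor)
kept = select([sample], "toy_quality", 0.2)
```

Passing the rater, tagger, and editor in as plain callables keeps the sketch independent of any particular model; in practice each would presumably be backed by the small language model that the paper trains for this purpose.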
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
