DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models

Ranchi Zhao; Zhen Leng Thai; Yifan Zhang; Shengding Hu; Yunqi Ba; Jie Zhou; Jie Cai; Zhiyuan Liu; Maosong Sun

arXiv:2410.05639·cs.CL·October 11, 2024

TL;DR

DecorateLM is a data engineering framework that improves large language model training by rating, tagging, and editing pretraining data, leading to enhanced model performance.

Contribution

The paper introduces DecorateLM, a novel method for refining large-scale pretraining corpora using a small language model trained for data quality enhancement.

Findings

01 High-quality data improves LLM performance

02 DecorateLM effectively enhances 100 billion tokens

03 Curated data boosts downstream model accuracy

Abstract

The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for…
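
The three operations named in the abstract (rating, tagging, editing) can be pictured as one record attached to each text sample. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Decorated` record, the `decorate` and `select` helpers, the `toy_quality` criterion, and the three toy functions are all hypothetical stand-ins for the trained decorating model, whose actual criteria, label hierarchy, and editing behavior are not given in this listing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Decorated:
    """One corpus sample plus its (hypothetical) decorations."""
    text: str
    ratings: Dict[str, float] = field(default_factory=dict)  # criterion -> score
    tags: Tuple[str, ...] = ()                                # coarse-to-fine label path
    edited: Optional[str] = None                              # rewritten text, if any


def decorate(
    text: str,
    rate: Callable[[str], Dict[str, float]],
    tag: Callable[[str], Tuple[str, ...]],
    edit: Callable[[str], str],
) -> Decorated:
    """Apply a rater, a tagger, and an editor to a single text."""
    return Decorated(text=text, ratings=rate(text), tags=tag(text), edited=edit(text))


def select(samples: List[Decorated], criterion: str, threshold: float) -> List[Decorated]:
    """Keep samples whose score on `criterion` is at least `threshold`."""
    return [s for s in samples if s.ratings.get(criterion, 0.0) >= threshold]


# Toy stand-ins (assumptions): a real system would call a trained model here.
def toy_rate(text: str) -> Dict[str, float]:
    # Score grows with word count, capped at 1.0; purely illustrative.
    return {"toy_quality": min(len(text.split()), 10) / 10}


def toy_tag(text: str) -> Tuple[str, ...]:
    # Two-level label path chosen by a keyword match.
    return ("science", "biology") if "cell" in text.lower() else ("other",)


def toy_edit(text: str) -> str:
    # "Editing" here only collapses runs of whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = decorate("The  cell   divides.", toy_rate, toy_tag, toy_edit)
    print(sample)
    print(select([sample], "toy_quality", 0.2))
```

Keeping the rater, tagger, and editor as plain callables means a small model could be swapped in for the toy functions without changing the record or the selection step.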

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies