RedStone: Curating General, Code, Math, and QA Data for Large Language Models
Yaoyao Chang, Lei Cui, Li Dong, Shaohan Huang, Yangyu Huang, Yupan Huang, Scarlett Li, Tengchao Lv, Shuming Ma, Qinzheng Sun, Wenhui Wang, Furu Wei, Ying Xin, Mao Yang, Qiufeng Yin, Xingxing Zhang

TL;DR
RedStone is a scalable pipeline that extracts diverse, high-quality datasets from Common Crawl, enabling efficient pre-training of large language models across general, code, math, and QA domains, thus broadening their capabilities.
Contribution
This work introduces RedStone, a novel pipeline that leverages Common Crawl for creating extensive, domain-specific pre-training datasets, reducing curation costs and enhancing LLM versatility.
Findings
Common Crawl is an effective source of diverse pre-training datasets.
RedStone enables easy adaptation to multiple domains.
Pre-training with RedStone improves LLM performance across tasks.
Abstract
Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields,…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
