Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Jun Zhao, Shengping Liu

TL;DR
Oasis is an integrated platform for data curation and assessment tailored for pretraining large language models, enabling iterative optimization and quality control through interactive modules and comprehensive evaluation methods.
Contribution
The paper introduces Oasis, a novel system combining customizable curation and multi-faceted assessment tools for large-scale language model data preparation.
Findings
Curated an 800GB bilingual corpus using Oasis.
Demonstrated improved data quality through iterative curation.
Provided a publicly available dataset for benchmarking.
Abstract
Data is one of the most critical elements in building a large language model. However, existing systems either fail to customize a corpus curation pipeline or neglect to leverage comprehensive corpus assessment for iterative optimization of the curation. To this end, we present a pretraining corpus curation and assessment platform called Oasis -- a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module can devise customized rules according to explicit feedback. The debiased neural filter module builds the quality classification dataset in a negative-centric manner to remove the undesired bias. The adaptive document deduplication module could execute large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. And in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Quality and Management · Natural Language Processing Techniques · Topic Modeling
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Dropout · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing
