LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Yungi Kim, Hyunsoo Ha, Seonghoon Yang, Sukyung Lee, Jihoo Kim, Chanjun Park

TL;DR
The LP Data Pipeline offers a CPU-based, efficient, and cost-effective framework for creating high-quality, purpose-driven datasets for large language models, reducing resource requirements and broadening accessibility.
Contribution
We introduce a novel CPU-only data pipeline that streamlines dataset creation for LLMs, enabling tailored, high-quality datasets without GPU reliance.
Findings
Reduces dataset preparation time and cost significantly.
Maintains high data quality comparable to GPU-based methods.
Enables creation of domain-specific and language-specific datasets.
Abstract
Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the…
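To make the idea of CPU-only filtering and purpose-driven curation concrete, here is a minimal sketch in Python using only the standard library. It is not the authors' implementation: the abstract does not specify the pipeline's filters or its four core principles, so every name below (`FilterConfig`, `passes_quality`, `matches_purpose`, `curate`), every threshold, and the keyword-based notion of "purpose" are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FilterConfig:
    # All thresholds are hypothetical; the paper's actual settings are not given here.
    min_words: int = 5               # drop very short documents
    max_symbol_ratio: float = 0.3    # drop documents dominated by punctuation/symbols
    keywords: tuple = ()             # assumed "purpose" signal: keep docs mentioning any keyword


def passes_quality(text: str, cfg: FilterConfig) -> bool:
    """Cheap rule-based quality check that needs no model and no GPU."""
    if len(text.split()) < cfg.min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= cfg.max_symbol_ratio


def matches_purpose(text: str, cfg: FilterConfig) -> bool:
    """Keep a document if it mentions any target keyword (no keywords = keep all)."""
    if not cfg.keywords:
        return True
    lowered = text.lower()
    return any(keyword in lowered for keyword in cfg.keywords)


def curate(docs, cfg: FilterConfig):
    """Exact-deduplicate, then apply the quality and purpose filters in order."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize whitespace and case so trivially different copies collapse.
        normalized = " ".join(doc.split()).lower()
        if normalized in seen:
            continue
        seen.add(normalized)
        if passes_quality(doc, cfg) and matches_purpose(doc, cfg):
            kept.append(doc)
    return kept
```

Calling `curate(docs, FilterConfig(keywords=("bank",)))` on a list of strings returns the deduplicated documents that pass both checks, in their original order. The design point the sketch tries to capture is ordering: inexpensive checks run first so that any costlier step sees as little data as possible, which is one plausible way a CPU-only pipeline keeps time and cost down.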
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
