LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

Yungi Kim; Hyunsoo Ha; Seonghoon Yang; Sukyung Lee; Jihoo Kim; Chanjun Park

arXiv:2411.11289·cs.CL·November 19, 2024

Open Access

TL;DR

The LP Data Pipeline offers a CPU-based, efficient, and cost-effective framework for creating high-quality, purpose-driven datasets for large language models, reducing resource requirements and broadening accessibility.

Contribution

We introduce a novel CPU-only data pipeline that streamlines dataset creation for LLMs, enabling tailored, high-quality datasets without GPU reliance.

Findings

01. Reduces dataset preparation time and cost significantly.

02. Maintains high data quality comparable to GPU-based methods.

03. Enables creation of domain-specific and language-specific datasets.

Abstract

Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the…
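The abstract describes a CPU-only flow of extraction, filtering, and curation, but this page does not list the four core principles or any implementation detail. The following is therefore only a minimal illustrative sketch, not the authors' pipeline: it assumes a staged design in which inexpensive rule-based checks run before a purpose-specific filter. All function names, the word-count threshold, and the keyword-based purpose filter are hypothetical stand-ins chosen for illustration.

```python
import hashlib
import re
from typing import Iterable, Iterator, List


def heuristic_filter(docs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Hypothetical cheap quality check: drop documents with too few words."""
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Drop documents that match an earlier one after lowercasing and
    collapsing whitespace. Only exact matches are caught, not near-duplicates."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def purpose_filter(docs: Iterable[str], keywords: List[str]) -> Iterator[str]:
    """Hypothetical purpose-driven step: keep documents mentioning any keyword.
    A real pipeline would likely use a stronger CPU-friendly classifier."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for doc in docs:
        if pattern.search(doc):
            yield doc


def run_pipeline(docs: Iterable[str], keywords: List[str]) -> List[str]:
    """Chain the stages lazily so each document is touched once per stage."""
    return list(purpose_filter(exact_dedup(heuristic_filter(docs)), keywords))


if __name__ == "__main__":
    sample = [
        "too short",
        "The court ruled that the contract was void under state law.",
        "The court ruled that the  contract was void under STATE law.",
        "Our team enjoyed a long walk along the river this morning.",
    ]
    for kept in run_pipeline(sample, keywords=["court", "contract"]):
        print(kept)
```

Ordering the stages from cheapest to most selective means later steps see fewer documents, which is one plausible way to keep a CPU-only workflow affordable. Whether the LP Data Pipeline orders its steps this way is an assumption here.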

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling