Leveraging LLMs to Create Content Corpora for Niche Domains
Franklin Zhang, Sonya Zhang, Alon Halevy

TL;DR
This paper presents a novel framework using Large Language Models to efficiently create high-quality, domain-specific content corpora from web data, demonstrated through a habit formation application with positive user feedback.
Contribution
It introduces a scalable LLM-based data curation pipeline for niche domains, including content extraction, filtering, and deduplication, validated in the behavior education domain.
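To make the deduplication stage concrete, here is a minimal sketch of near-duplicate removal over extracted challenge texts. It is not the authors' 30DayGen implementation: the paper describes LLM-enhanced semantic deduplication, whereas this sketch swaps in plain word-level Jaccard similarity as a stand-in. The function names, the sample challenges, and the 0.6 threshold are all hypothetical.

```python
import re


def tokens(text: str) -> frozenset[str]:
    """Lowercased word tokens; a crude stand-in for a semantic representation."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(items: list[str], threshold: float = 0.6) -> list[str]:
    """Keep each item only if it is not too similar to any item kept so far.

    The threshold is an assumed value for illustration, not one from the paper.
    """
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for item in items:
        t = tokens(item)
        if all(jaccard(t, k) < threshold for k in kept_tokens):
            kept.append(item)
            kept_tokens.append(t)
    return kept


# Hypothetical sample data, not taken from the paper's corpus.
challenges = [
    "Drink eight glasses of water every day",
    "Drink eight glasses of water every single day",
    "Read for twenty minutes before bed",
]
unique = deduplicate(challenges)
```

In this sample the second challenge differs from the first by one word, so it is dropped, while the unrelated third challenge is kept. A semantic variant would replace `tokens` and `jaccard` with an embedding model and cosine similarity, which would also catch paraphrases that share few words.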
Findings
Extracted 3,531 unique 30-day challenges from over 15K webpages
Achieved a user satisfaction score of 4.3 out of 5
91% of users were willing to use the curated content
Abstract
Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5,…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing
