Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora

Yungi Kim; Hyunsoo Ha; Sukyung Lee; Jihoo Kim; Seonghoon Yang; Chanjun Park

arXiv:2409.09613·cs.CL·September 17, 2024

TL;DR

This paper proposes an ensemble of two contrasting KenLM models trained on high- and low-quality data to improve filtering of web corpora, effectively reducing noise while maintaining quality with minimal computational cost.

Contribution

It introduces a novel ensemble approach combining good and bad KenLMs to enhance web data filtering for large language model training.

Findings

1. Significantly reduces noisy content in web corpora
2. Preserves high-quality data effectively
3. Operates with minimal computational overhead

Abstract

With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal…
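The ensemble idea can be illustrated with a minimal sketch. The abstract does not state how the two models' outputs are combined, so everything specific here is an assumption made for illustration: the z-score standardization, the weight `alpha`, the `threshold`, and the function names are all hypothetical. The scorers are passed in as plain callables that return a perplexity. With the `kenlm` Python bindings these could be the `perplexity` method of a loaded `kenlm.Model`, but any function mapping text to a perplexity works.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# A scorer maps a document to a perplexity (lower = more "in-domain").
PerplexityFn = Callable[[str], float]


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score a list of perplexities so the two models are comparable."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero on constant input
    return [(v - mu) / sigma for v in values]


def ensemble_scores(
    docs: Sequence[str],
    good_ppl: PerplexityFn,
    bad_ppl: PerplexityFn,
    alpha: float = 0.5,  # hypothetical weight, not taken from the paper
) -> List[float]:
    """Combine both models; a lower score suggests higher quality.

    Assumed rule: low perplexity under the Good model lowers the score,
    while low perplexity under the Bad model raises it.
    """
    good = standardize([good_ppl(d) for d in docs])
    bad = standardize([bad_ppl(d) for d in docs])
    return [alpha * g - (1.0 - alpha) * b for g, b in zip(good, bad)]


def filter_docs(
    docs: Sequence[str],
    good_ppl: PerplexityFn,
    bad_ppl: PerplexityFn,
    alpha: float = 0.5,
    threshold: float = 0.0,  # hypothetical cut-off on the combined score
) -> List[str]:
    """Keep documents whose combined score is at or below the threshold."""
    scores = ensemble_scores(docs, good_ppl, bad_ppl, alpha)
    return [d for d, s in zip(docs, scores) if s <= threshold]
```

Standardizing each model's perplexities before combining them keeps one model's scale from dominating the other. In practice the threshold would more likely be chosen from a held-out sample than fixed at zero.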

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Web Data Mining and Analysis