Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

Chanwoo Park; Suyoung Park; Yelim Ahn; Jongmin Kim; Jongyeon Park; Jaejin Lee

arXiv:2510.24139·cs.CL·October 29, 2025

TL;DR

This paper introduces pattern-aware filtering methods for pretraining data that improve language model performance by better retaining valuable content during data cleaning.

Contribution

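It proposes two filtering techniques, pattern-aware line-level deduplication (PLD) and pattern-aware trailing-punctuation filtering (PTF), which take into account how line-level signals are distributed sequentially across documents in order to preserve structurally important content that conventional line-level filtering often removes.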

Findings

01  Improved performance on multiple-choice benchmarks.

02  Enhanced generative question-answering accuracy on SQuAD v1 and KorQuAD v1.

03  Consistent gains across English and Korean language models.

Abstract

Traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, but these basic methods can discard valuable content and hurt downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing-punctuation filtering (PTF), that enhance these conventional filtering techniques. Our approach considers not only line-level signals but also their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate the proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance…
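For orientation, the two conventional baselines named in the abstract can be sketched roughly as follows. This is a minimal illustration of plain line-level filtering, not the paper's PLD or PTF, whose details are not given here. The function names, the set of terminal punctuation marks, and the `max_count` threshold are all assumptions chosen for the example.

```python
from collections import Counter

# Assumption: which characters count as "terminal punctuation" is a choice
# made for this sketch; real pipelines vary.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def trailing_punctuation_filter(doc: str) -> str:
    """Keep only the lines of a document that end in terminal punctuation."""
    kept = [
        line
        for line in doc.splitlines()
        if line.rstrip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


def line_level_dedup(docs: list[str], max_count: int = 1) -> list[str]:
    """Drop every line that occurs more than max_count times in the corpus.

    Assumption: all copies of a frequent line are removed; some pipelines
    keep the first occurrence instead.
    """
    counts = Counter(
        line.strip() for doc in docs for line in doc.splitlines() if line.strip()
    )
    cleaned = []
    for doc in docs:
        kept = [
            line
            for line in doc.splitlines()
            if line.strip() and counts[line.strip()] <= max_count
        ]
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    # The title line has no trailing punctuation, so the baseline filter drops it.
    print(trailing_punctuation_filter("How to brew tea\nBoil the water.\nSteep for three minutes."))
    # The navigation line repeated across documents is removed everywhere.
    print(line_level_dedup(["Home | About\nTea is a drink.", "Home | About\nCoffee is a drink."]))
```

In the first call the title "How to brew tea" is removed along with any real noise, which is the kind of structurally useful content the abstract says line-level rules can lose because they judge each line in isolation.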
