A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

Michał Perełkiewicz; Rafał Poświata

arXiv:2407.07630 · cs.CL · July 11, 2024 · 1 citation

Open Access

TL;DR

This paper reviews the significant challenges in using large web-mined data for training language models, emphasizing issues like noise, duplication, bias, and privacy, and discusses current mitigation strategies and future research directions.

Contribution

It provides a comprehensive overview of challenges and gaps in data quality and ethics in web-mined corpora for LLM pre-training, guiding future improvements.

Findings

01. Identification of key challenges such as noise and bias

02. Analysis of current data cleaning and bias mitigation methods

03. Highlighting gaps and proposing future research directions

Abstract

This article presents a comprehensive review of the challenges associated with using massive web-mined corpora for the pre-training of large language models (LLMs). The review identifies key challenges in this domain, including noise (irrelevant or misleading information), duplication of content, the presence of low-quality or incorrect information, biases, and the inclusion of sensitive or personal information in web-mined corpora. Addressing these issues is crucial for the development of accurate, reliable, and ethically responsible language models. Through an examination of current methodologies for data cleaning, pre-processing, and bias detection and mitigation, we highlight the gaps in existing approaches and suggest directions for future research. Our discussion aims to catalyze advancements in developing more sophisticated and ethically responsible LLMs.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling