A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training
Michał Perełkiewicz, Rafał Poświata

TL;DR
This paper reviews the significant challenges in using large web-mined data for training language models, emphasizing issues like noise, duplication, bias, and privacy, and discusses current mitigation strategies and future research directions.
Contribution
It provides a comprehensive overview of challenges and gaps in data quality and ethics in web-mined corpora for LLM pre-training, guiding future improvements.
Findings
Identification of key challenges such as noise and bias
Analysis of current data cleaning and bias mitigation methods
Highlighting gaps and proposing future research directions
Abstract
This article presents a comprehensive review of the challenges associated with using massive web-mined corpora for the pre-training of large language models (LLMs). The review identifies key challenges in this domain, including noise (irrelevant or misleading information), duplication of content, the presence of low-quality or incorrect information, biases, and the inclusion of sensitive or personal information in web-mined corpora. Addressing these issues is crucial for the development of accurate, reliable, and ethically responsible language models. Through an examination of current methodologies for data cleaning, pre-processing, and bias detection and mitigation, we highlight the gaps in existing approaches and suggest directions for future research. Our discussion aims to catalyze advancements in developing more sophisticated and ethically responsible LLMs.
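To make three of the challenges named in the abstract concrete (noise, duplicated content, and personal information), here is a minimal sketch of a cleaning pass. It is an illustration under stated assumptions, not the method of the paper or of any particular pipeline: the function name `clean_corpus`, the `min_words` threshold, the `[EMAIL]` placeholder, and the simple e-mail pattern are all hypothetical choices. Only exact duplicates (after lowercasing and whitespace normalization) are detected; near-duplicate detection and bias mitigation are out of scope.

```python
import hashlib
import re

# Assumption: a deliberately simple e-mail pattern, used only for illustration.
# Real pipelines need far broader detection of personal information.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates, then mask e-mails.

    `min_words` is a hypothetical noise threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Crude noise filter: skip documents with too few words.
        if len(norm.split()) < min_words:
            continue
        # Exact-duplicate detection via a hash of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Mask e-mail addresses in the retained (original-case) document.
        kept.append(EMAIL_RE.sub("[EMAIL]", doc))
    return kept


if __name__ == "__main__":
    sample = [
        "Contact me at jane@example.com for the full data set.",
        "contact me at  JANE@example.com for the full data set.",
        "Click here",
        "Large language models are trained on web text.",
    ]
    for line in clean_corpus(sample):
        print(line)
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total length, which is one reason hash-based exact deduplication is a common first step before more expensive near-duplicate methods.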
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
