AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge

Praneeth Vadlapati

arXiv:2406.19271·cs.CL·February 28, 2025

TL;DR

AutoPureData is a system that automatically filters web data to ensure only safe and relevant information updates large language models, improving their reliability and safety for various applications.

Contribution

The paper introduces AutoPureData, a novel system that automatically purifies web data for updating LLMs, achieving high accuracy in filtering unsafe and undesirable content.

Findings

01. Achieved 97% accuracy in removing unsafe text.

02. Achieved 86% accuracy in filtering undesirable text.

03. Enabled cross-lingual retrieval for updated LLM knowledge.

Abstract

Up-to-date and reliable language models are consistently sought after and are essential in many applications. Typically, a model is trained on a fixed dataset and then deployed globally, so its knowledge becomes outdated over time. Automatically updating a model's knowledge from web data raises significant safety and quality concerns, because unsafe and undesirable text is widespread on the web. The purity of new data is therefore essential when updating a language model's knowledge, so that the model remains reliable. This paper proposes AutoPureData, a system that automatically collects and purifies web data. The system loaded a sample of web data. Utilizing existing trusted AI models, it eliminated unsafe text with an accuracy of 97% and undesirable text with an accuracy of 86%, demonstrating the system's effectiveness in purifying the data. The…
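
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract only says that existing trusted AI models flag unsafe and undesirable text, so the `Flagger` type, the `purify` and `accuracy` functions, and the keyword-based `toy_unsafe` and `toy_undesirable` stand-ins are all hypothetical names introduced here for illustration. In a real setup, each stand-in would presumably wrap a call to a trusted classifier model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical classifier type: returns True when a text row should be flagged.
# In practice this would wrap a trusted AI model; here it is just a callable.
Flagger = Callable[[str], bool]


@dataclass
class FilterResult:
    kept: list[str]                  # rows that passed both checks
    removed: list[tuple[str, str]]   # (row, reason) for every rejected row


def purify(rows: Iterable[str], is_unsafe: Flagger, is_undesirable: Flagger) -> FilterResult:
    """Drop rows flagged as unsafe first, then rows flagged as undesirable."""
    kept: list[str] = []
    removed: list[tuple[str, str]] = []
    for text in rows:
        if is_unsafe(text):
            removed.append((text, "unsafe"))
        elif is_undesirable(text):
            removed.append((text, "undesirable"))
        else:
            kept.append(text)
    return FilterResult(kept, removed)


def accuracy(predicted: list[bool], expected: list[bool]) -> float:
    """Fraction of rows where the flag matches a human-assigned label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Toy keyword stand-ins (assumptions for this sketch, not real safety models).
def toy_unsafe(text: str) -> bool:
    return "attack" in text.lower()


def toy_undesirable(text: str) -> bool:
    return "buy now" in text.lower()


sample = [
    "The city council approved a new budget on Monday.",
    "BUY NOW and win a free prize!!!",
    "Step-by-step guide to attack a power grid.",
]
result = purify(sample, toy_unsafe, toy_undesirable)
```

Checking for unsafe text before undesirable text is a design choice of this sketch, so that a row matching both categories is recorded under the more serious reason. The `accuracy` helper shows one plausible way a figure such as the reported 97% could be computed, by comparing the flags against manually reviewed labels; the source does not state the exact evaluation procedure.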

Code & Models

Repositories

Pro-GenAI/AutoPureData (Official)

Taxonomy

Topics: Computational Physics and Python Applications · Advanced Computational Techniques and Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Linear Warmup With Linear Decay · Linear Layer · BART · Layer Normalization · Attention Dropout · Residual Connection · WordPiece