AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge
Praneeth Vadlapati

TL;DR
AutoPureData is a system that automatically filters web data to ensure only safe and relevant information updates large language models, improving their reliability and safety for various applications.
Contribution
The paper introduces AutoPureData, a system that automatically purifies web data for updating LLMs, removing unsafe text with 97% accuracy and undesirable text with 86% accuracy.
Findings
Achieved 97% accuracy in removing unsafe text.
Achieved 86% accuracy in filtering undesirable text.
Enabled cross-lingual retrieval for updated LLM knowledge.
Abstract
Up-to-date and reliable language models are in constant demand and are essential in many applications. Typically, models are trained on a fixed dataset and then deployed globally, so their knowledge gradually becomes outdated. Automatically updating AI knowledge from web data raises significant concerns about model safety and quality, because unsafe and undesirable text is widespread on the web. The purity of new data is essential for updating the knowledge of language models while maintaining their reliability. This paper proposes AutoPureData, a system that automatically collects and purifies web data. The system loaded a sample of web data. Utilizing existing trusted AI models, it eliminated unsafe text with an accuracy of 97% and undesirable text with an accuracy of 86%, demonstrating the system's effectiveness in purifying the data. The…
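The filtering flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the trusted AI models used, so the two checks below are hypothetical keyword-based stand-ins, and the names `is_unsafe`, `is_undesirable`, `purify`, and `accuracy` are assumptions made for illustration. The `accuracy` helper assumes accuracy means the fraction of flagging decisions that match human labels, which the abstract does not spell out.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the trusted AI models mentioned in the abstract.
# A real system would call a safety classifier or an LLM here instead.
UNSAFE_TERMS = {"malware", "exploit"}
UNDESIRABLE_TERMS = {"clickbait", "unverified"}


def is_unsafe(text: str) -> bool:
    """Return True if the text contains any term from the unsafe list."""
    return bool(set(text.lower().split()) & UNSAFE_TERMS)


def is_undesirable(text: str) -> bool:
    """Return True if the text contains any term from the undesirable list."""
    return bool(set(text.lower().split()) & UNDESIRABLE_TERMS)


def purify(
    rows: List[str],
    unsafe_check: Callable[[str], bool] = is_unsafe,
    undesirable_check: Callable[[str], bool] = is_undesirable,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split rows into kept text and (text, reason) pairs that were flagged."""
    kept: List[str] = []
    flagged: List[Tuple[str, str]] = []
    for text in rows:
        if unsafe_check(text):
            flagged.append((text, "unsafe"))
        elif undesirable_check(text):
            flagged.append((text, "undesirable"))
        else:
            kept.append(text)
    return kept, flagged


def accuracy(predicted: List[bool], expected: List[bool]) -> float:
    """Fraction of flagging decisions that agree with the reference labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


if __name__ == "__main__":
    sample = [
        "Download this malware now",
        "Clickbait headline you must see",
        "The library released version 2 today",
    ]
    kept, flagged = purify(sample)
    print("kept:", kept)
    print("flagged:", flagged)
```

Passing the checks as parameters keeps the loop independent of any particular model, so a stronger classifier can replace the keyword stand-ins without changing `purify`.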
Taxonomy
Topics: Computational Physics and Python Applications · Advanced Computational Techniques and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Linear Warmup With Linear Decay · Linear Layer · BART · Layer Normalization · Attention Dropout · Residual Connection · WordPiece
