Detecting and Filtering Unsafe Training Data via Data Attribution with Denoised Representation
Yijun Pan, Taiwei Shi, Jieyu Zhao, Jiaqi W. Ma

TL;DR
This paper introduces Denoised Representation Attribution (DRA), a data attribution method that detects unsafe training data for large language models by denoising data representations. It outperforms existing moderation-based approaches.
Contribution
The paper proposes DRA, a data attribution technique that denoises data representations to improve the detection of unsafe data in LLM training sets.
Findings
DRA significantly outperforms state-of-the-art moderation classifiers.
DRA effectively detects jailbreak and gender-bias examples in training data.
Denoising improves the reliability of data attribution methods.
Abstract
Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current state-of-the-art (SOTA) detection approaches primarily rely on moderation classifiers, which require significant computational overhead for training and are limited to predefined taxonomies. In this work, we explore data attribution approaches that measure the similarity between individual training samples and a small set of unsafe target examples, based on data representations such as hidden states or gradients. We identify a key limitation in existing methods: unsafe target texts contain both critical tokens that make them unsafe and neutral tokens (e.g., stop words or benign facts) that are necessary to form fluent language, and the latter make the overall representations "noisy" for…
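The similarity-based attribution idea in the abstract can be illustrated with a minimal sketch. It covers only the generic baseline (scoring each training sample by its similarity to a few unsafe target examples), not DRA's denoising step, which this excerpt does not describe. The function names, the use of cosine similarity, the averaging over targets, and the toy two-dimensional vectors are all assumptions made for illustration; in practice the representations would be hidden states or gradients taken from a model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribution_scores(train_reps, target_reps):
    """Score each training sample by its mean cosine similarity to the unsafe targets.

    Assumption: mean cosine similarity stands in for whatever similarity
    measure a real attribution method would use.
    """
    return [
        sum(cosine(x, t) for t in target_reps) / len(target_reps)
        for x in train_reps
    ]


def flag_top_k(scores, k):
    """Return the indices of the k highest-scoring training samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy, made-up representations (hypothetical; not data from the paper).
train_reps = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
target_reps = [[1.0, 0.0]]

scores = attribution_scores(train_reps, target_reps)
flagged = flag_top_k(scores, 2)  # candidates to review or filter
```

In this sketch, the flagged indices are the training samples whose representations lie closest to the unsafe targets. Because every token of a target text contributes to its representation, neutral tokens can pull unrelated benign samples toward the top of this ranking, which is the noise problem the abstract raises.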
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Advanced Data Processing Techniques · Adversarial Robustness in Machine Learning
