IF-GUIDE: Influence Function-Guided Detoxification of LLMs
Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong

TL;DR
IF-GUIDE introduces a proactive influence function-based method to identify and suppress toxic training data in large language models, significantly reducing toxicity without relying on human preference data.
Contribution
The paper presents a novel influence function adaptation that effectively detects harmful training data for toxicity mitigation in LLMs, outperforming existing alignment methods.
Findings
Reduces explicit and implicit toxicity by up to 10×
Outperforms baseline alignment methods such as DPO and RAD
Remains effective when a smaller model, with fewer parameters, is used to compute the influence scores
Abstract
We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-GUIDE, that leverages influence functions to identify and suppress harmful tokens in the training data. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-GUIDE does not rely on human-preference data, which is typically required by existing…
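The abstract describes two steps: scoring training tokens by their attribution to model toxicity, and training with an objective that suppresses the flagged tokens. The sketch below illustrates that general idea only. It is not the paper's objective or scoring method: the function names, the threshold rule, the penalty form (the probability assigned to flagged tokens), and `penalty_weight` are all assumptions made for illustration, and the per-token scores are taken as given rather than computed with influence functions.

```python
import math

def select_toxic_tokens(scores, threshold):
    """Flag tokens whose (hypothetical) attribution score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

def guided_loss(token_log_probs, toxic_mask, penalty_weight=1.0):
    """Illustrative objective, assumed for this sketch.

    Unflagged tokens contribute the usual negative log-likelihood.
    Flagged tokens contribute a penalty equal to the probability the
    model assigns them, so minimizing the loss pushes that probability down.
    """
    nll_clean = sum(-lp for lp, m in zip(token_log_probs, toxic_mask) if m == 0)
    prob_toxic = sum(math.exp(lp) for lp, m in zip(token_log_probs, toxic_mask) if m == 1)
    return (nll_clean + penalty_weight * prob_toxic) / len(token_log_probs)

# Toy example: four tokens with made-up attribution scores.
scores = [0.1, 2.5, -0.3, 1.7]
mask = select_toxic_tokens(scores, threshold=1.0)

# Log-probabilities the model assigns to each training token.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.5)]
loss = guided_loss(log_probs, mask)

# Lowering the probability of a flagged token lowers the loss.
suppressed = [math.log(0.5), math.log(0.05), math.log(0.5), math.log(0.5)]
loss_suppressed = guided_loss(suppressed, mask)
```

In this toy setup the second and fourth tokens are flagged, and the loss drops once the model assigns the second token a lower probability, which is the behavior a suppression term is meant to encourage.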
Taxonomy
Methods: Direct Preference Optimization · ALIGN
