IF-GUIDE: Influence Function-Guided Detoxification of LLMs

Zachary Coalson; Juhan Bae; Nicholas Carlini; Sanghyun Hong

arXiv:2506.01790·cs.LG·December 8, 2025

TL;DR

IF-GUIDE introduces a proactive influence function-based method to identify and suppress toxic training data in large language models, significantly reducing toxicity without relying on human preference data.

Contribution

The paper presents a novel influence function adaptation that effectively detects harmful training data for toxicity mitigation in LLMs, outperforming existing alignment methods.

Findings

01 Reduces explicit and implicit toxicity by up to 10×

02 Outperforms baseline alignment methods such as DPO and RAD

03 Remains effective when a smaller model, with fewer parameters, computes the influence scores

Abstract

We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-GUIDE, that leverages influence functions to identify and suppress harmful tokens in the training data. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-GUIDE does not rely on human-preference data, which is typically required by existing…
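To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy bigram softmax model, replaces the inverse-Hessian (curvature) term of a full influence function with the identity, and uses a hypothetical suppression objective that subtracts a weighted loss on flagged tokens. The names `token_grad`, `token_scores`, `suppression_loss`, the threshold `0.5`, and the weight `lam` are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def token_grad(W, prev, nxt):
    """Gradient of -log p(nxt | prev) w.r.t. the bigram logit table W."""
    g = np.zeros_like(W)
    g[prev] = softmax(W[prev])
    g[prev, nxt] -= 1.0
    return g

def token_scores(W, doc, toxic_queries):
    """Per-token attribution: grad(training token) . grad(toxic queries).

    Assumption: identity curvature, i.e. a first-order stand-in for an
    influence function. A positive score means a gradient step on that
    training token would lower the loss on the toxic queries.
    """
    g_query = sum(token_grad(W, p, n)
                  for q in toxic_queries for p, n in zip(q, q[1:]))
    return np.array([np.sum(token_grad(W, p, n) * g_query)
                     for p, n in zip(doc, doc[1:])])

def suppression_loss(W, doc, flagged, lam=1.0):
    """Hypothetical objective: usual NLL on clean tokens, penalized on flagged ones."""
    nll = np.array([-np.log(softmax(W[p])[n]) for p, n in zip(doc, doc[1:])])
    return nll[~flagged].sum() - lam * nll[flagged].sum()

# Toy setup: vocabulary of 4 token ids, untrained (all-zero) logits.
W = np.zeros((4, 4))
doc = [0, 1, 2, 3]            # training document
toxic_queries = [[1, 2]]      # stand-in for toxic reference text

scores = token_scores(W, doc, toxic_queries)
flagged = scores > 0.5        # illustrative threshold
loss = suppression_loss(W, doc, flagged)
print(scores, flagged, loss)
```

In this toy run only the transition `1 -> 2` in the training document shares a gradient direction with the toxic query, so it is the only token flagged and the only one whose likelihood the modified loss pushes down.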

Code & Models

Repositories

ztcoalson/if-guide (PyTorch, official)

Videos

IF-Guide: Influence Function-Guided Detoxification of LLMs · SlidesLive

Taxonomy

Topics: Redox biology and oxidative stress

Methods: Direct Preference Optimization · ALIGN