Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov; Alex Cloud; Aryo Pradipta Gema; Jacob Goldman-Wetzler; Nina Panickssery; Henry Sleight; Erik Jones; Cem Anil

arXiv:2512.05648·cs.LG·December 8, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Selective Gradient Masking (SGTM), a method for removing specific knowledge from large language models that remains robust under label noise and adversarial fine-tuning.

Contribution

We propose SGTM, an improved variant of Gradient Routing that makes knowledge removal in large language models more robust to label noise and adversarial fine-tuning.

Findings

01

SGTM outperforms data filtering and previous Gradient Routing variants in noisy label settings.

02

A model trained with SGTM needs significantly more adversarial fine-tuning steps than one treated with unlearning methods before it reverts to baseline performance.

03

SGTM effectively removes targeted knowledge in bilingual and biology tasks, demonstrating robustness and safety benefits.

Abstract

Large Language Models increasingly possess capabilities that carry dual-use risks. While data filtering has emerged as a pretraining-time mitigation, it faces significant challenges: labeling whether data is harmful is expensive at scale, and given improving sample efficiency with larger models, even small amounts of mislabeled content could give rise to dangerous capabilities. To address risks associated with mislabeled harmful content, prior work proposed Gradient Routing (Cloud et al., 2024) -- a technique that localizes target knowledge into a dedicated subset of model parameters so they can later be removed. We explore an improved variant of Gradient Routing, which we call Selective GradienT Masking (SGTM), with particular focus on evaluating its robustness to label noise. SGTM zero-masks selected gradients such that target domain examples only update their dedicated parameters. We…
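The gradient-masking idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear model, the `DEDICATED` index set, and every function name below are hypothetical, and leaving non-target examples unmasked is an assumption of the sketch rather than something the abstract specifies. It only shows that a target-domain example updates the dedicated parameters alone, and that zeroing those parameters afterwards removes what was localized there.

```python
# Hypothetical toy setup: a linear model with 4 weights, of which the last
# two are (by assumption) the parameters dedicated to the target domain.
DEDICATED = {2, 3}  # illustrative indices, not taken from the paper


def grad_squared_error(weights, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to each weight."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [err * xi for xi in x]


def mask_gradient(grad, is_target):
    """Zero the gradient outside the dedicated parameters for target examples.

    Non-target examples are left unmasked; that is an assumption of this sketch.
    """
    if not is_target:
        return list(grad)
    return [g if i in DEDICATED else 0.0 for i, g in enumerate(grad)]


def sgd_step(weights, x, y, is_target, lr=0.1):
    """One SGD step using the masked gradient."""
    grad = mask_gradient(grad_squared_error(weights, x, y), is_target)
    return [w - lr * g for w, g in zip(weights, grad)]


def ablate(weights):
    """Remove the localized knowledge by zeroing the dedicated parameters."""
    return [0.0 if i in DEDICATED else w for i, w in enumerate(weights)]


weights = [0.5, -0.5, 0.0, 0.0]
after_target = sgd_step(weights, x=[1.0, 1.0, 1.0, 1.0], y=2.0, is_target=True)
after_ablation = ablate(after_target)
```

In this toy run the first two weights are untouched by the target-domain step, so ablating the dedicated weights returns the model to its starting state.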

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. **Adversarial robustness**: the detailed discussions on mislabeled content and adversarial fine-tuning are valuable and highly relevant to the community.
2. **Clear presentations**: the figures and visualizations are informative and well-designed.

Weaknesses

1. **Insufficient Evidence**: This is my primary concern. The evaluation relies solely on model loss, which may not adequately capture the downstream performance differences that truly matter. It is unclear to me whether higher loss indeed indicates better forgetting. Including additional evaluations for forgetting and general performance retention would substantially strengthen the paper's empirical support.
2. **Limited Scale**: As noted in section 6, the experiments use very small model and dat

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Empirical results on multiple settings show an improved trade-off between general capabilities and forgetting of undesirable content, compared with filtering. Moreover, it has much better performance against fine-tuning compared with a strong unlearning method, RMU. The method is quite simple and intuitive. Gradient masking sequesters undesirable knowledge into a small subset of parameters, while parameter masking encourages the rest of the parameters to function well even when those parameters

Weaknesses

This paper compares only with filtering and a similar previous work (Gradient Routing), but other methods have also been developed as alternatives to filtering:
* https://arxiv.org/abs/2302.08582 This paper explores several training objectives and finds that a "conditional training" approach works well. It seems that SGTM could directly compare with this approach.
* https://arxiv.org/abs/2505.03052 This paper has a somewhat different motivation, but they can use a more aggressive threshold on th

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- clearly written, seems novel - caveat that I am not deeply familiar with the unlearning literature
- experiments seem to support the basic point of improvement the authors suggest for SGTM, and are fairly thorough (ablations with related data categories are cool)
- Fig 1 is great! In general the communication around tradeoffs is well done

Weaknesses

- it's odd to me that there aren't results shown for Fig 4 for GR - isn't this the main baseline we should be comparing to?
- some contradictory statements around parameter subsets: in the Fig 2 caption the authors state that after the forget parameters are assigned, "the remaining parameters are designated to the retain data", but then discuss something called "joint" parameters in line 183 - it would be good to give more intuition here
- why is SGTM more robust to label noise? it's not a priori obvious t

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)