Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

Sayed Mohammad Vakilzadeh Hatefi; Maximilian Dreyer; Reduan Achtibat; Patrick Kahardipraja; Thomas Wiegand; Wojciech Samek; Alexander Binder; Sebastian Lapuschkin

arXiv:2506.13727·cs.LG·May 8, 2026

TL;DR

This paper introduces an attribution-guided pruning method using Layer-wise Relevance Propagation to identify and modify specific circuits in small-scale LLMs, improving interpretability and control over undesirable behaviors.

Contribution

The paper presents a circuit discovery and targeted correction technique that reduces toxic outputs and repetitive text without harming overall performance.

Findings

01. Pruning ~0.3% of neurons reduces toxic outputs significantly.

02. Pruning ~0.03% of weights mitigates repetitive text.

03. Method transfers across different small-scale language models.

Abstract

Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we…
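
The two ideas in the abstract, extracting a circuit by keeping the most relevant components and correcting a behavior by pruning components that are relevant mainly to undesired outputs, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names are hypothetical, the relevance values are made-up numbers standing in for per-sample LRP attributions, averaging over reference samples is one possible aggregation, and the difference-based contrastive score is a simple stand-in that may differ from the paper's exact formulation.

```python
from statistics import fmean


def aggregate_relevance(per_sample):
    """Average per-sample relevance into one score per component.

    `per_sample` is a list of rows, one row per reference sample,
    each row holding one relevance value per component (assumed to
    come from an attribution method such as LRP).
    """
    return [fmean(column) for column in zip(*per_sample)]


def top_indices(scores, fraction):
    """Indices of the highest-scoring `fraction` of components (at least one)."""
    n = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n])


def circuit_mask(relevance, keep_fraction):
    """Keep-mask for circuit discovery: True for the most relevant components."""
    keep = top_indices(relevance, keep_fraction)
    return [i in keep for i in range(len(relevance))]


def contrastive_prune_mask(rel_undesired, rel_general, prune_fraction):
    """Keep-mask for correction: False for components to prune.

    Assumed score: relevance on undesired-behavior samples minus
    relevance on general samples, so components that matter mostly
    for the undesired behavior rank highest.
    """
    score = [u - g for u, g in zip(rel_undesired, rel_general)]
    prune = top_indices(score, prune_fraction)
    return [i not in prune for i in range(len(score))]


# Toy relevance values for 4 components over 2 reference samples each.
general = aggregate_relevance([[0.9, 0.1, 0.5, 0.2],
                               [0.7, 0.3, 0.5, 0.0]])
undesired = aggregate_relevance([[0.8, 0.1, 0.4, 0.9],
                                 [0.8, 0.3, 0.6, 0.7]])

# Circuit for the general task: the 50% most relevant components.
circuit = circuit_mask(general, keep_fraction=0.5)

# Targeted correction: prune the 25% of components most specific
# to the undesired behavior.
corrected = contrastive_prune_mask(undesired, general, prune_fraction=0.25)

print(circuit)    # [True, False, True, False]
print(corrected)  # [True, True, True, False]
```

In this toy setup, component 3 is highly relevant only on the undesired samples, so the contrastive mask removes it while leaving the components that carry the general task untouched. A mask like this would then be applied to the corresponding neurons or weights of the model.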

Code & Models

Repositories

erfanhatefi/SparC3 (GitHub)