Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

TL;DR
This paper introduces MechaRule, a pipeline that grounds rule extraction in large language model (LLM) circuits by efficiently localizing agonists: sparse neurons whose activation neutralization disrupts rule-related behaviors. The goal is to improve interpretability and robustness.
Contribution
MechaRule provides an efficient, data-driven approach to identify neurons responsible for specific behaviors in LLMs, connecting mechanistic interpretability with symbolic rule extraction.
Findings
MechaRule recalls 96.8% of high-effect agonists in experiments.
Suppressing localized agonists reduces arithmetic accuracy by up to 71.1%.
Suppressing localized agonists reduces jailbreak success by up to 8.8%.
Abstract
A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many…
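The abstract's first observation (approximately monotone, saturating agonist effects at coarse scales) suggests why a hierarchical search can avoid testing every neuron individually. The sketch below is a toy illustration only, not the authors' algorithm. It assumes an idealized effect function in which ablating a group yields the effect of its strongest member, and it bisects groups while pruning any whose ablation effect falls below a threshold. The names `localize_agonists`, `toy_ablation_effect`, and `TOY_STRENGTH`, along with the neuron indices and strengths, are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

def localize_agonists(
    neurons: Sequence[int],
    ablation_effect: Callable[[Sequence[int]], float],
    threshold: float,
) -> List[int]:
    """Bisect neuron groups, keeping only groups whose ablation effect
    reaches `threshold`. Assumes the effect is monotone in the group
    (a superset never has a smaller effect than its subsets)."""
    found: List[int] = []
    stack: List[List[int]] = [list(neurons)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        if ablation_effect(group) < threshold:
            continue  # prune: under monotonicity, no strong agonist is inside
        if len(group) == 1:
            found.append(group[0])  # single neuron with a high effect
            continue
        mid = len(group) // 2
        stack.append(group[:mid])
        stack.append(group[mid:])
    return sorted(found)

# Hypothetical per-neuron effect strengths (assumption, not data from the paper).
TOY_STRENGTH: Dict[int, float] = {17: 0.9, 402: 0.6, 733: 0.05}
calls = 0  # number of simulated ablation experiments

def toy_ablation_effect(group: Sequence[int]) -> float:
    """Idealized saturating effect: the strongest neuron in the group dominates."""
    global calls
    calls += 1
    return max((TOY_STRENGTH.get(n, 0.0) for n in group), default=0.0)

found = localize_agonists(range(1024), toy_ablation_effect, threshold=0.5)
print(found, calls)  # found is [17, 402]; calls is far below 1024
```

In a real model the effect function would run an intervention (for example, neutralizing the activations of a neuron group and measuring the change in a behavior metric), and the monotonicity assumption would hold only approximately, so a practical procedure would need to tolerate violations of it.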
