Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Francesco Sovrano; Gabriele Dominici; Marc Langheinrich

arXiv:2605.03058·cs.LG·May 6, 2026


TL;DR

This paper introduces MechaRule, a pipeline that grounds rule extraction in large language model circuits by localizing sparse neurons, called agonists, whose suppression disrupts rule-related behaviors.

Contribution

MechaRule offers an efficient, data-driven way to identify the neurons responsible for specific behaviors in LLMs, connecting mechanistic interpretability with symbolic rule extraction.

Findings

01. MechaRule recalls 96.8% of high-effect agonists in experiments.

02. Suppressing localized agonists reduces arithmetic accuracy by up to 71.1%.

03. Suppressing localized agonists reduces jailbreak success by up to 8.8%.

Abstract

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many…
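
To make the idea of hierarchical ablation concrete, here is a minimal sketch of a group-testing search over neurons. It is not the paper's algorithm: the function names, the toy `effect` function, and the `threshold` value are all hypothetical, and a real `effect` would run the model with the given neurons neutralized and measure how much a rule-related behavior changes. The sketch assumes the ablation effect is monotone (ablating a superset never has a smaller effect) and saturating, which the abstract describes only as approximately true; when that assumption fails, the pruning step can miss agonists.

```python
from typing import Callable, Dict, List, Sequence, Set


def localize_agonists(
    neurons: Sequence[int],
    effect: Callable[[Set[int]], float],
    threshold: float,
) -> List[int]:
    """Recursively bisect `neurons`, pruning groups with a weak joint effect.

    Assumption: `effect` is monotone, so a group whose joint ablation effect
    is below `threshold` cannot contain a neuron that reaches it alone.
    """
    group = set(neurons)
    if not group or effect(group) < threshold:
        return []  # prune the whole group with a single ablation
    if len(neurons) == 1:
        return [neurons[0]]  # a single neuron that passes the threshold
    mid = len(neurons) // 2
    return (
        localize_agonists(neurons[:mid], effect, threshold)
        + localize_agonists(neurons[mid:], effect, threshold)
    )


def make_toy_effect(strengths: Dict[int, float]) -> Callable[[Set[int]], float]:
    """Hypothetical stand-in for a real ablation experiment.

    The effect is the sum of per-neuron strengths, capped at 1.0, which makes
    it monotone and saturating by construction.
    """
    def effect(ablated: Set[int]) -> float:
        return min(1.0, sum(strengths.get(n, 0.0) for n in ablated))

    return effect


# Toy setup: 64 neurons, three of which carry the behavior.
strengths = {3: 0.6, 17: 0.5, 42: 0.9}
effect = make_toy_effect(strengths)
found = localize_agonists(list(range(64)), effect, threshold=0.4)
print(found)  # [3, 17, 42]
```

Because whole groups are discarded after one joint ablation, the search touches far fewer neuron sets than testing each neuron individually when agonists are sparse.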
