WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior
Haonan Yu, Junhao Liu, Zhenyu Yan, Haoran Lin, Xin Zhang

TL;DR
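WASD is a framework that explains and controls large language model outputs by identifying minimal sets of neuron-activation conditions that are sufficient for a given output. On SST-2 and CounterFact, it yields more stable and accurate explanations than attribution graphs.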
Contribution
It introduces an approach that explains and controls LLM behavior by searching for a minimal set of neuron-activation predicates that is sufficient to preserve the model's current output under input perturbations.
Findings
WASD produces more stable, accurate, and concise explanations than conventional attribution graphs.
A case study shows that the method can control cross-lingual output generation.
Experiments on SST-2 and CounterFact with the Gemma-2-2B model validate the approach's effectiveness.
Abstract
Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the…
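The search described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm or code: it assumes a hypothetical toy model (`hidden`, `readout`), simplifies each predicate to "neuron i keeps its original activation", replaces sampled input perturbations with a fixed grid, and uses plain greedy deletion to shrink the set. All names below are made up for illustration.

```python
from itertools import product

def hidden(x):
    """Toy 'neurons' (hypothetical): hand-written features of a 3-d input."""
    return [x[0] + x[1], x[1] - x[2], 0.5 * x[2], x[0]]

def readout(h):
    """Toy output head (hypothetical): maps activations to a token label."""
    score = 2.0 * h[0] + 0.1 * h[1] - 0.1 * h[2]
    return "pos" if score > 1.0 else "neg"

def is_sufficient(kept, x, perturbations):
    """True if clamping the neurons in `kept` to their original activations
    preserves the original output under every perturbation in the list."""
    h_orig = hidden(x)
    target = readout(h_orig)
    for delta in perturbations:
        h = hidden([xi + di for xi, di in zip(x, delta)])
        for i in kept:
            h[i] = h_orig[i]  # enforce the predicate on neuron i
        if readout(h) != target:
            return False
    return True

def minimal_sufficient_set(x, perturbations):
    """Greedy deletion: start from all neurons and drop each one whose
    removal keeps the set sufficient. Yields a minimal set, which is not
    necessarily the smallest possible one."""
    kept = set(range(len(hidden(x))))
    for i in sorted(kept):
        trial = kept - {i}
        if is_sufficient(trial, x, perturbations):
            kept = trial
    return kept

x = [1.0, 1.0, 0.0]
# Fixed grid of input perturbations, standing in for sampled ones.
perturbations = list(product([-1.0, 0.0, 1.0], repeat=3))
critical = minimal_sufficient_set(x, perturbations)
print(sorted(critical))  # prints [0]
```

In this toy setup only neuron 0 is needed: holding it at its original value keeps the output at "pos" for every perturbation in the grid, while the other three neurons can vary freely.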
