Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Hengyuan Zhang; Zhihao Zhang; Mingyang Wang; Zunhai Su; Yiwei Wang; Qianli Wang; Shuzhou Yuan; Ercong Nie; Xufeng Duan; Feijiang Han; Qibo Xue; Zeping Yu; Chenming Shang; Xiao Liang; Jing Xiong; Hui Shen; Chaofan Tao; Zhengwu Liu; Senjie Jin; Zhiheng Xi; Dongdong Zhang; Sophia Ananiadou; Tao Gui; Ruobing Xie; Hayden Kwok-Hay So; Hinrich Schütze; Xuanjing Huang; Qi Zhang; Ngai Wong

arXiv:2601.14004 · cs.CL · April 15, 2026

TL;DR

This paper presents a practical framework for mechanistic interpretability in large language models, emphasizing actionable diagnosis and intervention to improve model alignment, capability, and efficiency.

Contribution

The survey introduces a structured 'Locate, Steer, and Improve' pipeline, categorizes Localizing and Steering methods by interpretable object, and operationalizes mechanistic interpretability (MI) as an actionable approach to model enhancement.

Findings

01. The framework enables tangible improvements in model alignment, capability, and efficiency.

02. Localizing and steering methods are categorized according to specific interpretable objects.

03. MI is operationalized as an actionable methodology for model optimization.

Abstract

Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at…
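
To make the "Locate, Steer, and Improve" loop concrete, here is a minimal toy sketch in plain Python. It is not taken from the survey: the synthetic activations, the function names (`locate_direction`, `steer`, `project`), and the choice of a difference-of-means direction are all assumptions made for illustration. It shows one simple way a direction could be located from contrasting activations, used to shift a new activation, and then checked with a scalar readout.

```python
# Toy locate -> steer -> check loop on synthetic "activations".
# All data, names, and the difference-of-means choice are hypothetical
# illustrations, not the survey's protocol.

def mean(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def locate_direction(pos_acts, neg_acts):
    """Locate: difference of class means as a candidate feature direction."""
    mu_pos, mu_neg = mean(pos_acts), mean(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def steer(activation, direction, alpha):
    """Steer: shift an activation by alpha times the located direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def project(activation, direction):
    """Readout: dot product of an activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Synthetic activations: the two groups differ only in the first coordinate.
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = locate_direction(pos, neg)
x = [0.0, 1.0, 2.0]
steered = steer(x, direction, alpha=0.5)

# Check: the readout along the direction should increase after steering.
before, after = project(x, direction), project(steered, direction)
```

In a real model the activations would come from a chosen layer and the "improve" step would be measured on a downstream metric rather than a dot product; both are outside the scope of this sketch.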

Code & Models

Repositories

rattlesnakey/Awesome-Actionable-MI-Survey (GitHub)