Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau

TL;DR
This survey reviews interpretation methods and tools that enhance the safety of large language models, providing a unified framework and taxonomy to guide future research and practical safety improvements.
Contribution
It introduces the first comprehensive survey connecting interpretation techniques with safety improvements in LLMs, including a novel taxonomy organized by workflow stages.
Findings
Nearly 70 works summarized in the taxonomy
Identification of key safety-focused interpretation methods
Discussion of open challenges and future directions
Abstract
As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsTopic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
