Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Seongmin Lee; Aeree Cho; Grace C. Kim; ShengYun Peng; Mansi Phute; Duen Horng Chau

arXiv:2506.05451 · cs.SE · June 9, 2025

TL;DR

This survey reviews interpretation methods and tools that enhance the safety of large language models, providing a unified framework and taxonomy to guide future research and practical safety improvements.

Contribution

The paper presents what it describes as the first survey connecting interpretation techniques with safety improvements in LLMs, and it introduces a novel taxonomy organized by LLM workflow stages.

Findings

1. Nearly 70 works summarized in the taxonomy
2. Identification of key safety-focused interpretation methods
3. Discussion of open challenges and future directions

Abstract

As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning