Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

Yutong Gao; Qinglin Meng; Yuan Zhou; Liangming Pan

arXiv:2604.16042 · cs.CL · April 21, 2026

TL;DR

This survey reviews recent advances in intrinsic interpretability of large language models, focusing on design principles that enhance transparency directly within model architectures.

Contribution

The survey categorizes existing approaches into five design paradigms and discusses open challenges and future directions in intrinsic interpretability for LLMs.

Findings

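01 Identifies five design paradigms for intrinsic interpretability: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.

02 Highlights open challenges and future research directions.

03 Provides a comprehensive categorization of recent approaches.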

Abstract

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is…

Code & Models

Repositories

PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs (GitHub)
