LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder

Yi Jing; Zijun Yao; Hongzhu Guo; Lingxu Ran; Xiaozhi Wang; Lei Hou; Juanzi Li

arXiv:2502.20344 · cs.CL · September 16, 2025

TL;DR

LinguaLens is a framework that uses Sparse Auto-Encoders (SAEs) to analyze and interpret the linguistic mechanisms of large language models, covering Chinese and English linguistic features across morphology, syntax, semantics, and pragmatics.

Contribution

This work presents a novel systematic approach for analyzing linguistic mechanisms in LLMs, including a large-scale counterfactual dataset and insights into cross-layer and cross-lingual representations.

Findings

01 Intrinsic linguistic representations in LLMs identified

02 Patterns of cross-layer and cross-lingual distribution uncovered

03 Potential for controlling model outputs demonstrated

Abstract

Large language models (LLMs) demonstrate exceptional performance on tasks requiring complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Prior research on linguistic mechanisms is limited by coarse granularity, limited analysis scale, and narrow focus. In this study, we propose LinguaLens, a systematic and comprehensive framework for analyzing the linguistic mechanisms of large language models, based on Sparse Auto-Encoders (SAEs). We extract a broad set of Chinese and English linguistic features across four dimensions (morphology, syntax, semantics, and pragmatics). By employing counterfactual methods, we construct a large-scale counterfactual dataset of linguistic features for mechanism…
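
To make the SAE-plus-counterfactual idea in the abstract concrete, here is a minimal sketch. It is not the authors' implementation. It assumes an SAE encoder of the common form f = ReLU(W_enc h + b_enc) and ranks SAE features by how much more strongly they fire on a sentence that contains a linguistic feature than on its counterfactual without it. The function names, weights, and hidden states are toy values invented for illustration; in practice the hidden states would come from an LLM layer and the weights from a trained SAE.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def sae_encode(h, W_enc, b_enc):
    """Assumed SAE encoder: f = ReLU(W_enc h + b_enc), one activation per feature."""
    return [
        relu(sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def rank_features_by_contrast(pairs, W_enc, b_enc):
    """Rank SAE features by mean activation gap between original and counterfactual.

    pairs: list of (h_original, h_counterfactual) hidden-state vectors.
    Returns (feature indices sorted by descending gap, list of mean gaps).
    """
    n_features = len(W_enc)
    gaps = [0.0] * n_features
    for h_orig, h_cf in pairs:
        f_orig = sae_encode(h_orig, W_enc, b_enc)
        f_cf = sae_encode(h_cf, W_enc, b_enc)
        for i in range(n_features):
            gaps[i] += (f_orig[i] - f_cf[i]) / len(pairs)
    ranking = sorted(range(n_features), key=lambda i: gaps[i], reverse=True)
    return ranking, gaps


# Toy SAE with 3 features over a 2-dimensional hidden state (hypothetical values).
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]

# Each pair: hidden state with the linguistic feature, and its counterfactual without it.
pairs = [
    ([1.0, 0.0], [0.0, 0.0]),
    ([1.0, 0.5], [0.0, 0.5]),
]

ranking, gaps = rank_features_by_contrast(pairs, W_enc, b_enc)
print("features ranked by contrast:", ranking)
print("mean activation gaps:", gaps)
```

Features at the top of the ranking are candidates for encoding the linguistic property that the counterfactual pairs isolate. A real analysis would also need a significance criterion and many more pairs per property.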

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques