Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng

TL;DR
This paper uses Sparse Autoencoders to analyze and identify language-specific features in Large Language Models, revealing their role in multilingual capabilities and enabling improved language control.
Contribution
It introduces a novel SAE-based method and metric to identify language-specific features, demonstrating their impact on multilingual abilities and steering control in LLMs.
Findings
Some SAE features are strongly language-specific.
Ablating these features significantly reduces the model's ability in one language while leaving other languages almost unaffected.
Combining features enhances language control.
Abstract
The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces the LLM's abilities in only one language, leaving the others almost unaffected. Interestingly, we find that some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating…
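The abstract describes three operations: decomposing an activation into sparse SAE features, ablating one feature, and scoring how language-specific a feature is. The sketch below illustrates them on toy vectors in plain Python. It is not the authors' implementation: the ReLU encoder form, the function names, the toy weights, and in particular the `mean_activation_gap` score are assumptions made for illustration, since this summary does not define the paper's metric.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations f = ReLU(W_enc x + b_enc) (assumed encoder form)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction x_hat = b_dec + sum_i f[i] * w_dec[i].

    Row i of w_dec is the decoder direction of feature i, so x_hat is a
    linear combination of feature directions weighted by their activations.
    """
    out = list(b_dec)
    for f_i, d_i in zip(f, w_dec):
        for j, d_ij in enumerate(d_i):
            out[j] += f_i * d_ij
    return out


def ablate_feature(x: Vector, f: Vector, w_dec: Matrix, idx: int) -> Vector:
    """Subtract feature idx's contribution f[idx] * w_dec[idx] from activation x."""
    return [x_j - f[idx] * d_j for x_j, d_j in zip(x, w_dec[idx])]


def mean_activation_gap(acts_by_lang: Dict[str, List[Vector]], lang: str, idx: int) -> float:
    """Hypothetical monolinguality score (an assumption, not the paper's definition).

    Mean activation of feature idx on `lang`, minus the average of its
    per-language mean activations on all other languages.
    """
    def mean_act(rows: List[Vector]) -> float:
        return sum(r[idx] for r in rows) / len(rows)

    others = [mean_act(rows) for name, rows in acts_by_lang.items() if name != lang]
    return mean_act(acts_by_lang[lang]) - sum(others) / len(others)


# Toy SAE: 2-dimensional activations, 3 features (all values are made up).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
features = sae_encode(x, W_ENC, B_ENC)
x_without_f0 = ablate_feature(x, features, W_DEC, idx=0)

# Toy per-language feature activations, one row per token.
ACTS = {
    "en": [[2.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
    "fr": [[0.0, 3.0, 0.0]],
}
score_f0_en = mean_activation_gap(ACTS, "en", idx=0)
```

In this toy setup, feature 0 fires only on the "en" rows, so its score for "en" is positive, while feature 1 scores negative for "en" because it fires only on "fr". In a real model the ablation would be applied to the residual-stream activation at the layer the SAE was trained on, which is a detail this summary does not specify.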
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
