TL;DR
LangFIR introduces a novel method to identify sparse, language-specific features in multilingual language models using only monolingual data and random-token filtering, enabling effective language control.
Contribution
This work presents LangFIR, a new approach that discovers language-specific features without requiring multilingual or parallel data, outperforming existing methods in language steering accuracy.
Findings
LangFIR finds highly sparse, language-specific SAE features in residual-stream activations.
Directional ablation of these features increases cross-entropy loss for the target language.
LangFIR achieves superior BLEU scores across multiple models, datasets, and languages.
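As a rough illustration of the directional ablation mentioned in the findings, the sketch below removes the component of each activation vector that lies along a single feature direction. It uses the standard projection formula and NumPy only; the function name `ablate_direction`, the toy shapes, and the random data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation along `direction`.

    acts:      (n_tokens, d_model) residual-stream activations (toy data here)
    direction: (d_model,) feature direction, e.g. an SAE decoder vector (assumed)
    """
    d = direction / np.linalg.norm(direction)  # unit vector
    # Subtract the projection of every row onto d.
    return acts - np.outer(acts @ d, d)

# Toy example with random data (hypothetical sizes).
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
direction = rng.normal(size=8)

ablated = ablate_direction(acts, direction)
```

After ablation, the activations have no component left along the chosen direction, while everything orthogonal to it is unchanged. In the setting described above, the model's cross-entropy loss on target-language text would then be compared before and after applying such an ablation.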
Abstract
Large language models (LLMs) show strong multilingual capabilities, yet reliably controlling the language of their outputs remains difficult. Representation-level steering addresses this by adding language-specific vectors to model activations at inference time, but identifying language-specific directions in the residual stream often relies on multilingual or parallel data that can be expensive to obtain. Sparse autoencoders (SAEs) decompose residual activations into interpretable, sparse feature directions and offer a natural basis for this search, yet existing SAE-based approaches face the same data constraint. We introduce LangFIR (Language Feature Identification via Random-token Filtering), a method that discovers language-specific SAE features using only a small amount of monolingual data and random-token sequences. Many SAE features consistently activated by target-language…
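The abstract is cut off before it states the filtering rule, so the following is only a guess at what random-token filtering could look like: keep SAE features that fire on nearly all target-language tokens, then drop those that also fire often on random-token sequences. The function names, the frequency-based criterion, and both thresholds (`min_lang_freq`, `max_random_freq`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def activation_frequency(feature_acts, threshold=0.0):
    """Fraction of tokens on which each feature is active.

    feature_acts: (n_tokens, n_features) SAE feature activations.
    """
    return (feature_acts > threshold).mean(axis=0)

def select_language_features(lang_acts, random_acts,
                             min_lang_freq=0.9, max_random_freq=0.1):
    """Hypothetical filter: frequent on monolingual text, rare on random tokens."""
    lang_freq = activation_frequency(lang_acts)
    rand_freq = activation_frequency(random_acts)
    keep = (lang_freq >= min_lang_freq) & (rand_freq <= max_random_freq)
    return np.flatnonzero(keep)

# Toy activations: 3 tokens x 3 features (made-up numbers).
lang_acts = np.array([[1.0, 0.5, 0.0],
                      [2.0, 0.7, 0.0],
                      [1.5, 0.9, 0.3]])
random_acts = np.array([[1.2, 0.0, 0.0],
                        [0.8, 0.0, 0.0],
                        [1.1, 0.0, 0.4]])

selected = select_language_features(lang_acts, random_acts)
```

In this toy data, feature 0 fires on every token in both inputs, so it looks language-agnostic and is filtered out. Feature 1 fires only on the monolingual text and is kept, and feature 2 is too rare on the monolingual text to qualify.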
