Steering LLMs for Culturally Localized Generation
Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, Rajiv Mathews, Lun Wang

TL;DR
This paper introduces Cultural Embeddings (CuE), a mechanistic interpretability approach for analyzing how large language models represent culture and for steering them toward culturally faithful responses, so that cultural biases can be inspected and manipulated directly rather than through black-box methods.
Contribution
The paper presents a method that uses sparse autoencoders to identify interpretable cultural features in LLMs, and builds white-box steering techniques for cultural localization on those features.
Findings
CuE-based steering improves cultural faithfulness in responses.
CuE-based steering elicits rarer, long-tail cultural concepts than prompting alone.
Combining CuE with prompt-augmentation yields the best localization results.
Abstract
LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases under underspecified prompts and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone.…
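The pipeline the abstract describes (encode activations with a sparse autoencoder, aggregate culturally salient features into an embedding, then steer with it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the random SAE weights, the top-k mean-difference selection rule, the additive steering step, and all names (`sae_encode`, `cultural_embedding`, `steer`, `alpha`, `top_k`) are hypothetical and are not taken from the paper, which may select, aggregate, and inject features differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64

# Hypothetical SAE weights: random stand-ins, not a trained autoencoder.
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)


def sae_encode(h):
    """ReLU encoder: model activations -> non-negative feature activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def cultural_embedding(culture_acts, baseline_acts, top_k=8):
    """Assumed aggregation rule: keep the top_k features whose mean
    activation rises most on culture-specific text versus a baseline."""
    diff = sae_encode(culture_acts).mean(axis=0) - sae_encode(baseline_acts).mean(axis=0)
    top = np.argsort(diff)[-top_k:]
    emb = np.zeros(n_features)
    emb[top] = np.maximum(diff[top], 0.0)  # discard features that did not increase
    return emb


def steer(h, emb, alpha=4.0):
    """Assumed intervention: add the decoded embedding direction,
    scaled to length alpha, to every hidden state."""
    direction = emb @ W_dec
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return h
    return h + alpha * direction / norm


# Toy usage with synthetic activations standing in for real model states.
baseline = rng.normal(size=(32, d_model))
culture = baseline + 2.0 * rng.normal(size=d_model)  # shifted along one direction
cue = cultural_embedding(culture, baseline)
steered = steer(baseline, cue)
```

Because the embedding lives in the SAE feature space, each nonzero entry of `cue` points at a single feature that can be inspected on its own, which is the property that makes this style of steering white-box in contrast to prompting.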
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI) · Language and cultural evolution
