Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency
Lucas Bandarkar, Alan Ansell, Trevor Cohn

TL;DR
This paper introduces an interpretability method for mixture-of-experts (MoE) LLMs that uses cross-lingual inconsistency, where a model recalls a fact in some languages but not others, to localize the model components relevant to specific pieces of knowledge.
Contribution
It presents a framework that contrasts MoE routing in languages where the model answers a factual question correctly with routing in languages where it fails, in order to identify and localize knowledge-specific experts.
Findings
Deactivating only about 20 of the identified experts makes the model fail on over 40% of the questions it previously answered correctly.
Cross-lingual inconsistency reveals knowledge localization in MoE models.
The method is scalable and effective for complex LLMs.
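The deactivation finding can be pictured at the level of a single MoE router. The following is a hypothetical sketch, assuming a standard top-k router that turns the selected experts' logits into weights with a softmax; the helper `route_top_k` and its `disabled` parameter are invented for illustration, and this summary does not say exactly how the paper removes experts.

```python
import math

def route_top_k(logits, k, disabled=frozenset()):
    """Pick the top-k experts for one token at one MoE layer.

    Hypothetical helper: experts listed in `disabled` get a logit of -inf,
    so they can never be selected; the surviving top-k logits are turned
    into routing weights with a softmax.
    """
    masked = [-math.inf if i in disabled else x for i, x in enumerate(logits)]
    # Rank experts by (masked) router logit and keep the k best.
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    top = [i for i in order[:k] if masked[i] != -math.inf]
    if not top:
        return []  # every candidate expert was deactivated
    # Numerically stable softmax over the selected experts only.
    m = max(masked[i] for i in top)
    exps = [math.exp(masked[i] - m) for i in top]
    z = sum(exps)
    return [(i, w / z) for i, w in zip(top, exps)]


logits = [2.0, 1.0, 0.5, 3.0]
print(route_top_k(logits, k=2))                # experts 3 and 0 are routed to
print(route_top_k(logits, k=2, disabled={3}))  # expert 3 removed; 0 and 1 take over
```

In this sketch, masking at the logit level (rather than zeroing an expert's output after selection) lets the next-best expert fill the freed slot, so the number of active experts per token stays the same.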
Abstract
Modern LLMs continue to exhibit significant variance in behavior across languages, for example recalling factual information in some languages but not others. While this inconsistency is typically studied as a problem to be mitigated, in this work we propose leveraging it as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts the routing in languages where the model correctly recalls a piece of information with the routing in languages where it fails. This allows us to isolate model components that play a functional role in answering questions about that piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate "success" and "failure" activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify…
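The two stages described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each language's run is reduced to one router-logit vector (one score per expert, for example averaged over the question's tokens and flattened across layers), and because the abstract is cut off before naming its statistic, a Welch-style t statistic per expert stands in for the paper's contrastive analysis. The functions `split_buckets` and `rank_experts`, the default `top_n=20` (echoing the "about 20 experts" finding), and the toy data are all hypothetical.

```python
import math
from statistics import mean, variance

def split_buckets(results):
    """Stage 1 (sketch): split per-language runs into success/failure buckets.

    results: dict mapping language -> (answered_correctly, router_logit_vector).
    """
    success = [vec for ok, vec in results.values() if ok]
    failure = [vec for ok, vec in results.values() if not ok]
    return success, failure

def rank_experts(success, failure, top_n=20):
    """Stage 2 (sketch): rank experts by how much more the router favors them
    when the model answers correctly.

    Uses a Welch-style t statistic per expert as an illustrative contrast
    (an assumption); each bucket needs at least two languages so that the
    sample variance is defined.
    """
    scores = []
    for e in range(len(success[0])):
        s = [vec[e] for vec in success]
        f = [vec[e] for vec in failure]
        diff = mean(s) - mean(f)
        se = math.sqrt(variance(s) / len(s) + variance(f) / len(f))
        if se > 0:
            t = diff / se
        else:
            # No spread in either bucket: fall back to the sign of the difference.
            t = math.copysign(math.inf, diff) if diff else 0.0
        scores.append((t, e))
    scores.sort(reverse=True)  # highest contrast first
    return [e for _, e in scores[:top_n]]


# Toy router-logit vectors for 3 experts; invented for illustration, not paper data.
results = {
    "en": (True, [1.0, 0.0, 0.5]),
    "fr": (True, [1.2, 0.1, 0.4]),
    "de": (True, [0.9, -0.1, 0.6]),
    "sw": (False, [0.0, 0.0, 0.5]),
    "th": (False, [0.1, 0.1, 0.4]),
    "yo": (False, [-0.1, -0.1, 0.6]),
}
success, failure = split_buckets(results)
print(rank_experts(success, failure, top_n=1))  # [0]
```

Contrasting the two buckets, rather than ranking experts by raw router logits, is meant in this sketch to discount experts that are routed to equally in every language and to keep those whose routing tracks successful recall.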
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Mobile Crowdsensing and Crowdsourcing
