Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
Cliff O'Reilly, Ernesto Jimenez-Ruiz, Tillman Weyde

TL;DR
This paper introduces a method to disentangle concept semantics in Large Language Models by averaging concept activations across multiple languages using Sparse Autoencoders, improving interpretability and alignment with formal knowledge representations.
Contribution
The paper proposes a novel multilingual averaging technique that isolates concept semantics in LLMs, with the aim of improving interpretability and alignment with ontology-based knowledge.
Findings
Conceptual averaging improves alignment with ground truth mappings.
Multilingual averaging yields more accurate concept representations.
Method enhances mechanistic interpretability of LLM internal states.
Abstract
Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese, and then pass these texts as prompts to the Gemma 2B LLM. Using the open-source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology…
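The averaging step described in the abstract can be illustrated with a minimal sketch. It assumes the per-language activations are already available as arrays with one row per ontology class; the function name `conceptual_average`, the array shapes, and the synthetic data are hypothetical stand-ins, not the authors' code or real Gemma Scope output.

```python
import numpy as np

def conceptual_average(activations_by_language):
    """Average per-class activation vectors across languages.

    activations_by_language: dict mapping a language code to an array of
    shape (n_classes, n_features). Row i must refer to the same ontology
    class in every language (an assumption of this sketch).
    """
    # Stack to shape (n_languages, n_classes, n_features), then average
    # over the language axis.
    stacked = np.stack(list(activations_by_language.values()), axis=0)
    return stacked.mean(axis=0)

# Synthetic stand-in data: a shared per-class component plus
# language-specific noise. These are NOT real SAE activations.
rng = np.random.default_rng(0)
n_classes, n_features = 4, 16
shared = rng.random((n_classes, n_features))
activations = {
    lang: shared + 0.1 * rng.standard_normal((n_classes, n_features))
    for lang in ("en", "fr", "zh")
}

averages = conceptual_average(activations)
print(averages.shape)  # (4, 16)
```

Each row of `averages` plays the role of one class's conceptual average. How these averages are then compared against the ground truth mapping is not specified in the truncated abstract, so that step is left out here.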
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
