Disentangling concept semantics via multilingual averaging in Sparse Autoencoders

Cliff O'Reilly; Ernesto Jimenez-Ruiz; Tillman Weyde

arXiv:2508.14275·cs.CL·August 21, 2025

Open Access

TL;DR

This paper introduces a method to disentangle concept semantics in Large Language Models by averaging concept activations across multiple languages using Sparse Autoencoders, improving interpretability and alignment with formal knowledge representations.

Contribution

The paper proposes a novel multilingual averaging technique to isolate concept semantics in LLMs, enhancing interpretability and alignment with ontology-based knowledge.

Findings

01

Conceptual averaging improves alignment with ground truth mappings.

02

Multilingual averaging yields more accurate concept representations.

03

Method enhances mechanistic interpretability of LLM internal states.

Abstract

Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese, and then pass these texts as prompts to the Gemma 2B LLM. Using the open-source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology…
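The averaging step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`conceptual_average`, `cosine_similarity_matrix`, `mapping_accuracy`) and the toy arrays are hypothetical, and because the abstract is truncated before it explains how the averages are compared with the ground truth, the nearest-neighbour matching by cosine similarity used here is an assumption. Extracting the per-language activations from Gemma 2B with Gemma Scope is also out of scope, so small placeholder matrices stand in for them.

```python
import numpy as np


def conceptual_average(activations_by_language):
    """Element-wise mean of per-language activation matrices.

    Each value is assumed to have shape (n_classes, n_features), with rows
    aligned so that row i refers to the same ontology class in every language.
    """
    stacked = np.stack(list(activations_by_language.values()), axis=0)
    return stacked.mean(axis=0)


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # The small floor avoids division by zero for all-zero activation rows.
    a_norm = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_norm = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a_norm @ b_norm.T


def mapping_accuracy(source, target, ground_truth):
    """Fraction of source classes whose most similar target class is the
    one named in ground_truth (assumed matching rule, see lead-in)."""
    predicted = cosine_similarity_matrix(source, target).argmax(axis=1)
    return float(np.mean(predicted == np.asarray(ground_truth)))


# Toy placeholder activations: two classes, three features per language.
# Real inputs would be sparse autoencoder feature activations.
toy = {
    "en": np.array([[1.0, 0.0, 4.0], [0.0, 2.0, 0.0]]),
    "fr": np.array([[3.0, 0.0, 0.0], [0.0, 4.0, 2.0]]),
    "zh": np.array([[2.0, 0.0, 2.0], [0.0, 0.0, 1.0]]),
}
average = conceptual_average(toy)
print(average)
```

In this sketch, features that fire only for one language version are damped by the mean, while features shared across all three versions keep their magnitude. With two ontologies, `mapping_accuracy` would be called on the conceptual averages of each ontology's classes, with `ground_truth[i]` giving the index of the target class aligned to source class `i`.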

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis