Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

Mark Muchane; Sean Richardson; Kiho Park; Victor Veitch

arXiv:2506.01197·cs.CL·June 3, 2025

TL;DR

This paper introduces a hierarchical semantic model within sparse autoencoders, enhancing interpretability, reconstruction, and efficiency by explicitly capturing semantic relationships among concepts.

Contribution

It presents a novel SAE architecture that models semantic hierarchies, improving interpretability and computational efficiency in learned representations.

Findings

01. Semantic hierarchies can be learned within large language models.
02. The new architecture improves reconstruction accuracy.
03. Significant computational efficiency gains are achieved.

Abstract

Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
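To make the idea in the abstract concrete, here is a minimal NumPy sketch of a two-level, mixture-of-experts-style forward pass. It is not the authors' implementation. The class name `HierarchicalSAESketch`, the weight names, the random initialization, the top-1 choice among children, and the way parent and child contributions are combined are all assumptions made for illustration. Only the general shape (a parent latent gating a small set of child latents that live in a low-dimensional subspace) and the sizes used in the example (16 children per parent, subspace dimension 4) follow the review excerpts on this page.

```python
import numpy as np

def topk_mask(values, k):
    """Keep the k largest entries of a 1-D array and zero the rest."""
    out = np.zeros_like(values)
    idx = np.argpartition(values, -k)[-k:]
    out[idx] = values[idx]
    return out

class HierarchicalSAESketch:
    """Hypothetical two-level SAE: each parent latent owns a small child SAE."""

    def __init__(self, d_model, n_parents, n_children, s, k, seed=0):
        rng = np.random.default_rng(seed)
        self.n_parents, self.n_children, self.k = n_parents, n_children, k
        # Parent (top-level) encoder and decoder directions.
        self.W_enc = rng.normal(0, 0.1, (n_parents, d_model))
        self.W_dec = rng.normal(0, 0.1, (n_parents, d_model))
        # Per-parent projections into and out of an s-dimensional subspace (assumed).
        self.P_down = rng.normal(0, 0.1, (n_parents, s, d_model))
        self.P_up = rng.normal(0, 0.1, (n_parents, d_model, s))
        # Per-parent child encoder and decoder inside that subspace (assumed).
        self.C_enc = rng.normal(0, 0.1, (n_parents, n_children, s))
        self.C_dec = rng.normal(0, 0.1, (n_parents, s, n_children))

    def n_latents(self):
        """Total latent count: every parent plus all of its children."""
        return self.n_parents * (1 + self.n_children)

    def forward(self, x):
        # Top-level TopK over ReLU pre-activations selects the active parents.
        parent_act = topk_mask(np.maximum(self.W_enc @ x, 0.0), self.k)
        x_hat = np.zeros_like(x)
        child_acts = {}
        # Child SAEs are evaluated only for active parents (the MoE-style saving).
        for p in np.flatnonzero(parent_act):
            z = self.P_down[p] @ x                                 # shape (s,)
            c = topk_mask(np.maximum(self.C_enc[p] @ z, 0.0), 1)   # top-1 child (assumed)
            child_acts[int(p)] = c
            refinement = self.P_up[p] @ (self.C_dec[p] @ c)        # shape (d_model,)
            x_hat += parent_act[p] * (self.W_dec[p] + refinement)
        return x_hat, parent_act, child_acts

# Example usage on a random vector standing in for an LLM activation.
rng = np.random.default_rng(1)
x = rng.normal(size=64)
model = HierarchicalSAESketch(d_model=64, n_parents=32, n_children=16, s=4, k=4)
x_hat, parent_act, child_acts = model.forward(x)
```

In this sketch the per-input cost of the child level scales with the number of active parents `k` rather than with the total number of latents returned by `n_latents()`, which is one plausible reading of where the reported efficiency gains come from.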

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- Hierarchical encoding is a very interesting topic and an important component of modeling LLM representations. Hierarchical structure is inherently quite interpretable, and intuitively resolves many issues with existing SAEs (such as splitting and absorption).
- Additionally, this work is grounded in existing literature on the hierarchy of representations in language models, and the MoE architecture is an intuitive operationalization of hierarchies.
- The results are promising, particularly for

Weaknesses

- Given that for each high-level feature, there are 16 or 64 sub-features, shouldn’t the authors be comparing to standard SAEs with 16x or 64x the number of normal latents (essentially the size of all of the sub-features from HSAEs combined) to ensure overall width is the same? If they are already doing this, please ignore this comment but clarify this point in the paper.
- Why not always compare to both TopK and Matryoshka SAEs? For each experiment, the authors only use one or the other as the

Reviewer 02·Rating 8·Confidence 4

Strengths

The core idea of using a hierarchical, MoE-style architecture for SAEs is simple but practical. The gain in computational efficiency is a significant result on its own, potentially unblocking efforts to scale interpretability tools to frontier models. The experimental validation is quite thorough. The authors combined standard reconstruction loss with metrics for downstream task performance like CE loss, as well as feature absorption and cross-lingual feature redundancy. The direct comparison t

Weaknesses

One limitation is that the proposed architecture implements a two-level hierarchy (parent/child). Real-world semantics are often deeper. The paper doesn't discuss the limitations of this two-level structure or how the architecture might be extended to model deeper, more complex hierarchies. The subspace dimension $s$ is a key hyperparameter, set to 4 or 8. While this is motivated by the low-rank finding from prior work and benefits efficiency, there is no sensitivity analysis or further justificati

Reviewer 03·Rating 4·Confidence 5

Strengths

* The authors provide a novel architecture and approach grounded in theories of hierarchical concept representations in LLMs
* The method is simple and intuitive
* Empirical results highlight the efficacy of the method in terms of reconstruction metrics as well as more nuanced benchmarks such as feature absorption and feature universality across language.
* I think the absorption experiment is a great way to demonstrate the efficacy of this method, as it intuitively (and empirically) seems cl

Weaknesses

(Apologies for any clarity issues, I bounce between saying high- and low-, top- and low-, etc. to describe your hierarchy of features). The authors argue that this architecture is more useful for three primary reasons: 1) H-SAEs have better reconstruction, 2) H-SAEs are more interpretable, and 3) H-SAEs learn hierarchical semantics. However, I see some issues with these claims:
1. If I understand Figure 3 properly, you compare H-SAEs against SAEs with the same number of top- or total number of
