Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures
Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

TL;DR
This paper introduces a hierarchical semantic model within sparse autoencoders, enhancing interpretability, reconstruction, and efficiency by explicitly capturing semantic relationships among concepts.
Contribution
It presents a novel SAE architecture that models semantic hierarchies, improving interpretability and computational efficiency in learned representations.
Findings
Semantic hierarchies can be learned within large language models.
The new architecture improves reconstruction accuracy.
Significant computational efficiency gains are achieved.
Abstract
Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
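The abstract does not spell out the architecture, so the following is only a minimal sketch of what a two-level hierarchical SAE forward pass could look like, not the authors' implementation. It assumes what the review excerpts on this page suggest: a parent/child hierarchy with MoE-style routing, a low-dimensional subspace of size s (4 or 8) per parent, and 16 or 64 child features per parent. Everything else is hypothetical and chosen for illustration: the class name `TwoLevelSAE`, the weight names, TopK selection over parents, and top-1 selection among children.

```python
import numpy as np


def topk_mask(values, k):
    """Keep the k largest entries of a 1-D array and zero the rest."""
    out = np.zeros_like(values)
    idx = np.argpartition(values, -k)[-k:]
    out[idx] = values[idx]
    return out


class TwoLevelSAE:
    """Illustrative two-level (parent/child) sparse autoencoder.

    All shapes and the routing rule are assumptions made for this sketch.
    """

    def __init__(self, d_model, n_parents, n_children, s, k, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        # Top-level (parent) dictionary over the full activation space.
        self.W_enc = rng.normal(0, 0.1, (n_parents, d_model))
        self.W_dec = rng.normal(0, 0.1, (n_parents, d_model))
        # Per-parent projection into an s-dimensional subspace and back.
        self.P_down = rng.normal(0, 0.1, (n_parents, s, d_model))
        self.P_up = rng.normal(0, 0.1, (n_parents, d_model, s))
        # Per-parent child dictionary that lives inside that subspace.
        self.C_enc = rng.normal(0, 0.1, (n_parents, n_children, s))
        self.C_dec = rng.normal(0, 0.1, (n_parents, s, n_children))

    def forward(self, x):
        # Parent level: ReLU pre-activations, then keep at most k parents.
        parent_act = topk_mask(np.maximum(self.W_enc @ x, 0.0), self.k)
        recon = parent_act @ self.W_dec
        child_acts = {}
        # Child level: only the selected parents evaluate their children,
        # which mirrors expert routing in a mixture-of-experts layer.
        for p in np.flatnonzero(parent_act):
            z = self.P_down[p] @ x                       # shape (s,)
            c = topk_mask(np.maximum(self.C_enc[p] @ z, 0.0), 1)
            child_acts[int(p)] = c
            recon = recon + self.P_up[p] @ (self.C_dec[p] @ c)
        return recon, parent_act, child_acts


sae = TwoLevelSAE(d_model=16, n_parents=32, n_children=16, s=4, k=4)
x = np.random.default_rng(1).normal(size=16)
recon, parent_act, child_acts = sae.forward(x)
```

In this sketch only the child dictionaries of the selected parents are evaluated, and each one operates on an s-dimensional vector rather than the full activation. That is one plausible source of the efficiency gains the abstract mentions, though the paper itself should be consulted for the actual mechanism.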
Peer Reviews
Decision·Submitted to ICLR 2026
- Hierarchical encoding is a very interesting topic and an important component of modeling LLM representations. Hierarchical structure is inherently quite interpretable, and intuitively resolves many issues with existing SAEs (such as splitting and absorption).
- Additionally, this work is grounded in existing literature on the hierarchy of representations in language models, and the MoE architecture is an intuitive operationalization of hierarchies.
- The results are promising, particularly for
- Given that for each high-level feature, there are 16 or 64 sub-features, shouldn’t the authors be comparing to standard SAEs with 16x or 64x the number of normal latents (essentially the size of all of the sub-features from HSAEs combined) to ensure overall width is the same? If they are already doing this, please ignore this comment but clarify this point in the paper.
- Why not always compare to both TopK and Matryoshka SAEs? For each experiment, the authors only use one or the other as the
The core idea of using a hierarchical, MoE-style architecture for SAEs is simple but practical. The gain in computational efficiency is a significant result on its own, potentially unblocking efforts to scale interpretability tools to frontier models. The experimental validation is quite thorough. The authors combined standard reconstruction loss with metrics for downstream task performance like CE loss, as well as feature absorption and cross-lingual feature redundancy. The direct comparison t
One limitation is that the proposed architecture implements a two-level hierarchy (parent/child). Real-world semantics are often deeper. The paper doesn't discuss the limitations of this two-level structure or how the architecture might be extended to model deeper, more complex hierarchies. The subspace dimension $s$ is a key hyperparameter, set to 4 or 8. While this is motivated by the low-rank finding from prior work and benefits efficiency, there is no sensitivity analysis or further justificati
* The authors provide a novel architecture and approach grounded in theories of hierarchical concept representations in LLMs
* The method is simple and intuitive
* Empirical results highlight the efficacy of the method in terms of reconstruction metrics as well as more nuanced benchmarks such as feature absorption and feature universality across language.
* I think the absorption experiment is a great way to demonstrate the efficacy of this method, as it intuitively (and empirically) seems cl
(Apologies for any clarity issues, I bounce between saying high- and low-, top- and low-, etc. to describe your hierarchy of features). The authors argue that this architecture is more useful for three primary reasons: 1) H-SAEs have better reconstruction, 2) H-SAEs are more interpretable, and 3) H-SAEs learn hierarchical semantics. However, I see some issues with these claims: 1. If I understand Figure 3 properly, you compare H-SAEs against SAEs with the same number of top- or total number of
Taxonomy
Topics: Embedded Systems Design Techniques
