From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders
Yifan Luo, Yang Zhan, Jiedong Jiang, Tianyang Liu, Mingrui Wu, Zhennan Zhou, Bin Dong

TL;DR
This paper introduces Hierarchical Sparse Autoencoders (HSAE), a novel method for uncovering hierarchical, semantically meaningful structures in large language models, enhancing interpretability without sacrificing reconstruction quality.
Contribution
HSAE jointly learns a series of SAEs and the parent-child relationships between their features, aligning parent and child features through a structural constraint loss and a random feature perturbation mechanism. This advances the analysis of LLM internal representations.
Findings
HSAE reliably recovers semantic hierarchies across models and layers.
HSAE maintains high reconstruction fidelity and interpretability.
Qualitative and quantitative evaluations validate HSAE's effectiveness.
Abstract
Sparse autoencoders (SAEs) have proven effective for extracting monosemantic features from large language models (LLMs), yet these features are typically identified in isolation. However, broad evidence suggests that LLMs capture the intrinsic structure of natural language, where the phenomenon of "feature splitting" in particular indicates that such structure is hierarchical. To capture this, we propose the Hierarchical Sparse Autoencoder (HSAE), which jointly learns a series of SAEs and the parent-child relationships between their features. HSAE strengthens the alignment between parent and child features through two novel mechanisms: a structural constraint loss and a random feature perturbation mechanism. Extensive experiments across various LLMs and layers demonstrate that HSAE consistently recovers semantically meaningful hierarchies, supported by both qualitative case studies and…
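To make the vocabulary in the abstract concrete, here is a minimal sketch of a plain sparse-autoencoder forward pass (ReLU encoder, linear decoder) and a purely illustrative rule that links each child feature to the parent whose decoder direction is most similar. This is not the paper's method: HSAE learns the hierarchy jointly during training, and its structural constraint loss and random feature perturbation are not specified in the text above. The function names (`sae_forward`, `assign_parents`), the cosine-similarity rule, and the toy weights are all assumptions made for illustration.

```python
import math


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Plain SAE pass: z = ReLU(W_enc x + b_enc), x_hat = sum_i z_i * d_i + b_dec.

    w_enc and w_dec each hold one row per feature; a row of w_dec is that
    feature's decoder direction d_i.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps the code non-negative and sparse
    x_hat = list(b_dec)
    for zi, direction in zip(z, w_dec):
        for j, d in enumerate(direction):
            x_hat[j] += zi * d
    return z, x_hat


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def assign_parents(child_dirs, parent_dirs):
    """ASSUMPTION (not from the paper): link each child feature to the parent
    whose decoder direction has the highest cosine similarity."""
    return [
        max(range(len(parent_dirs)), key=lambda p: cosine(c, parent_dirs[p]))
        for c in child_dirs
    ]


# Toy example with hypothetical weights: 2 parent features, 3 child features.
parent_dirs = [[1.0, 0.0], [0.0, 1.0]]
child_dirs = [[0.9, 0.1], [0.8, -0.2], [0.1, 0.95]]
parents = assign_parents(child_dirs, parent_dirs)

identity = [[1.0, 0.0], [0.0, 1.0]]
z, x_hat = sae_forward([1.0, -1.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

In this toy setup the first two child directions attach to parent 0 and the third to parent 1, which mirrors the "feature splitting" picture in which one coarse feature divides into several finer ones.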
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Domain Adaptation and Few-Shot Learning
