TL;DR
This paper introduces AdaptiveK SAE, a sparse autoencoder that adjusts the number of active features per input according to input complexity, improving both reconstruction quality and the interpretability of features extracted from large language models.
Contribution
It presents a novel framework that dynamically allocates sparsity in autoencoders guided by semantic complexity signals from LLMs, advancing interpretability methods.
Findings
Outperforms fixed-sparsity methods on multiple metrics
Shows that context complexity is linearly encoded in LLM representations
Reduces hyperparameter tuning and computational costs
Abstract
Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models (from 70M to 14B parameters) demonstrate that this complexity-driven adaptation significantly outperforms fixed-sparsity approaches on reconstruction fidelity,…
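The mechanism described in the abstract can be illustrated with a minimal sketch. The names k_min, k_max, c_min, and c_max come from the reviews' description of Eq. 7, but the equation itself is not reproduced on this page, so the clipped linear interpolation below is an assumption rather than the authors' formula. The function names and the probe parameters `w` and `b` are likewise hypothetical.

```python
import numpy as np


def probe_score(activation, w, b):
    """Linear probe: predicts a scalar complexity score from one LLM activation.

    `w` and `b` stand in for trained probe parameters (hypothetical here).
    """
    return float(activation @ w + b)


def adaptive_k(score, c_min, c_max, k_min, k_max):
    """Map a complexity score to an integer sparsity level k.

    ASSUMPTION: clipped linear interpolation between k_min and k_max.
    The paper's Eq. 7 may use a different functional form.
    """
    t = float(np.clip((score - c_min) / (c_max - c_min), 0.0, 1.0))
    return int(round(k_min + t * (k_max - k_min)))


def topk_encode(pre_acts, k):
    """Keep the k largest pre-activations (negatives clamped to 0), zero the rest."""
    out = np.zeros_like(pre_acts)
    idx = np.argpartition(pre_acts, -k)[-k:]  # indices of the k largest entries
    out[idx] = np.maximum(pre_acts[idx], 0.0)
    return out


# Usage: a more complex input is given a larger feature budget.
k_simple = adaptive_k(score=1.0, c_min=0.0, c_max=10.0, k_min=20, k_max=320)
k_complex = adaptive_k(score=9.0, c_min=0.0, c_max=10.0, k_min=20, k_max=320)
codes = topk_encode(np.array([0.1, 3.0, -1.0, 2.0, 0.5]), k=2)
```

In this sketch a fixed-sparsity Top-K SAE is the special case k_min = k_max, which makes the reviewers' point concrete: the single hyperparameter k is replaced by a range.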
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Novel direction: relating feature activation quantity to text complexity, potentially refining SAE sparsity allocation.
- The use of a linear probe to estimate complexity offers a simple and scalable mechanism.
- Experiments cover a wide range of LLM scales and provide both reconstruction and interpretability analyses.
- The idea could inform future work on dynamic sparsity or resource allocation in interpretability models.
The paper’s core idea is conceptually meaningful and experimentally supported, but:
- The connection between k_{\text{adp}} and the complexity score (Eq. 7) is heuristic and lacks clear motivation or theoretical justification.
- The explained variance (EV) formula in line 395 is incorrect, and this error appears to propagate into key results (e.g., Pareto frontier in Fig. 6). You can refer to https://arxiv.org/pdf/2410.20526 for the right one.
- Unexpectedly poor baseline performance: In Figure
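For context on the explained-variance point in the review above, here is a minimal sketch of one commonly used definition: the fraction of variance unexplained (FVU) is the squared reconstruction error divided by the total variance around the batch mean, and EV = 1 - FVU. This is not confirmed to be the exact form used in the reference the reviewer links, nor the form the paper uses at line 395. The function names are illustrative.

```python
import numpy as np


def fvu(x, x_hat):
    """Fraction of variance unexplained for a batch of activations.

    x, x_hat: arrays of shape (n_samples, d_model).
    Residual sum of squares divided by total sum of squares around the
    per-dimension batch mean.
    """
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return float(resid / total)


def explained_variance(x, x_hat):
    """EV = 1 - FVU; 1.0 is perfect reconstruction, 0.0 matches predicting the mean."""
    return 1.0 - fvu(x, x_hat)


# Usage: two sanity checks that any EV implementation should pass.
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
ev_perfect = explained_variance(x, x.copy())
ev_mean = explained_variance(x, np.tile(x.mean(axis=0), (3, 1)))
```

Checks of this kind (a perfect reconstruction scores 1, a mean-only reconstruction scores 0) are a quick way to test whether a given EV formula is normalized as intended.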
- Figure 2 is well presented.
- On the Pareto plots, their approach does indeed seem to outperform the previous approaches. It would be nice to see a Pareto frontier for their approach, e.g. as you adjust the average sparsity, how does their approach vary compared to other approaches?
- The approach that the authors look at is intuitively interesting.
- If the authors' claim that their approach requires less hyperparameter tuning is true, then this is a great quality of the approach.
- Clarity of written exposition is quite low, with several sections that are not well explained. Section 2.1 in particular would be hard to follow for someone not familiar with the literature.
- Figure 1 is difficult to interpret: is the error fixed for all of the SAEs presented?
  - If so, then their approach is outperformed by others.
  - If not, then having fewer activated features is not a virtue, since the error of their approach could be higher. Here we really need a Pareto plot.
1. The paper’s core idea (using the model’s own representation of complexity to guide the degree of sparsity) is remarkably clever. It fills a major gap in the literature using a promising approach. The conceptual contribution here is quite large.
2. The evaluation of whether LLMs represent complexity linearly seems very well executed; I appreciate the completeness of e.g. Section 3.1.2.
3. The paper does a wide range of analysis, including evaluations with SAEBench, comparisons to many base
Major concerns (Addressing these would make me increase my score. The idea of the paper is good, but the evaluations need to improve)
1. The authors claim that one of the major benefits of their approach is that it removes the necessity of expensive hyperparameter sweeps to find the right value of k. However, in Equation 7 they introduce at least 2 new hyperparameters, k_min and k_max. (c_min and c_max may also be intended as hyperparameters, though I imagine they could be set automatically
- Impressive results on FVU/L0, CosSim/L0, and Loss/L0, which improve the Pareto frontier substantially and move the paper beyond incremental-improvement territory.
- Clear motivation.
- Well placed in the literature, as the set of methods this paper beats shows.
Methodological:
- GPT-4.1 mini is a weak way to score complexity, and you train the probes on GPT-4.1 mini's complexity estimates.
- Only studies classic SAEs and not other variants like crosscoders.
- The percentage of interpretable features is omitted (only a case study is provided). This is an issue because you can't control sparsity (which is tied to interpretability) to compensate if interpretability is low.
If the above are addressed I will raise my score to 8.
Style:
- no quantified improvemen
