Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control
Tsogt-Ochir Enkhbayar

TL;DR
This paper combines Model-X knockoffs with sparse autoencoders (SAEs) to separate task-relevant features from spurious ones while controlling the false discovery rate (FDR) in neural network interpretability, demonstrated on sentiment classification with Pythia-70M.
Contribution
A framework that applies Model-X knockoffs to SAE feature selection, controlling the FDR with finite-sample guarantees under the standard Model-X assumptions and making feature-level interpretability claims more reliable.
Findings
129 of 512 analyzed SAE latents selected at a target FDR of q=0.1
About 25% of analyzed latents carry task-relevant signal; about 75% do not
5.40x separation in knockoff statistics between selected and non-selected features
Abstract
Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish real computational patterns from spurious correlations. We introduce Model-X knockoffs to SAE feature selection, using knockoff+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). Analyzing 512 high-activity SAE latents for sentiment classification using Pythia-70M, we select 129 features at a target FDR of q=0.1. The selected set implies that about 25% of the latents under examination carry task-relevant signal while 75% do not, and it shows a 5.40x separation in knockoff statistics relative to non-selected features. Our method offers a reproducible and principled framework for reliable feature…
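The knockoff+ selection step named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the knockoff statistics W_j have already been computed (a large positive W_j favors the real latent over its knockoff), and the function name `knockoff_plus_select` and its arguments are hypothetical.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Knockoff+ selection (illustrative sketch).

    W : knockoff statistics, one per feature (assumed precomputed).
    q : target false discovery rate.
    Returns (indices of selected features, threshold used).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W, ascending.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Estimated false discovery proportion at threshold t.
        # The "+1" in the numerator is what distinguishes knockoff+.
        n_negative = np.sum(W <= -t)
        n_positive = np.sum(W >= t)
        fdp_hat = (1 + n_negative) / max(1, n_positive)
        if fdp_hat <= q:
            # Smallest threshold meeting the target: select everything above it.
            return np.flatnonzero(W >= t), t
    # No threshold meets the target: select nothing.
    return np.array([], dtype=int), np.inf
```

Calling `knockoff_plus_select(W, q=0.1)` on a vector of 512 statistics would return the indices of the selected latents and the data-dependent threshold. The FDR guarantee rests on the knockoffs being valid for the assumed feature distribution, which is why the abstract qualifies it with the Gaussian surrogate.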
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
