Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control

Tsogt-Ochir Enkhbayar

arXiv:2511.11711 · cs.LG · November 18, 2025

Open Access

TL;DR

This paper introduces a method combining Model-X knockoffs with sparse autoencoders to reliably identify true features and control false discoveries in neural network interpretability, demonstrated on sentiment analysis data.

Contribution

The paper presents a framework that integrates Model-X knockoffs with SAE feature selection to control the FDR with finite-sample guarantees, making feature-level interpretability claims more reliable.

Findings

01

129 features selected at FDR q=0.1

02

25% of analyzed latents carry task-relevant signals

03

5.40x separation in knockoff statistics between selected and non-selected features

Abstract

Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish real computational patterns from spurious correlations. We introduce Model-X knockoffs to SAE feature selection, using knockoff+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). Analyzing 512 high-activity SAE latents for sentiment classification with Pythia-70M, we select 129 features at a target FDR of q=0.1. The selected set implies that about 25% of the examined latents carry task-relevant signal while roughly 75% do not, and it shows a 5.40x separation in knockoff statistics relative to non-selected features. Our method offers a reproducible and principled framework for reliable feature…
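
For readers unfamiliar with the knockoff+ step mentioned in the abstract, the following is a minimal sketch of the standard knockoff+ selection rule, not the paper's own code. It assumes a vector W of per-feature knockoff statistics has already been computed (large positive values favor the real feature over its knockoff). The function name knockoff_plus_select and the synthetic W values are hypothetical and chosen only for illustration.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Standard knockoff+ rule (hypothetical helper, not from the paper).

    Finds the smallest threshold t among the nonzero |W_j| such that
        (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q
    and returns the indices with W_j >= t together with t.
    If no threshold qualifies, nothing is selected and t is infinity.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes, in ascending order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        n_neg = np.sum(W <= -t)   # proxy count of false discoveries
        n_pos = np.sum(W >= t)    # features that would be selected
        fdp_hat = (1 + n_neg) / max(1, n_pos)
        if fdp_hat <= q:
            return np.flatnonzero(W >= t), float(t)
    return np.array([], dtype=int), float("inf")

# Synthetic statistics for illustration only (not data from the paper).
W_demo = np.array([4.0, 3.2, 2.7, 2.1, 1.8, 1.5, 1.2, 1.1, 0.9, 0.8, -0.5, 0.0])
selected, threshold = knockoff_plus_select(W_demo, q=0.25)
print("selected indices:", selected, "threshold:", threshold)
```

The "+1" in the numerator is what distinguishes knockoff+ from the plain knockoff filter. It makes the estimate of the false discovery proportion slightly conservative, which is what yields FDR control at level q rather than control of a modified FDR.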

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning