Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability
Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves

TL;DR
This paper introduces an information-theoretic measure using sparse autoencoders to quantify superposition in neural networks, revealing its relationship with adversarial vulnerability and network capacity.
Contribution
It presents a novel metric for measuring superposition as lossy compression, linking it to network robustness and capacity, and challenges existing assumptions about adversarial vulnerability.
Findings
The metric correlates strongly with ground-truth superposition in toy models.
Measured superposition decreases under dropout and during feature consolidation.
Adversarial training can increase the number of effective features while also improving robustness.
Abstract
Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in…
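The entropy-based count described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the effective feature count is the exponential of the Shannon entropy of each SAE latent's share of mean absolute activation, and that the degree of superposition is that count divided by the number of actual neurons. The normalization and the names `effective_features` and `superposition_ratio` are hypothetical.

```python
import numpy as np

def effective_features(sae_acts: np.ndarray) -> float:
    """Perplexity-style feature count from SAE activations.

    sae_acts has shape (n_samples, n_latents). Assumption: each latent's
    probability is its share of the total mean absolute activation.
    """
    mass = np.abs(sae_acts).mean(axis=0)   # mean |activation| per latent
    total = mass.sum()
    if total == 0:
        return 0.0                         # no active latents at all
    p = mass / total
    p = p[p > 0]                           # treat 0 * log 0 as 0
    entropy = -(p * np.log(p)).sum()       # Shannon entropy in nats
    return float(np.exp(entropy))

def superposition_ratio(sae_acts: np.ndarray, n_neurons: int) -> float:
    """Effective features per actual neuron (hypothetical helper).

    Under the assumptions above, a value greater than 1 means the layer
    represents more effective features than it has neurons.
    """
    return effective_features(sae_acts) / n_neurons

# Eight equally active latents behave like eight effective features.
uniform_acts = np.eye(8)
# A single active latent behaves like one effective feature.
single_acts = np.zeros((8, 8))
single_acts[:, 0] = 1.0

print(effective_features(uniform_acts))
print(effective_features(single_acts))
print(superposition_ratio(uniform_acts, n_neurons=4))
```

Exponentiating the entropy converts it into a count: it equals the number of latents when all are equally active and falls toward one as activation mass concentrates in a single latent.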
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
