Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Ahyoung Oh, Wonseok Shin, Songkuk Kim

TL;DR
This paper introduces a Sparse Autoencoder (SAE) framework for out-of-distribution (OOD) detection in Vision Transformers (ViTs). It reveals class-specific activation patterns and a structural invariant, which the framework uses to improve detection performance.
Contribution
First application of Sparse Autoencoders (SAEs) to the ViT [CLS] token for OOD detection. The analysis uncovers class-specific activation patterns and leads to a divergence-based scoring method.
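This summary names a divergence-based score but does not say which divergence is used or how a sample is compared with the class patterns. The sketch below is a minimal illustration built on two assumptions, neither of which is the paper's stated method. First, the score is the KL divergence between a sample's normalized SAE activations and a class profile. Second, the sample is scored against its closest class. All function names are hypothetical.

```python
import numpy as np


def to_distribution(v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Turn a non-negative activation vector into a smoothed probability distribution."""
    v = np.clip(np.asarray(v, dtype=float), 0.0, None) + eps
    return v / v.sum()


def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two strictly positive distributions of equal length."""
    return float(np.sum(p * np.log(p / q)))


def divergence_ood_score(code: np.ndarray, profiles: np.ndarray) -> float:
    """Hypothetical OOD score: the smallest KL divergence from the sample's
    normalized sparse code to any class profile (one profile per row).
    Larger values mean the sample matches no class pattern well, i.e. more OOD-like."""
    p = to_distribution(code)
    return min(kl_divergence(p, to_distribution(prof)) for prof in profiles)
```

The small `eps` keeps the logarithm finite. Without it, any latent that fires in the sample but is inactive in a class profile would make that divergence infinite.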
Findings
Achieves low FPR95 (false positive rate on OOD data at a 95% true positive rate on ID data) across benchmarks.
Reveals stable class-specific activation patterns in in-distribution data.
Disruptions in activation patterns indicate OOD samples.
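FPR95 is the fraction of OOD samples wrongly accepted as in-distribution when the decision threshold is set to keep 95% of ID samples. The sketch below computes it under two assumptions. ID is treated as the positive class. Higher scores mean "more OOD-like"; under the opposite sign convention, the comparison simply flips. The function name is hypothetical.

```python
import numpy as np


def fpr_at_tpr(id_scores, ood_scores, tpr: float = 0.95) -> float:
    """False positive rate on OOD samples at the threshold where `tpr` of ID samples
    are accepted. Assumes higher score = more OOD-like (accept as ID if score <= t)."""
    id_sorted = np.sort(np.asarray(id_scores, dtype=float))
    ood = np.asarray(ood_scores, dtype=float)
    # Smallest threshold that accepts (score <= threshold) at least `tpr` of ID samples.
    idx = int(np.ceil(tpr * len(id_sorted))) - 1
    threshold = id_sorted[idx]
    # FPR: fraction of OOD samples that land on the ID side of the threshold.
    return float(np.mean(ood <= threshold))
```

For example, with 100 ID scores `0..99` the threshold is `94`. OOD scores `90..189` then give an FPR95 of 0.05, because five OOD samples fall at or below it.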
Abstract
Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD…
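The Top-k SAE step and the idea of per-class activation profiles can be sketched in NumPy. The encoder follows one common Top-k SAE formulation: subtract a decoder bias, apply a linear encoder with ReLU, then keep only the k largest latents for each [CLS] vector. The abstract does not define Class Activation Profiles precisely. Here a CAP is modeled, as an assumption, as the mean sparse code of each ID class. `class_activation_profiles` and the other names are hypothetical, and the weights in any usage are random placeholders rather than a trained SAE.

```python
import numpy as np


def topk_sae_encode(x, W_enc, b_enc, b_dec, k):
    """Encode dense [CLS] vectors x (n, d) into sparse codes (n, m).
    W_enc has shape (m, d). Only the k largest ReLU'd pre-activations
    per row are kept; every other latent is set to zero."""
    pre = np.maximum((x - b_dec) @ W_enc.T + b_enc, 0.0)  # (n, m)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]        # top-k latent indices per row
    codes = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    codes[rows, idx] = pre[rows, idx]
    return codes


def topk_sae_decode(codes, W_dec, b_dec):
    """Reconstruct dense vectors (n, d) from sparse codes; W_dec has shape (d, m)."""
    return codes @ W_dec.T + b_dec


def class_activation_profiles(codes, labels, num_classes):
    """Hypothetical CAP: the mean sparse code of each ID class, shape (num_classes, m)."""
    return np.stack([codes[labels == c].mean(axis=0) for c in range(num_classes)])
```

Because each row keeps at most k active latents, the profiles stay sparse. A test sample whose active latents fall outside its class's usual set is the kind of pattern disruption the paper associates with OOD inputs.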
