Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
Harshavardhan

TL;DR
This study introduces Feature Rivalry, defined as negatively correlated pairs of sparse autoencoder (SAE) features, to examine how uncertainty shapes feature competition in large language models. In Gemma-2-2B, high-entropy questions produce stronger rivalry at layers 0 and 12 than low-entropy questions do.
Contribution
The paper shows that feature rivalry correlates with model uncertainty, localizes that rivalry to specific layers of the residual stream, and finds that rivalry scores can predict answer correctness, offering mechanistic insight into LLM behavior.
Findings
High-entropy questions produce significantly stronger feature rivalry than low-entropy questions at layers 0 and 12.
Steering along rivalry axes influences model outputs.
Rivalry scores predict answer correctness with an AUROC of 0.689.
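For readers unfamiliar with the metric in the last finding, here is a minimal sketch of how an AUROC can be computed from per-question scores and binary labels. The function name `auroc`, the toy scores, and the toy labels are illustrative assumptions; they are not the paper's data or code, and the summary does not say which outcome the paper treats as the positive class.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: probability that a randomly chosen positive
    example scores higher than a randomly chosen negative one
    (ties count as one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative label")
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy example (made-up numbers, not results from the paper).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auroc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs are ordered correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so 0.689 indicates a moderate predictive signal.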
Abstract
Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry -- negatively correlated SAE feature pairs -- and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry at layers 0 and 12 relative to low-entropy questions (p=5.3x10^-26 and p=5.8x10^-5 respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes -- finding that steering along the rivalry direction (vec_A - vec_B) causes…
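The definition in the abstract (negatively correlated SAE feature pairs, with steering along vec_A - vec_B) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the correlation threshold of -0.3, the synthetic activations, the random stand-in decoder matrix `W_dec`, the reading of vec_A and vec_B as decoder vectors, and the function names. The paper's actual pair-selection criterion and steering setup may differ.

```python
import numpy as np

def rivalry_pairs(acts, threshold=-0.3):
    """Return (i, j, r) for feature pairs whose Pearson correlation r
    across samples falls below `threshold` (an assumed cutoff).
    acts has shape (n_samples, n_features)."""
    acts = np.asarray(acts, dtype=float)
    # Features that never vary have undefined correlation; skip them.
    active = np.flatnonzero(acts.std(axis=0) > 0)
    if len(active) < 2:
        return []
    corr = np.corrcoef(acts[:, active], rowvar=False)
    iu, ju = np.triu_indices(len(active), k=1)
    keep = corr[iu, ju] < threshold
    return [(int(active[i]), int(active[j]), float(corr[i, j]))
            for i, j in zip(iu[keep], ju[keep])]

def rivalry_direction(W_dec, a, b):
    """Unit vector along vec_A - vec_B, taking the vectors from a
    hypothetical decoder matrix of shape (n_features, d_model)."""
    d = W_dec[a] - W_dec[b]
    return d / np.linalg.norm(d)

def steer(resid, direction, alpha):
    """Add alpha times the rivalry direction to a residual-stream vector."""
    return resid + alpha * direction

# Synthetic activations: features 0 and 1 are perfectly anti-correlated,
# feature 2 is dead, feature 3 is uncorrelated with both by symmetry.
x = np.linspace(0.0, 1.0, 50)
acts = np.stack([x, 1.0 - x, np.zeros_like(x), (x - 0.5) ** 2], axis=1)
pairs = rivalry_pairs(acts)

W_dec = np.random.default_rng(0).normal(size=(4, 8))  # stand-in decoder
direction = rivalry_direction(W_dec, 0, 1)
steered = steer(np.zeros(8), direction, alpha=2.0)
```

On the synthetic data only the pair (0, 1) is flagged, with a correlation of approximately -1. In practice the activations would come from a Gemma Scope SAE applied to Gemma-2-2B residual-stream activations at a chosen layer.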
