Sparse Reward Subsystem in Large Language Models
Guowei Xu, Mert Yuksekgonul, James Zou

TL;DR
This paper uncovers a sparse reward subsystem within large language models, identifying specific neurons that encode reward-related information, and demonstrates their robustness and practical applications.
Contribution
The paper shows that reward-related information is concentrated in a sparse subset of neurons, forming a reward subsystem analogous to the biological reward system, with practical applications in predicting model confidence and guiding inference.
Findings
Reward-related information is concentrated in a sparse subset of neurons.
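Two neuron types are identified: value neurons, whose activations predict state value, and dopamine neurons, whose activations encode step-level temporal difference (TD) errors.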
Value neurons are robust, transferable, and can predict model confidence.
Abstract
Recent studies show that LLM hidden states encode reward-related information, such as answer correctness and model confidence. However, existing approaches typically fit black-box probes on the full hidden states, offering little insight into how this information is structured across neurons. In this paper, we show that reward-related information is concentrated in a sparse subset of neurons. Using simple probing, we identify two types of neurons: value neurons, whose activations predict state value, and dopamine neurons, whose activations encode step-level temporal difference (TD) errors. Together, these neurons form a sparse reward subsystem within LLM hidden states. These names are drawn by analogy with neuroscience, where value neurons and dopamine neurons in the biological reward subsystem also encode value and reward prediction errors, respectively. We demonstrate that value…
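For readers unfamiliar with the term, the step-level TD error mentioned in the abstract can be illustrated with the textbook definition, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). The sketch below is only an illustration, not the authors' implementation: the function name `td_errors`, the discount `gamma = 1.0`, the zero value after the final step, and the example numbers are all assumptions made here.

```python
import numpy as np

def td_errors(values, rewards, gamma=1.0):
    """Textbook step-level TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    Assumption (not from the paper): the value after the final step is 0.
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Shift values left by one step and pad with 0 for the terminal state.
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values

# Hypothetical numbers: value estimates over four reasoning steps,
# with a reward of 1 given only at the final step.
values = [0.5, 0.6, 0.3, 0.9]
rewards = [0.0, 0.0, 0.0, 1.0]
deltas = td_errors(values, rewards)
```

Under these assumptions, a positive entry in `deltas` marks a step where the estimated value rose more than expected, and a negative entry marks a step where it dropped.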
