Spectral Bellman Method: Unifying Representation and Exploration in RL
Ofir Nabati, Bo Dai, Shie Mannor, Guy Tennenholtz

TL;DR
The Spectral Bellman Method introduces a spectral framework for representation learning in RL, aligning features with Bellman dynamics to improve exploration and performance in complex tasks.
Contribution
It presents a novel spectral approach based on the Inherent Bellman Error condition, unifying representation and exploration in value-based reinforcement learning.
Findings
Enhanced exploration in hard tasks
Representation learning aligned with Bellman updates
Extension to multi-step Bellman operators
Abstract
Representation learning is critical to the empirical and theoretical success of reinforcement learning. However, many existing methods are induced from model-learning aspects, misaligning them with the RL task in hand. This work introduces the Spectral Bellman Method, a novel framework derived from the Inherent Bellman Error (IBE) condition. It aligns representation learning with the fundamental structure of Bellman updates across a space of possible value functions, making it directly suited for value-based RL. Our key insight is a fundamental spectral relationship: under the zero-IBE condition, the transformation of a distribution of value functions by the Bellman operator is intrinsically linked to the feature covariance structure. This connection yields a new, theoretically-grounded objective for learning state-action features that capture this Bellman-aligned…
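To make the zero-IBE condition in the abstract concrete, here is a minimal sketch, not the paper's algorithm. It assumes the standard definition of Inherent Bellman Error for linear value functions (how far the Bellman backup of a linear Q-function falls outside the span of the features) and a hypothetical random toy MDP. All names (`bellman_backup`, `bellman_residual`, the MDP sizes) are illustrative. A least-squares fit stands in for the exact max-norm minimisation, so the reported number is an upper bound that is zero exactly when the backup is representable.

```python
import numpy as np

def bellman_backup(q, rewards, transitions, gamma):
    """Bellman optimality backup of a Q-table with shape (S, A)."""
    next_values = q.max(axis=1)                       # V(s') = max_a' Q(s', a')
    return rewards + gamma * (transitions @ next_values)

def bellman_residual(features, theta, rewards, transitions, gamma):
    """Max-norm gap between T(Q_theta) and its least-squares fit in the feature span.

    Least squares minimises the L2 error, so this upper-bounds the best
    achievable max-norm error; it is zero iff T(Q_theta) is linear in the features.
    """
    n_states, n_actions, dim = features.shape
    phi = features.reshape(n_states * n_actions, dim)
    q = (phi @ theta).reshape(n_states, n_actions)
    target = bellman_backup(q, rewards, transitions, gamma).reshape(-1)
    theta_tilde, *_ = np.linalg.lstsq(phi, target, rcond=None)
    return float(np.max(np.abs(phi @ theta_tilde - target)))

# Hypothetical toy MDP (assumption: random dynamics and rewards).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 2, 0.9
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S)
rewards = rng.uniform(size=(n_states, n_actions))

# One-hot (tabular) features can represent any Q-function, so the gap vanishes.
tabular = np.eye(n_states * n_actions).reshape(n_states, n_actions, -1)
# Random 3-dimensional features generally cannot represent the backup.
low_rank = rng.normal(size=(n_states, n_actions, 3))

tabular_residual = max(
    bellman_residual(tabular, rng.normal(size=n_states * n_actions),
                     rewards, transitions, gamma)
    for _ in range(20)
)
low_rank_residual = max(
    bellman_residual(low_rank, rng.normal(size=3), rewards, transitions, gamma)
    for _ in range(20)
)
print("tabular features:", tabular_residual)
print("random 3-d features:", low_rank_residual)
```

The tabular case is the trivial instance of a function class closed under the Bellman operator; the random low-dimensional case shows the kind of gap a learned representation would aim to shrink.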
Peer Reviews
Decision·ICLR 2026 Poster
- The paper presents a well-executed idea. The proposed method has solid mathematical backing, and the derived algorithm stems directly from the theoretical insights. It is encouraging to see these insights translate into meaningful empirical gains.
- The key insight regarding the spectral properties and the zero-IBE condition is particularly elegant, as it helps reduce a complex optimization problem to a more tractable one. It is a further strength that the Thompson sampling exploration fits we
- The mathematical derivations in Section 3 could be significantly improved for clarity. As written, the section is overly dense. The authors should consider simplifying complex notations (e.g., $\tilde{\theta}(\theta)$) and adding intuitive explanations to make the theoretical statements more comprehensible.
- The paper would also benefit from a small-scale, illustrative experiment. Such an experiment would be valuable for building intuition and helping isolate the source of the method's benefit
Strength:
1. The paper introduces a novel objective for representation learning by reframing the intractable problem of minimizing Inherent Bellman Error (IBE) into a more tractable proxy based on the spectral properties of the Bellman operator.
2. The framework provides a tight, natural coupling between representation and exploration. The feature covariance matrix learned by SBM is directly used to guide Thompson Sampling, creating a coherent feedback loop where better representations inform m
Major:
1. The theoretical motivation for the SBM objective, particularly the spectral decomposition outlined in Theorem 1, is quite elegant. This derivation hinges on the ideal assumption of zero Inherent Bellman Error (IBE), where the function space is perfectly closed under the Bellman operator. I am curious about how the proposed framework is expected to behave when this assumption is inevitably relaxed in practice, as is the case when using complex neural network approximators. Could the aut
* The work is well motivated; it makes sense to learn representations which directly support exploration/learning better strategies.
* The theory is well motivated and sound. Specifically, spectral decomposition under zero IBE conditions and the way to use power iteration for SBM loss optimisation are useful results.
* The performance gains are good when SBM + TS is used with baseline algorithms; this is particularly substantial in the hard exploration games.
* The extension to multi-step opera
* Baseline: The paper is missing two critical baselines: DQN + TS, or any DQN-based method which uses TS (for instance BDQN (Azizzadenesheli et al. 2018)) but does not use spectral features. The same holds for R2D2: the work is missing a critical baseline, R2D2 + TS without spectral features (which I understand might be non-trivial to implement, but without this it is very hard to truly gauge SBM's effectiveness). This is extremely important to have to be able to gauge the efficacy of SB
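Several of the reviews above refer to Thompson Sampling (TS) guided by a feature covariance matrix. As a rough illustration of that general idea only, the sketch below runs TS over a Bayesian linear model with fixed features on a hypothetical three-action problem. It is not the SBM algorithm: in the paper the features are learned, whereas here they are fixed one-hot vectors, and the class name, prior, and noise scale are all assumptions made for the example.

```python
import numpy as np

class LinearThompsonSampler:
    """Thompson Sampling for a linear value model Q(s, a) = phi(s, a) @ theta."""

    def __init__(self, dim, prior_precision=1.0, noise_scale=1.0, seed=0):
        self.precision = prior_precision * np.eye(dim)  # lambda * I + sum of phi phi^T
        self.moment = np.zeros(dim)                     # sum of phi * target
        self.noise_scale = noise_scale
        self.rng = np.random.default_rng(seed)

    def update(self, phi, target):
        """Fold one (feature, regression target) pair into the posterior."""
        self.precision += np.outer(phi, phi)
        self.moment += phi * target

    def sample_weights(self):
        """Draw theta from the Gaussian posterior shaped by the feature covariance."""
        cov = np.linalg.inv(self.precision)
        cov = 0.5 * (cov + cov.T)                       # remove numerical asymmetry
        mean = cov @ self.moment
        return self.rng.multivariate_normal(mean, self.noise_scale**2 * cov)

    def act(self, action_features):
        """Act greedily with respect to one posterior sample."""
        theta = self.sample_weights()
        return int(np.argmax(action_features @ theta))

# Hypothetical one-step problem: three actions with one-hot features.
true_theta = np.array([0.1, 0.9, 0.3])
action_features = np.eye(3)
sampler = LinearThompsonSampler(dim=3)
env_rng = np.random.default_rng(1)

n_steps = 300
for _ in range(n_steps):
    a = sampler.act(action_features)
    reward = action_features[a] @ true_theta + 0.1 * env_rng.normal()
    sampler.update(action_features[a], reward)

posterior_mean = np.linalg.solve(sampler.precision, sampler.moment)
best_action = int(np.argmax(posterior_mean))
print("posterior mean:", posterior_mean, "best action:", best_action)
```

Directions in feature space that have been visited rarely keep a wide posterior, so sampled weights occasionally favour them; that is the mechanism by which the covariance drives exploration in this kind of scheme.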
Taxonomy
Topics: Neural Networks and Applications
