StructRL: Recovering Dynamic Programming Structure from Learning Dynamics in Distributional Reinforcement Learning
Ivo Nowak

TL;DR
This paper shows that the learning dynamics of distributional reinforcement learning reveal an underlying dynamic programming structure, and that this structure can be exploited to improve sampling and learning efficiency.
Contribution
It introduces a method to recover and exploit dynamic programming-like structure from distributional RL dynamics without explicit models.
Findings
The temporal evolution of return distributions indicates when and where learning occurs.
A temporal learning indicator t*(s) marks when each state undergoes its strongest learning update during training.
Using these signals, StructRL guides sampling in line with the learned propagation structure.
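To make the indicator concrete, here is a minimal Python sketch of how t*(s) could be computed from logged return distributions. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how update strength is measured, so the sketch assumes each state's return distribution is stored as a fixed number of equal-weight quantiles and measures an update by the 1-Wasserstein distance between consecutive snapshots. The function names, the array layout (snapshots, states, quantiles), and the toy data are all hypothetical.

```python
import numpy as np


def temporal_learning_indicator(quantile_history):
    """Return, per state, the snapshot index right after its largest update.

    quantile_history: array of shape (T, S, N) holding N equal-weight return
    quantiles for each of S states at each of T training snapshots.
    ASSUMPTION: update strength is the 1-Wasserstein distance between
    consecutive snapshots; the paper may define it differently.
    """
    q = np.sort(np.asarray(quantile_history, dtype=float), axis=-1)
    # For equal-weight atoms, W1 is the mean absolute gap between sorted atoms.
    deltas = np.abs(np.diff(q, axis=0)).mean(axis=-1)  # shape (T - 1, S)
    # +1 so the index refers to the snapshot after the strongest update.
    return deltas.argmax(axis=0) + 1


def propagation_order(t_star):
    """States sorted by when they learned (earliest first)."""
    return np.argsort(t_star, kind="stable")


# Toy data (hypothetical): a 4-state chain where the state nearest the goal
# (state 3) changes first and the change moves backward toward state 0.
T, S, N = 6, 4, 5
history = np.zeros((T, S, N))
for state, step in enumerate([4, 3, 2, 1]):
    history[step:, state, :] = 1.0

t_star = temporal_learning_indicator(history)
order = propagation_order(t_star)
print(t_star, order)
```

In the toy data the ordering runs from state 3 back to state 0, which is the kind of backward, dynamic programming-style propagation the findings describe. How StructRL turns such an ordering into a sampling rule is not specified in this summary, so the sketch stops at the ordering.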
Abstract
Reinforcement learning is typically treated as a uniform, data-driven optimization process, where updates are guided by rewards and temporal-difference errors without explicitly exploiting global structure. In contrast, dynamic programming methods rely on structured information propagation, enabling efficient and stable learning. In this paper, we provide evidence that such structure can be recovered from the learning dynamics of distributional reinforcement learning. By analyzing the temporal evolution of return distributions, we identify signals that capture when and where learning occurs in the state space. In particular, we introduce a temporal learning indicator t*(s) that reflects when a state undergoes its strongest learning update during training. Empirically, this signal induces an ordering over states that is consistent with a dynamic programming-style propagation of…
