Distributional value gradients for stochastic environments
Baptiste Debes, Tinne Tuytelaars

TL;DR
This paper introduces Distributional Sobolev Training, a novel method extending distributional reinforcement learning to model value gradients in stochastic environments, improving sample efficiency and robustness.
Contribution
The paper extends distributional RL to model value gradients using Sobolev spaces, with a new theoretical framework and a practical implementation based on a conditional Variational Autoencoder (cVAE) and Max-sliced Maximum Mean Discrepancy (MSMMD).
Findings
Effective on stochastic toy problems
Outperforms baselines on MuJoCo benchmarks
Proves contraction of Sobolev Bellman operator
Abstract
Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state-action spaces to model not only the distribution over scalar state-action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional…
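The sample-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the biased estimator, the random search over projection directions, and every function name here (`gaussian_kernel`, `mmd_squared`, `max_sliced_mmd_squared`) are assumptions chosen for illustration. Each sample is assumed to join a scalar return with its action gradient, so the discrepancy is measured over values and gradients together.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def project(samples, direction):
    """Project each sample onto a direction, returning 1-D samples as 1-tuples."""
    return [(sum(s * d for s, d in zip(sample, direction)),) for sample in samples]


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalising a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def max_sliced_mmd_squared(xs, ys, num_directions=64, bandwidth=1.0, seed=0):
    """Approximate the max over unit directions of the 1-D MMD by random search.

    Random search is an assumption made for this sketch; the maximising
    direction could instead be found by gradient ascent.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    best = 0.0
    for _ in range(num_directions):
        d = random_unit_vector(dim, rng)
        best = max(best, mmd_squared(project(xs, d), project(ys, d), bandwidth))
    return best


# Hypothetical samples: each tuple is (return, dQ/da) for a 1-D action.
predicted = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
target = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]
print(mmd_squared(predicted, target))
print(max_sliced_mmd_squared(predicted, target))
```

In this sketch the discrepancy is zero when the two sample sets coincide and grows as the predicted joint samples move away from the target samples, which is the property a sample-based critic loss of this kind relies on.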
Peer Reviews
Decision: ICLR 2026 Poster
- Extending the idea of distributional RL to action-gradient is interesting. By proposing to learn the distribution of the gradient, the work offers a principled and more complete model of credit assignment under uncertainty. This moves the field beyond the limitations of prior methods like MAGE, which are confined to modeling the gradient's expectation and are thus vulnerable to noise.
- The formal derivation of the Sobolev Bellman operator is solid.
- Except for theoretical contributions, the
- As the authors stated, one weakness of DSDPG is its computational complexity.
- There is a disconnect between the theoretical motivation for the MSMMD metric and its empirical utility. The paper introduces MSMMD primarily to obtain a provable contraction for the Sobolev Bellman operator under a tractable, sample-based metric. Standard MMD does not offer this same general guarantee. However, the empirical results across both the toy problem and the MuJoCo benchmarks show that the MSMMD variant
- This paper makes foundational contributions to the important and timely topic of distributional RL.
- It is exceptionally well-written, with both pedagogical clarity and mathematical precision (admittedly, it gets a bit dense at times, but I think that is unavoidable).
- The positioning against the Related Work is detailed, clear, and well articulated. Furthermore, it gives a balanced and honest discussion of its limitations.
- I appreciate the pedagogical toy problem
- The extensive appendic
Honestly, I can't think of any major weaknesses. Of course, it would have been nice to see clear improvements over the baselines on an established benchmark. Still, I think the theoretical and algorithmic contributions by far outweigh the importance of such experimental results.
Minor things:
- Some of the figures (and especially the text therein) are tiny and require zooming.
- As a matter of personal preference, I think it's neater to emphasize text with italics rather than bold font, but I
* The proposed idea is intellectually stimulating and pushes the frontier on how actor-critic algorithms work. It presents a novel idea and sound theoretical analysis to justify it.
* The proposed practical algorithm takes reasonable approaches for the required theoretical estimates, such as MMD and MSMMD.
* The use of the toy domain is useful to study if the proposed technique can deal with increased stochasticity as claimed.
* The statistical bounds used in the evaluations are heartening and ap
* The use of the VAE is a practical requirement to deal with environments without differentiable dynamics. Using a toy environment where the gradients of the environment are known would have been even better to evaluate how well the idea would work if the dynamics did not have to be estimated.
* While the results in the toy domain are impressive, it is unclear why they do not translate as well to the MuJoCo environments, where MAGE remains within the margin of error. This difference is not a dea
Taxonomy
Topics: Reinforcement Learning in Robotics · Model Reduction and Neural Networks · Robot Manipulation and Learning
