D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks
Jundong Zhang, Yuhui Situ, Fanji Zhang, Rongji Deng, Tianqi Wei

TL;DR
This paper introduces a novel reinforcement learning framework for high-risk-high-return tasks that models multimodal action distributions using discretization and a dual-critic architecture, leading to improved performance in complex control tasks.
Contribution
It proposes a new RL approach that discretizes actions, uses entropy regularization, and employs dual critics to better handle multimodal and risky actions in high-dimensional spaces.
Findings
Outperforms baseline methods on locomotion and manipulation benchmarks
Explicit modeling of multimodality improves risk management in RL
Discretization enables effective approximation of complex action distributions
Abstract
Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments…
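As a rough illustration of points (i) and (ii), the sketch below shows one way a per-dimension discretized policy could be represented: an n×m matrix of logits, one row per action dimension, sampled row by row and mapped back to continuous values. This is not the authors' implementation. The function names (`make_bins`, `sample_action`, `policy_entropy`), the use of evenly spaced bin centers, and the NumPy setup are all assumptions made for illustration. The dual distributional critics are not covered here.

```python
import numpy as np


def make_bins(low, high, m):
    """Return an (n, m) array of evenly spaced bin centers (assumed layout)."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    frac = (np.arange(m) + 0.5) / m  # centers of m equal-width bins in [0, 1]
    return low[:, None] + (high - low)[:, None] * frac[None, :]


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sample_action(logits, bins, rng):
    """Sample one bin per action dimension independently (row-wise)."""
    probs = softmax(logits)  # shape (n, m)
    n, m = probs.shape
    idx = np.array([rng.choice(m, p=probs[i]) for i in range(n)])
    action = bins[np.arange(n), idx]  # continuous value of each chosen bin
    return action, idx


def policy_entropy(logits):
    """Sum of per-dimension categorical entropies (usable as an exploration bonus)."""
    probs = softmax(logits)
    return float(-(probs * np.log(probs + 1e-12)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bins = make_bins(low=[-1.0, -1.0], high=[1.0, 1.0], m=4)
    # Hypothetical logits: the first row is bimodal (mass on both extreme bins),
    # which a single Gaussian could not represent; the second row is uniform.
    logits = np.array([[3.0, 0.0, 0.0, 3.0],
                       [0.0, 0.0, 0.0, 0.0]])
    action, idx = sample_action(logits, bins, rng)
    print("action:", action, "bin indices:", idx)
    print("entropy:", policy_entropy(logits))
```

Because each row is sampled independently, the policy outputs n×m numbers rather than enumerating all m^n joint actions, at the cost of ignoring dependencies between action dimensions.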
Peer Reviews
Decision · Submitted to ICLR 2026
Clear problem framing. The HRHR definition crisply captures when the average return of a risky region is lower despite having the global maximum, and the theorem illustrates why a Gaussian policy with variance larger than the grain drifts toward safer suboptimal regions. This directly motivates discretization. Method simplicity & compatibility. The per-dimension discretization with a matrix policy is easy to implement and plug into existing actor-critic setups; the double distributional criti
1. The multidimensional discrete actor assumes independence across action dimensions (row-wise sampling). While this avoids an explicit m^n enumeration, it may miss inter-dimensional couplings crucial in dexterous manipulation or legged control with coupled joints. Please discuss when factorization suffices, and whether an autoregressive or flow-based discrete policy could capture dependencies without exploding compute. 2. Discretization increases output size (n×m) and critic heads (distributi
- HRHR is an interesting problem formulation, and one in which Gaussian policies fail. - Empirical results in two environments are strong.
- Few domains clearly demonstrate the HRHR framework. The two primary environments shown are limited and do not clearly fall under HRHR. The paper would benefit from several classes of environments where HRHR is a meaningful problem. - Even in HRHR, Gaussian policies may be optimal because the policy can still optimize one of the peaks of the Q-function. It is unclear what the key failure mode of policy learning with multimodal Q-functions is. - How does this paper comp
The paper shows experimental results in complex tasks (bipedal robots) and the results are significantly better than other approaches.
- Unfortunately, the paper is not well written, with many distracting typos as well as syntax errors such as incomplete sentences. The most common is "distributed" versus "distributional", but there are others, like "critic" versus "criticism". Even if I try to ignore these typos, I can't find clear justifications for why the algorithm performs better than others in the experimental setup. - The term 'risk' may not be appropriate, if we think that it is used in other topics like Conditional Value at Risk, or risk
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Adversarial Robustness in Machine Learning
