D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks

Jundong Zhang; Yuhui Situ; Fanji Zhang; Rongji Deng; Tianqi Wei

arXiv:2510.17212·cs.LG·October 21, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a novel reinforcement learning framework for high-risk-high-return tasks that models multimodal action distributions using discretization and a dual-critic architecture, leading to improved performance in complex control tasks.

Contribution

It proposes a new RL approach that discretizes actions, uses entropy regularization, and employs dual critics to better handle multimodal and risky actions in high-dimensional spaces.

Findings

01. Outperforms baseline methods on locomotion and manipulation benchmarks

02. Explicit modeling of multimodality improves risk management in RL

03. Discretization enables effective approximation of complex action distributions

Abstract

Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments…
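
Points (i) and (ii) of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each of the n action dimensions is split into m equal-width bins over [-1, 1], that each dimension is sampled independently from one row of an n×m logits matrix, and that the entropy bonus is the sum of the per-dimension categorical entropies. The names `bin_centers`, `sample_action`, and `policy_entropy` are hypothetical.

```python
import numpy as np

def bin_centers(low, high, m):
    """Centers of m equal-width bins covering [low, high]."""
    edges = np.linspace(low, high, m + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_action(logits, centers, rng):
    """Sample one bin per action dimension (rows treated as independent)."""
    probs = softmax(logits)  # shape (n, m)
    idx = np.array([rng.choice(len(centers), p=p) for p in probs])
    return centers[idx], idx

def policy_entropy(logits):
    """Sum of per-dimension categorical entropies (assumed regularizer)."""
    probs = softmax(logits)
    return float(-(probs * np.log(probs + 1e-12)).sum())

rng = np.random.default_rng(0)
n, m = 3, 5                      # 3 action dimensions, 5 bins each
centers = bin_centers(-1.0, 1.0, m)
logits = np.zeros((n, m))        # uniform policy: maximum entropy
action, idx = sample_action(logits, centers, rng)

# A row with two separated high-logit bins is a bimodal action distribution,
# something a single Gaussian over that dimension cannot represent.
bimodal_row = softmax(np.array([[4.0, 0.0, 0.0, 0.0, 4.0]]))[0]
```

The policy head here outputs n×m numbers rather than one value per joint action, so the m^n joint actions are never enumerated; the cost is that dependencies between dimensions are not modeled.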

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

Clear problem framing. The HRHR definition crisply captures when the average return of a risky region is lower despite having the global maximum, and the theorem illustrates why a Gaussian policy with variance larger than the grain drifts toward safer suboptimal regions. This directly motivates discretization.

Method simplicity & compatibility. The per-dimension discretization with a matrix policy is easy to implement and plug into existing actor-critic setups; the double distributional criti
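
The Gaussian-policy argument in the first strength above can be illustrated numerically. The sketch below uses a toy 1-D return landscape that is an assumption made for illustration and is not taken from the paper: actions at or below 0 are safe (return 1), a narrow window of width 0.05 around 0.5 gives the global maximum (return 3), and every other positive action fails (return -1). The expected return of a Gaussian policy is computed in closed form from the normal CDF.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical toy landscape (illustrative values, not from the paper).
PEAK_LO, PEAK_HI = 0.475, 0.525          # narrow high-return window
R_SAFE, R_PEAK, R_FAIL = 1.0, 3.0, -1.0

def expected_return(mu, sigma):
    """Expected return of a Gaussian policy N(mu, sigma^2) on the toy task."""
    p_safe = normal_cdf(0.0, mu, sigma)
    p_peak = normal_cdf(PEAK_HI, mu, sigma) - normal_cdf(PEAK_LO, mu, sigma)
    p_fail = 1.0 - p_safe - p_peak
    return R_SAFE * p_safe + R_PEAK * p_peak + R_FAIL * p_fail

wide_risky = expected_return(0.5, 0.2)      # wide Gaussian centered on the peak
wide_safe = expected_return(-0.5, 0.2)      # same width, centered in the safe region
narrow_risky = expected_return(0.5, 0.005)  # much narrower than the peak window
```

In this toy setting `wide_safe` exceeds `wide_risky`, so a Gaussian whose standard deviation is larger than the peak width is pushed toward the safe region, while `narrow_risky` is close to the global maximum of 3. A discrete policy that puts its mass on the bin containing the peak avoids averaging over the surrounding failure region.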

Weaknesses

1. The multidimensional discrete actor assumes independence across action dimensions (row-wise sampling). While this avoids an explicit m^n enumeration, it may miss inter-dimensional couplings crucial in dexterous manipulation or legged control with coupled joints. Please discuss when factorization suffices, and whether an autoregressive or flow-based discrete policy could capture dependencies without exploding compute.
2. Discretization increases output size (n×m) and critic heads (distributi

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- HRHR is an interesting problem formulation, and one under which Gaussian policies fail.
- Empirical results in two environments are strong.

Weaknesses

- Limited domains that clearly demonstrate the HRHR framework. The two primary environments shown are limited and do not exactly fall under the domain of HRHR. The paper would benefit from several classes of environments where HRHR is a meaningful problem.
- Even in HRHR, Gaussian policies may be optimal because the policy can still optimize one of the peaks of the Q-function. It is unclear what the key failure mode of policy learning with multimodal Q-functions is.
- How does this paper comp

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper shows experimental results in complex tasks (bipedal robots) and the results are significantly better than other approaches.

Weaknesses

- Unfortunately, the paper is not well written, with many distracting typos and also syntax errors such as incomplete sentences. The most common is "distributed" versus "distributional", but there are others, such as "critic" versus "criticism". Even if I try to ignore these typos, I can't find clear justifications for why the algorithm performs better than others in the experimental setup.
- The term 'risk' may not be appropriate, if we think that it is used in other topics like Conditional Value at Risk, or risk

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Adversarial Robustness in Machine Learning