Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL
Guojian Zhan, Likun Wang, Pengcheng Wang, Feihong Zhang, Jingliang Duan, Masayoshi Tomizuka, Shengbo Eben Li

TL;DR
This paper introduces a trajectory entropy-constrained reinforcement learning framework that improves stability and performance by separately learning reward and entropy Q-functions, enabling long-term entropy control.
Contribution
It proposes a novel TECRL framework with separate Q-functions for reward and entropy, addressing non-stationary Q-value estimation and short-sighted entropy tuning in maximum entropy RL.
Findings
DSAC-E achieves higher returns on benchmarks.
Enhanced stability over existing methods.
Effective long-term entropy control.
Abstract
Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., temperature; and (2) short-sighted local entropy tuning that adjusts temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extend the maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. Then, the…
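The first bottleneck can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the discount value, and the exact definition of the entropy Q-function (here, discounted future negative log-probability) are assumptions chosen for illustration. It contrasts a SAC-style soft target, which shifts whenever the temperature `alpha` changes, with two decoupled targets that do not contain `alpha` at all.

```python
GAMMA = 0.99  # assumed discount factor, for illustration only


def soft_target(r, q_next, logp_next, alpha, gamma=GAMMA):
    """SAC-style soft target: the entropy bonus is mixed into the value,
    so the regression target moves every time alpha is updated."""
    return r + gamma * (q_next - alpha * logp_next)


def decoupled_targets(r, qr_next, qe_next, logp_next, gamma=GAMMA):
    """Hypothetical decoupled targets: one critic tracks reward only,
    the other tracks entropy only. Neither target depends on alpha."""
    target_r = r + gamma * qr_next
    target_e = gamma * (qe_next - logp_next)
    return target_r, target_e


# Example transition (made-up numbers).
r, qr_next, qe_next, logp_next = 1.0, 2.0, 0.5, -1.2
target_r, target_e = decoupled_targets(r, qr_next, qe_next, logp_next)

for alpha in (0.1, 0.5):
    # For a fixed alpha, the soft target is recovered as target_r + alpha * target_e.
    combined = soft_target(r, qr_next + alpha * qe_next, logp_next, alpha)
    print(alpha, combined, target_r + alpha * target_e)
```

Under this convention the temperature enters only when the two critics are combined for the policy update, which is one way to read the abstract's claim that the value targets are unaffected by temperature updates.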
Peer Reviews
Decision · Submitted to ICLR 2026
- The idea of completely decoupling the reward and entropy value streams into two separate critics ($Q_r$ and $Q_e$) is a novel and clean architectural approach.
- The proposed algorithm, DSAC-E, demonstrates state-of-the-art performance on a suite of standard MuJoCo benchmarks, consistently outperforming its direct predecessor, DSAC-T, as well as SAC and other baselines.
- The ablation study in Table 2 effectively isolates the performance contributions of the two main components (RES and TEC)
1. The paper's entire motivation rests on solving two "bottlenecks," but the justification for their existence and severity is weak and not supported by evidence.
   - Non-stationary Q-value: The paper asserts that updating alpha makes the Q-target non-stationary. While true, this is a minor effect compared to the policy pi and Q-function Q themselves being updated, which is the primary source of non-stationarity in all bootstrapped RL. The paper provides no empirical evidence (e.g., plots of targe
- The proposed technique is simple and clearly presented, with writing that is easy to follow.
- TECRL offers significant performance gains compared to the DSAC-T baseline, especially for certain control tasks such as Humanoid, Ant, and Walker2d.
- The method introduces an additional hyperparameter, $\rho$, which appears to require environment-specific tuning (e.g., $\rho=20$ for Humanoid/Walker2d vs. $\rho=1$ elsewhere), increasing tuning complexity.
- The theoretical investigation is limited. The “performance bound” (Sec. 3.3) fixes $\alpha_{\text{soft}}^{*}$ from the MaxEnt optimum and algebraically relates return to $\mathcal{H}^{*}_{\text{soft}} - \mathcal{H}_{\text{budget}}$. However, it doesn’t yield a constructive guarantee f
1. The idea of separating the critic into reward-centric and entropy-centric components is intuitive and conceptually clear.
2. The trajectory-level entropy constraint provides a novel perspective that could inspire future work on entropy-based control.
3. The empirical results show noticeable improvements on several MuJoCo environments compared to strong baselines.
- Unclear theoretical analysis: The analysis in Subsection 3.3 is not sufficiently clear. It is not convincing that the trajectory entropy budget necessarily leads to a higher performance bound. The metric used to define the "performance bound" should be explicitly stated. Equation (18) alone does not imply that enforcing an entropy budget guarantees performance improvement (the inequality logic $C \le A + B$ does not lead to $C \ge A$).
- Inconsistent experimental results: Some results differ from t
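The inequality point in the review above can be checked with a small counterexample. Here $A$, $B$, and $C$ are the reviewer's generic placeholders, and the numbers are chosen purely for illustration; they are not taken from the paper.

$$A = 1,\quad B = 0,\quad C = 0 \;\;\Longrightarrow\;\; C \le A + B \ \text{holds, while}\ C \ge A \ \text{fails.}$$

An upper bound of the form $C \le A + B$ therefore says nothing, on its own, about whether $C$ is at least $A$.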
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control
