Learning in complex action spaces without policy gradients
Arash Tavakoli, Sina Ghiassian, Nemanja Rakićević

TL;DR
This paper challenges the belief that policy gradient methods are inherently better for complex action spaces by introducing a framework that enables action-value methods to perform similarly without policy gradients.
Contribution
The authors propose a universal principles-based framework for action-value methods, exemplified by QMLE, that performs well in complex action spaces without policy gradients.
Findings
QMLE performs comparably to policy gradient methods in complex spaces
QMLE achieves strong results on DeepMind Control Suite
The framework broadens applicability of action-value methods
Abstract
While conventional wisdom holds that policy gradient methods are better suited to complex action spaces than action-value methods, foundational work has shown that the two paradigms are equivalent in small, finite action spaces (O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm but from universal principles that can also be applied to action-value methods, enabling similar functions. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation.…
Peer Reviews
Decision · Submitted to ICLR 2025
- The paper presents a novel perspective by challenging the conventional wisdom that policy gradient methods are inherently superior for complex action spaces, opening up new avenues for research and exploration.
- The introduction of the Q-learning with Maximum Likelihood Estimation (QMLE) framework effectively integrates principles traditionally associated with policy gradients into action-value learning, showcasing versatility and adaptability.
- The authors provide robust empirical results d
- While the paper introduces the QMLE framework, it lacks in-depth theoretical analysis or proof of convergence properties, which could strengthen the foundational understanding of the proposed method.
- The empirical evaluations are primarily conducted in the DeepMind Control Suite, which may not fully represent the challenges and complexities found in more diverse real-world environments, limiting the generalizability of the findings.
- The paper could benefit from a more thorough comparison w
1. The identification of the three core principles is noteworthy and has the potential to influence future research on policy gradient methods, even considering that the paper's goal is to provide alternatives to traditional policy gradients.
2. The experimental setup appears sound, with comparisons against several baselines across multiple seeds, and the results generally favor the proposed method. However, I would suggest including an ablation study (see Point 3 below).
1. My major concern is with the overall goal of the paper, as its central premise is unclear to me. First, what is the issue with policy gradients that motivates the authors to replace them? This should have been clarified to justify the need for an alternative approach. If the goal is to develop a method that is simpler (i.e., fewer components, reduced computation, fewer hyperparameters) than policy gradients, similar to the approach in [1], I would argue that the proposed method is in fact mo
1. The proposed principles underlying the scalability of policy gradient methods are intriguing observations, and a careful analysis of the gap between the two paradigms is insightful.
2. The presentation of the ideas proposed is clear.
3. Experiments were conducted across multiple domains, although without significant improvement.
1. There are no theoretical guarantees for computing the maximization over A_m instead of A (Eq. 17). (In policy gradient methods, using an MC estimator in place of exact summation or integration has theoretical foundations.)
2. The proposed method trains all the predictors from historical argmax approximations to construct a small action space A_m for computing an approximation of the best action. The iterative dependency restricts the actions to a subset where most actions are similar, potential
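The idea this last review describes, maximizing over a small sampled set A_m rather than the full action space A while a predictor is trained toward past approximate argmax actions, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy `q_value`, the single fixed-variance Gaussian proposal, the function names, and all hyperparameters are assumptions chosen only for illustration.

```python
import random


def q_value(action):
    # Hypothetical toy action-value function with its peak at (0.5, -0.3).
    return -((action[0] - 0.5) ** 2 + (action[1] + 0.3) ** 2)


def sample_candidates(mean, std, m, rng):
    # Build the small candidate set A_m: the proposal mean itself plus
    # m - 1 Gaussian samples around it, clipped to the box [-1, 1]^d.
    candidates = [tuple(mean)]
    for _ in range(m - 1):
        candidates.append(
            tuple(max(-1.0, min(1.0, rng.gauss(mu, std))) for mu in mean)
        )
    return candidates


def approx_argmax(q, candidates):
    # Approximate argmax over the full action space by an exact argmax
    # over the finite candidate set A_m.
    return max(candidates, key=q)


def mle_step(mean, target, lr):
    # For a Gaussian with fixed std, the gradient of log N(target; mean, std^2)
    # with respect to the mean is (target - mean) / std^2. Absorbing std^2 into
    # the step size, one ascent step moves the mean toward the target action.
    return [mu + lr * (t - mu) for mu, t in zip(mean, target)]


def run(steps=200, m=16, std=0.2, lr=0.5, seed=0):
    # Alternate between (1) approximate maximization over A_m and
    # (2) a maximum-likelihood update of the proposal toward the found action.
    rng = random.Random(seed)
    mean = [0.0, 0.0]
    for _ in range(steps):
        best = approx_argmax(q_value, sample_candidates(mean, std, m, rng))
        mean = mle_step(mean, best, lr)
    return mean


if __name__ == "__main__":
    final_mean = run()
    print("proposal mean:", final_mean, "Q:", q_value(final_mean))
```

Because the proposal mean is always included in A_m and the toy Q-function is concave, the value at the mean never decreases in this sketch. The sketch also makes the reviewer's concern concrete: candidates are drawn near past argmax estimates, so A_m tends to contain similar actions.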
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Adam · Batch Normalization · Prioritized Experience Replay · N-step Returns · Distributed Distributional DDPG · Q-Learning
