Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

Ayush Jain; Norio Kosaka; Xinhu Li; Kyung-Min Kim; Erdem Bıyık; Joseph J. Lim

arXiv:2410.11833·cs.LG·October 13, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SAVO, an actor architecture for reinforcement learning that generates multiple action proposals and repeatedly approximates the Q-function with poor local optima truncated, so that gradient ascent is less likely to get stuck in complex tasks.

Contribution

SAVO is a novel actor architecture that improves deterministic policy gradients by handling local optima through multiple proposals and Q-function truncation.

Findings

01. SAVO outperforms existing architectures in complex tasks.

02. It finds optimal actions more frequently.

03. It is effective across diverse environments: restricted locomotion, dexterous manipulation, and large discrete-action recommender systems.

Abstract

In reinforcement learning, off-policy actor-critic methods like DDPG and TD3 use deterministic policy gradients: the Q-function is learned from environment data, while the actor maximizes it via gradient ascent. We observe that in complex tasks such as dexterous manipulation and restricted locomotion with mobility constraints, the Q-function exhibits many local optima, making gradient ascent prone to getting stuck. To address this, we introduce SAVO, an actor architecture that (i) generates multiple action proposals and selects the one with the highest Q-value, and (ii) approximates the Q-function repeatedly by truncating poor local optima to guide gradient ascent more effectively. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms…
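The two mechanisms described in the abstract can be illustrated with a minimal sketch on a toy one-dimensional Q landscape. Everything in it is an assumption made for illustration: the function `q_value`, the helper names, the finite-difference gradient ascent standing in for a learned actor, and the surrogate form max(Q(a), τ), where τ is the best Q-value among earlier proposals. That form is one plausible reading of "truncating poor local optima", not a definition confirmed by the text above.

```python
import math

def q_value(a):
    """Toy 1-D Q landscape (hypothetical): local peak near -0.5, global peak near 0.6."""
    local_peak = 0.5 * math.exp(-((a + 0.5) / 0.2) ** 2)
    global_peak = 1.0 * math.exp(-((a - 0.6) / 0.2) ** 2)
    return local_peak + global_peak

def gradient_ascent(f, a, lr=0.01, steps=2000, eps=1e-5):
    """Finite-difference gradient ascent on f, with actions clipped to [-1, 1]."""
    for _ in range(steps):
        grad = (f(a + eps) - f(a - eps)) / (2 * eps)
        a = min(1.0, max(-1.0, a + lr * grad))
    return a

def select_best(q, proposals):
    """Mechanism (i): keep the proposal with the highest Q-value."""
    return max(proposals, key=q)

def truncated_surrogate(q, previous_actions):
    """Mechanism (ii), assumed form: clip Q from below at the best earlier proposal's value."""
    tau = max(q(a) for a in previous_actions)
    return lambda a: max(q(a), tau)

# A single ascent started on the left climbs the nearer, lower peak.
a_local = gradient_ascent(q_value, -0.7)
# A second proposal started elsewhere reaches the higher peak.
a_global = gradient_ascent(q_value, 0.3)
# Selecting by Q-value keeps the better of the two proposals.
best = select_best(q_value, [a_local, a_global])

# The surrogate flattens every region whose Q-value is below the first proposal's.
psi = truncated_surrogate(q_value, [a_local])

print(f"single ascent: a={a_local:.3f}, Q={q_value(a_local):.3f}")
print(f"best of two:   a={best:.3f}, Q={q_value(best):.3f}")
print(f"surrogate at a=-0.7: {psi(-0.7):.3f} (Q there is {q_value(-0.7):.3f})")
```

In this toy landscape the flattened region has zero gradient, so an ascent started inside it would not move. How to keep the surrogate useful for learning in those regions is outside the scope of this sketch.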

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 2

Strengths

- Ensemble of Actors: The ensemble-style actor, which combines policies and selects actions based on the highest Q-values, is novel afaik.
- The surrogate Q-functions inspired by tabu search could potentially improve gradient-based optimization by reducing the number of local optima in the action space.
- The convergence proof for the maximizer actor in the tabular setting has been provided, although I did not have time to check in-depth.

Weaknesses

- Quality of Action Proposals: I'm wondering whether the quality of action proposals from additional policies $v_i$ could affect the effectiveness of the maximizer actor? If these proposals are not close to the optimal actions, the maximizer actor’s effectiveness is limited, as it can only select from the actions provided. In other words, the performance of the maximizer actor is capped by the quality of these action proposals.
- The effect of smoothing in highly dynamical environments: In rap

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper is easy to read, generally well written, and well motivated by the issue of converging to local optima. The idea of restricting local optima using tabu search in this successive manner is very interesting. The idea appears fairly novel. The closest work to this is probably SAC, although this paper is quite a bit different in motivation. This helps with novelty.

Weaknesses

1. The complexity of the true underlying Q function should affect how many surrogates are needed. There should be an analysis or heuristic on how to determine the optimal or necessary number of surrogates needed for a particular task. This will also determine the computational viability of SAVO, in situations where we might require several surrogates to converge to global optima.
2. The experiment environments are quite simplistic with only one set of 3D environments. There need

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The paper clearly motivates the challenge of maximizing the Q-function in commonly used deterministic policy gradient algorithms. Figures 1-3 are all strong evidence of this problem.
2. The paper is well-written. Especially for the transition from Section 3 to Section 4, the connection is quite natural. The introduction of the main method of SAVO is also clear.
3. The extensive experiments (5 environments) also strongly prove the effectiveness of the SAVO architecture.
4. It is very

Weaknesses

1. The surrogate value function also requires a smooth version to improve performance. Is it also possible to add smoothness to the original Q-value? It would be helpful to add it as a baseline.

Taxonomy

Topics: Risk and Portfolio Optimization

Methods: Experience Replay · Dense Connections · Target Policy Smoothing · Clipped Double Q-learning · Adam · Batch Normalization · Weight Decay · Convolution · Deep Deterministic Policy Gradient