Suppressing Overestimation in Q-Learning through Adversarial Behaviors
HyeAnn Lee, Donghwan Lee

TL;DR
This paper introduces DAQ, a novel Q-learning algorithm with a dummy adversarial player that effectively reduces overestimation bias by framing learning as a two-player game, improving performance in benchmark tasks.
Contribution
The paper proposes a new adversarial Q-learning framework that unifies and enhances existing methods to suppress overestimation bias in Q-learning algorithms.
Findings
DAQ effectively reduces overestimation bias.
Finite-time convergence of DAQ is theoretically analyzed.
Empirical results show improved performance on benchmarks.
Abstract
The goal of this paper is to propose a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, the learning can be formulated as a two-player zero-sum game. The proposed DAQ unifies several Q-learning variations that control overestimation biases, such as maxmin Q-learning and minmax Q-learning (proposed in this paper), in a single framework. The proposed DAQ is a simple but effective way to suppress the overestimation bias through dummy adversarial behaviors and can be easily applied to off-the-shelf reinforcement learning algorithms to improve their performance. The finite-time convergence of DAQ is analyzed from an integrated perspective by adapting an adversarial Q-learning. The performance of the suggested DAQ is empirically…
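The overestimation bias that the abstract refers to, and the effect of a maxmin-style target, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's DAQ algorithm: every true Q-value is taken to be 0, each estimate carries independent uniform noise, and the function name `max_bias` and all parameter values are hypothetical choices made for illustration.

```python
import random

def max_bias(num_actions, num_estimators, trials=20000, seed=0):
    """Average of max_a min_i Q_i(a) over many trials.

    Assumption for illustration: every true Q-value is 0 and each
    estimate Q_i(a) is independent uniform noise in [-1, 1], so any
    nonzero average is pure estimation bias.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # For each action, keep the minimum over the independent estimators.
        per_action = [
            min(rng.uniform(-1.0, 1.0) for _ in range(num_estimators))
            for _ in range(num_actions)
        ]
        # The bootstrapped target then takes the maximum over actions.
        total += max(per_action)
    return total / trials

# One estimator: the plain max operator of standard Q-learning.
single = max_bias(num_actions=4, num_estimators=1)
# Two estimators: a maxmin-style target (min over estimators, then max over actions).
maxmin = max_bias(num_actions=4, num_estimators=2)

print(f"bias with plain max:   {single:.3f}")
print(f"bias with maxmin, N=2: {maxmin:.3f}")
```

With a single estimator the average is clearly positive even though the true values are all zero, which is the overestimation effect. Taking the minimum over two estimators before the maximum lowers that average, and larger numbers of estimators push it further down, eventually toward underestimation.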
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper is well written and easy to follow. The method is described clearly. The discussion of the background literature provides a good motivation for the proposed method and illustrates the connections. The unified view of minmax and maxmin optimization for the Q-value is good, and connecting them with two-player Markov games is reasonable.
Switching from the maxmin operator to minmax Q-learning and adding reward shifts is quite straightforward. The only difference between minmax/maxmin DAQ and minmax/maxmin Q-learning seems to be the constant reward shift, which is a hyperparameter for performance tuning. The paper does not explain in detail how the shift value is chosen, nor does it give any method for determining it. How will performance be affected if the shift values are different? The two DAQ algorithms are proposed w
+ The paper is written fairly clearly and the method is clearly described.
+ The experiments are quite comprehensive, at least in the toy domains that they are used in.
+ The finite-time convergence analysis is nice to see, and the proof seems correct.
+ I am not entirely convinced by the theoretical basis for the proposed method. In particular, in section 4.2 it's stated that the addition of $b_i$ doesn't change the optimal policy as it simply adds a constant bias to the Q-value. This is certainly true in the infinite-horizon case, but I don't think it's true in the finite-horizon case, which several (most?) of the experiments are set in. For concreteness, consider a very simple chain MDP like a <- b <- c -> d where a and d are terminal stat
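The reviewer's distinction between the infinite-horizon and finite-horizon cases can be made concrete with a short derivation. This is a sketch under assumptions that are not taken from the paper: a single constant shift $b$ added to every reward, a discount factor $\gamma < 1$, a fixed policy $\pi$, and $T$ denoting the (possibly random) number of steps until termination.

```latex
% Infinite horizon: the shift adds the same constant to every state-action pair,
% so the greedy action is unchanged.
Q_b^{\pi}(s,a)
  = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,(r_t + b) \,\middle|\, s_0 = s,\ a_0 = a\right]
  = Q^{\pi}(s,a) + \frac{b}{1-\gamma}

% Episodic case: the added term depends on the episode length T, which in turn
% depends on the starting pair (s,a) and on the policy.
Q_b^{\pi}(s,a)
  = Q^{\pi}(s,a) + b\,\mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t} \,\middle|\, s_0 = s,\ a_0 = a\right]
  = Q^{\pi}(s,a) + b\,\mathbb{E}\!\left[\frac{1-\gamma^{T}}{1-\gamma} \,\middle|\, s_0 = s,\ a_0 = a\right]
```

In the second case, actions that lead to longer episodes collect more of the shift when $b > 0$ (and less when $b < 0$), so the ordering of actions under $Q_b^{\pi}$ need not match the ordering under $Q^{\pi}$.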
1. The overestimation issue is important and is being studied by many researchers; 2. The paper presents its idea clearly.
The primary concern is in the significance and soundness. Significance. The contribution is incremental. The proposed algorithm is a small addition to two existing methods. Although a small addition does not directly indicate a rejection, I do not see the significance of such addition. Soundness. Note that the primary claim of “designing DAQ to mitigate overestimation” does not hold: the addition itself does not reduce overestimation, at least, the maxmin operator can already have the effect
The paper is generally very easy to follow, and generally has strong performance in the domains tested over. In particular, DAQ does very well compared to other baselines on sparse or negative reward tasks such as Sutton's example and Weng's example. The theoretical claims the paper brings up are solid and seem correct.
I think the paper is missing some experiments that were done in the original minmax Q-learning paper, specifically those on the MinAtar and OpenAI Gym benchmarks. I think more experiments to show how well DAQ works in slightly more scaled-up environments would be a very strong addition to this paper. This is my main concern.
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Q-Learning
