Tree Search-Based Policy Optimization under Stochastic Execution Delay
David Valensi, Esther Derman, Shie Mannor, Gal Dalal

TL;DR
This paper introduces stochastic delayed execution MDPs and a model-based policy optimization algorithm, DEZ, that effectively handles stochastic delays, outperforming baselines in Atari experiments.
Contribution
The paper formalizes stochastic delayed execution MDPs and develops DEZ, a novel algorithm that optimizes policies considering stochastic delays without state augmentation.
Findings
DEZ outperforms baselines in stochastic delay scenarios
Naive methods underperform with stochastic delays
The approach maintains sample efficiency similar to EfficientZero
Abstract
The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. DEZ leverages Monte-Carlo tree search similar to its non-delayed variant EfficientZero to accurately infer future states from the action queue. Thus, it handles…
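The idea of inferring a future state from the action queue can be illustrated with a minimal sketch. This is not the authors' implementation: `infer_future_state` and `toy_model` are hypothetical names, the model is a deterministic toy transition function standing in for a learned one, and the queue of pending actions is assumed to be fully known at decision time.

```python
from collections import deque
from typing import Callable, Deque

State = int
Action = int


def infer_future_state(
    state: State,
    pending: Deque[Action],
    model: Callable[[State, Action], State],
) -> State:
    """Roll a one-step model forward through every pending action.

    The result is the state in which a newly chosen action would take
    effect, assuming the pending actions execute in queue order.
    """
    for action in pending:
        state = model(state, action)
    return state


def toy_model(state: State, action: Action) -> State:
    # Hypothetical deterministic dynamics: the action shifts the state.
    return state + action


# Three actions were decided earlier but have not been executed yet.
pending_actions: Deque[Action] = deque([1, -1, 1])
predicted = infer_future_state(0, pending_actions, toy_model)
```

In this sketch a planner would start its search from `predicted` rather than from the currently observed state, which is why no augmented state (observation plus action queue) is needed as a policy input.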
Peer Reviews
Decision·ICLR 2024 poster
Standard RL assumes immediate availability of information for decision-making, overlooking delays prevalent in various real-world applications. Existing approaches resort to state augmentation, which is inefficient in handling exponential computational complexity and dependence on delay values, hindering its scalability to random delays. To address this issue, the authors propose Delayed EfficientZero, a delayed variant of EfficientZero, which is a model-based algorithm that optimizes over the
1. This paper adopts the ED-MDP formulation of (Derman et al., 2021), which sidesteps state augmentation, and extends it to the random delay case. This extension appears to be a direct application of the previous formulation; the authors are expected to explain the technical challenges compared to the constant delay formulation and to be clear about their technical contribution in terms of this formulation. 2. This paper mainly develops based on the ED-MDP formulation of (Derman et al., 2021), random de
**Originality:** The approach introduced in the paper is a novel algorithm for a problem framed in a more realistic way than before. **Clarity:** The paper is relatively clear and easy to understand. Some minor tweaks could be useful (see later). **Significance:** The approach proposed would be of interest to others working on MDPs with stochastic delays. **Quality:** The algorithm designed seems reasonable. The theoretical analysis looks sound; however, I did not thoroughly go through the proofs.
- I may be misunderstanding the graphs, but it looks as if SD-EZ scores worse than the other algorithms in most of the games in the plots in Fig 3a and 3b. Also, there are no confidence intervals or significance testing of any kind. - The algorithmic description is slightly difficult to follow. Perhaps breaking down the data structures (lists, etc.) used into a list would help ease the process.
The authors motivate the problem well. It's easy to understand why stochastic execution delay is an important problem in RL. The contributions are also strong. The extension of ED-MDPs into SED-MDPs seems useful, and Theorem 4.2 is a nice theoretical result that shows Delayed EfficientZero is a well-principled algorithm for them. The results seem promising, given that Figure 5 is correctly labeled and not Figure 3.
There are some clarity issues with the experiments, particularly in Figure 3. It seems mislabeled. Figure 5 in the appendix suggests blue should be SD-EZ, red delayed DQN, and white Oblivious-EZ. It also looks like delays appear from "low to high" rather than from "high to low" as suggested in the caption. It would also be nice for comparison's sake to standardize the figure so that the same game appears on a single row in both columns. Since SED-MDP is a new formalism, it may benefit the paper
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
Methods: Monte-Carlo Tree Search
