Adaptive Reinforcement Learning for Unobservable Random Delays
John Wikman, Alexandre Proutiere, David Broman

TL;DR
This paper introduces a novel framework and algorithm for reinforcement learning agents to adaptively handle unobservable, stochastic, and time-varying delays in real-world environments, improving performance over existing methods.
Contribution
The paper proposes the interaction layer framework and the ACDA algorithm, enabling RL agents to adaptively manage unobservable delays without prior knowledge of delay bounds.
Findings
ACDA outperforms state-of-the-art methods in locomotion benchmarks.
The interaction layer effectively handles unpredictable delays and packet loss.
Adaptive delay management improves RL performance in real-world scenarios.
Abstract
In standard Reinforcement Learning (RL) settings, the interaction between the agent and the environment is typically modeled as a Markov Decision Process (MDP), which assumes that the agent observes the system state instantaneously, selects an action without delay, and executes it immediately. In real-world dynamic environments, such as cyber-physical systems, this assumption often breaks down due to delays in the interaction between the agent and the system. These delays can vary stochastically over time and are typically unobservable, meaning they are unknown when deciding on an action. Existing methods deal with this uncertainty conservatively by assuming a known fixed upper bound on the delay, even if the delay is often much lower. In this work, we introduce the interaction layer, a general framework that enables agents to adaptively and seamlessly handle unobservable and…
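To make the delay setting in the abstract concrete, here is a minimal sketch of a channel that delivers each action after a random delay the sender never observes. It is not the paper's interaction layer or the ACDA algorithm. The class `RandomDelayChannel`, its parameters, the uniform delay distribution, and the rule for dropping stale actions are all hypothetical choices made for illustration.

```python
import random


class RandomDelayChannel:
    """Hypothetical helper: delivers each action after a hidden random delay."""

    def __init__(self, max_delay: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.max_delay = max_delay      # assumed bound, used only by the simulator
        self.in_flight = []             # (arrival_step, sequence_number, action)
        self.step = 0
        self.next_seq = 0
        self.last_applied_seq = -1

    def send(self, action):
        # The delay is sampled here and never reported back to the sender,
        # which is what makes it unobservable from the agent's side.
        delay = self.rng.randint(0, self.max_delay)  # inclusive on both ends
        self.in_flight.append((self.step + delay, self.next_seq, action))
        self.next_seq += 1

    def tick(self):
        """Advance one step; return the newest action that has arrived, else None."""
        arrived = [p for p in self.in_flight if p[0] <= self.step]
        self.in_flight = [p for p in self.in_flight if p[0] > self.step]
        self.step += 1
        if not arrived:
            return None
        _, seq, action = max(arrived, key=lambda p: p[1])
        if seq < self.last_applied_seq:
            return None  # stale: a newer action was already applied
        self.last_applied_seq = seq
        return action
```

A method that assumes a fixed worst-case delay would plan as if every action took `max_delay` steps to arrive, even though in this simulation many actions arrive sooner.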
Peer Reviews
Decision · Submitted to ICLR 2026
- The problem setting is important, as delays are often stochastic and action delays are not immediately observable.
- The paper is well motivated and provides a good overview of existing work.
- The proposed interaction layer, which represents actions as a matrix, is novel.
- A key weakness of the approach is that representing the action packet as a matrix can be expensive in both computation and communication. The work is motivated by improving efficiency under delays; however, computing the action matrix is significantly more time-consuming than generating a single action at each timestep. In particular, while the agent executes only $T$ actions, the framework requires computing and transmitting $T \times L^2$ actions over the network. As a result, many ac
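The reviewer's count can be checked with a small worked example. This sketch assumes that $L$ is the side length of the action matrix sent at each timestep, an interpretation inferred from the $T \times L^2$ expression; the excerpt does not define $L$. The function names are hypothetical.

```python
def transmitted_actions(num_steps: int, matrix_side: int) -> int:
    """Actions computed and sent if every step transmits an L x L action matrix."""
    return num_steps * matrix_side ** 2


def overhead_ratio(num_steps: int, matrix_side: int) -> float:
    """Transmitted actions per executed action (the agent executes one per step)."""
    return transmitted_actions(num_steps, matrix_side) / num_steps
```

Under this reading, an episode of $T = 1000$ steps with $L = 4$ transmits 16,000 actions to execute 1,000, and the ratio grows quadratically with $L$ regardless of $T$.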
1. The paper tackles an important and practical challenge in reinforcement learning, handling random, unobservable, and time-varying delays, which is highly relevant to real-world systems such as networked and robotic control.
2. The proposed interaction layer is conceptually elegant, bridging MDP and POMDP formulations for delayed environments and offering a general framework for asynchronous agent-environment interactions.
3. The ACDA algorithm integrates a heuristic delay adaptation mechani
1. The method assumes that delays remain approximately constant within short time windows. This assumption may not hold under highly dynamic or non-stationary delay patterns, potentially reducing the method's robustness.
2. The paper does not provide formal convergence analysis, stability proofs, or error bounds. As a result, the robustness of ACDA is supported primarily by empirical evidence rather than theoretical justification.
1. This paper tackles a practical problem in RL with delays: the delay is random and cannot be observed or predicted. This stands in sharp contrast to the unrealistic a priori assumptions about delay found in existing work in this field. The reviewer affirms the contribution of this work.
2. The paper's experimental design is comprehensive. The authors introduce action noise to induce stochasticity into the environments. Furthermore, the results in Appendix E.2 and E.3 serve to
1. The paper's most significant weakness lies in its comparative experimental design, as both the choice of baselines and their setup are unreasonable. The authors selected BPQL and VDPO, which are methods primarily designed for constant delay. However, in random delay environments, providing the worst-case delay as a priori knowledge to these algorithms is, in the reviewer's opinion, an unreasonable setup. The reviewer believes a more reasonable design would be to provide the mean of the random
Taxonomy
Topics: Advanced Research in Systems and Signal Processing · Innovation Diffusion and Forecasting · Reinforcement Learning in Robotics
