Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization
Simon Sinong Zhan, Qingyuan Wu, Philip Wang, Frank Yang, Xiangyu Shi, Chao Huang, Qi Zhu

TL;DR
This paper introduces DT-CORL, a belief-based offline RL framework that uses transformer models to generate delay-robust actions, effectively bridging the gap between simulation and real-world delayed environments.
Contribution
The paper presents a novel offline RL method that leverages transformers to handle delays during deployment without requiring delayed observations during training.
Findings
DT-CORL outperforms baseline methods in delay-robustness.
It improves sample efficiency compared to naive history augmentation.
It narrows the sim-to-real latency gap in benchmarks.
Abstract
Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations…
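To make the delay setting concrete, the sketch below shows the augmented-state construction that is standard in delayed RL: under a constant delay of d steps, the agent at time t only sees the state from time t - d, plus the actions it has issued since then. This is an illustrative example, not DT-CORL's actual implementation. It assumes a constant, known delay, and the function name `build_augmented_states` and the tuple layout are hypothetical.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def build_augmented_states(
    states: Sequence[State],
    actions: Sequence[Action],
    delay: int,
) -> List[Tuple[State, List[Action], State]]:
    """Relabel one delay-free trajectory as if observations arrived `delay` steps late.

    Hypothetical helper (not from the paper). Expects len(states) == len(actions) + 1,
    i.e. states s_0..s_T and actions a_0..a_{T-1}.

    Each returned item is (observed_state, pending_actions, true_state):
      observed_state  = s_{t-delay}, the last state the agent has actually seen
      pending_actions = a_{t-delay}..a_{t-1}, actions issued since that state
      true_state      = s_t, the unobserved current state (a belief-model target)
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")

    augmented = []
    for t in range(delay, len(states)):
        observed = states[t - delay]
        pending = list(actions[t - delay:t])
        augmented.append((observed, pending, states[t]))
    return augmented


# Toy trajectory with 1-D states and actions.
toy_states = [(0.0,), (1.0,), (2.0,), (3.0,)]
toy_actions = [(10.0,), (11.0,), (12.0,)]
toy_augmented = build_augmented_states(toy_states, toy_actions, delay=2)
```

With `delay=0` the pending-action list is empty and the observed state equals the true state, which recovers the delay-free case. A belief predictor in this setting would be trained to map the first two fields to the third.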
Peer Reviews
Decision: ICLR 2026 Poster
1. Addresses a well-motivated and practical problem at the intersection of offline RL and robustness to delays, which is essentially important in real-world scenarios.
2. The proposed integration of belief learning and policy optimization within an offline constraint framework is technically sound and justified by theoretical analysis.
3. Empirical evaluation is comprehensive, covering multiple tasks, delay types, and delay lengths, with ablations validating design choices.
4. The paper is cl
1. While the transformer belief model shows advantages, its computational overhead relative to simpler models is non-trivial.
2. The experimental validation is confined to standard simulation benchmarks (D4RL). The absence of validation on a physical system or a high-fidelity simulator with realistic latency weakens the claims of practical contribution.
- Interesting problem, well motivated and reasonably well written paper.
- The theory section is based on assumptions that aren't satisfied in practice even in locomotion tasks (defn 3.2), with discontinuities in the dynamics and/or rewards.
- The theory doesn't sketch out the sample complexity associated with estimation issues that arise from delayed observations, which, in the context of offline RL, would be good to get a clear grasp of.
- Intuitively, it feels like stochastic delays should be harder to estimate (at least, the estimators would have higher variance), b
1. The problem setting is meaningful, and the approach itself is sound and interesting.
2. The empirical performance of the proposed approach is good across a wide range of tasks.
1. The authors assume a known delay window and construct an augmented dataset based on that delay window. However, assuming a known exact delay window for online deployment is rather unrealistic. If one uses a worst-case delay window for offline training, the proposed approach does not establish if training with a large delay window might still give reasonable online performance if the actual latency during deployment is smaller.
2. The policy improvement statement they constructed seems insuf
Taxonomy
Topics: Reinforcement Learning in Robotics
