Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization

Simon Sinong Zhan; Qingyuan Wu; Philip Wang; Frank Yang; Xiangyu Shi; Chao Huang; Qi Zhu

arXiv:2506.00131·cs.LG·February 12, 2026


Open Access·3 Reviews

TL;DR

This paper introduces DT-CORL, a belief-based offline RL framework that uses transformer models to generate delay-robust actions, effectively bridging the gap between simulation and real-world delayed environments.

Contribution

The paper presents a novel offline RL method that leverages transformers to handle delays during deployment without requiring delayed observations during training.

Findings

01 DT-CORL outperforms baseline methods in delay robustness.

02 It improves sample efficiency compared to naive history augmentation.

03 It narrows the sim-to-real latency gap in benchmarks.

Abstract

Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations…
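The setting described in the abstract (delay-free logs for training, delayed observations at deployment) can be made concrete with a minimal sketch. It is not the authors' implementation: it assumes a fixed, known delay of `delay` steps and scalar states, and it only shows how delay-free trajectories could be re-sliced into (stale state, action buffer) inputs with the true current state as a belief target. The transformer belief predictor itself is not shown, and the names `DelayedSample` and `build_delayed_samples` are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class DelayedSample(NamedTuple):
    stale_state: float          # s_{t-d}: the last state the agent actually observed
    action_buffer: List[float]  # a_{t-d}, ..., a_{t-1}: actions issued since then
    belief_target: float        # s_t: the true current state, available only offline
    action: float               # a_t: the logged action taken at s_t


def build_delayed_samples(
    states: Sequence[float], actions: Sequence[float], delay: int
) -> List[DelayedSample]:
    """Re-slice one delay-free trajectory into delayed-observation samples.

    Assumes a fixed, known delay (an illustrative simplification).
    `states` holds s_0..s_T and `actions` holds a_0..a_{T-1}.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    samples = []
    for t in range(delay, len(actions)):
        samples.append(
            DelayedSample(
                stale_state=states[t - delay],
                action_buffer=list(actions[t - delay:t]),
                belief_target=states[t],
                action=actions[t],
            )
        )
    return samples


# Toy delay-free log with additive dynamics s_{t+1} = s_t + a_t (illustrative only).
actions = [1.0, -2.0, 0.5, 3.0, -1.0]
states = [0.0]
for a in actions:
    states.append(states[-1] + a)

samples = build_delayed_samples(states, actions, delay=2)
```

In this toy example, the stale state plus the sum of the buffered actions recovers the true current state exactly, which is why the pair (stale state, action buffer) restores the Markov property that the delayed observation alone lacks. With learned or stochastic dynamics, a belief model would have to approximate that roll-forward instead.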

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. Addresses a well-motivated and practical problem at the intersection of offline RL and robustness to delays, which is especially important in real-world scenarios.
2. The proposed integration of belief learning and policy optimization within an offline constraint framework is technically sound and justified by theoretical analysis.
3. Empirical evaluation is comprehensive, covering multiple tasks, delay types, and delay lengths, with ablations validating design choices.
4. The paper is cl

Weaknesses

1. While the transformer belief model shows advantages, its computational overhead relative to simpler models is non-trivial.
2. The experimental validation is confined to standard simulation benchmarks (D4RL). The absence of validation on a physical system or a high-fidelity simulator with realistic latency weakens the claims of practical contribution.

Reviewer 02·Rating 2·Confidence 4

Strengths

- Interesting problem, well motivated and reasonably well written paper.

Weaknesses

- The theory section is based on assumptions that aren't satisfied in practice even in locomotion tasks (Definition 3.2), which have discontinuities in the dynamics and/or rewards.
- The theory doesn't sketch out the sample complexity associated with estimation issues that arise from delayed observations, which, in the context of offline RL, would be good to get a clear grasp of.
- Intuitively, it feels like stochastic delays should be harder to estimate (at least, the estimators would have higher variance), b

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The problem setting is meaningful, and the approach itself is sound and interesting.
2. The empirical performance of the proposed approach is good across a wide range of tasks.

Weaknesses

1. The authors assume a known delay window and construct an augmented dataset based on that delay window. However, assuming an exactly known delay window for online deployment is rather unrealistic. If one uses a worst-case delay window for offline training, the proposed approach does not establish whether training with a large delay window still gives reasonable online performance when the actual latency during deployment is smaller.
2. The policy improvement statement they constructed seems insuf

Taxonomy

Topics·Reinforcement Learning in Robotics