Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

Zhongjian Qiao; Rui Yang; Jiafei Lyu; Xiu Li; Zhongxiang Dai; Zhuoran Yang; Siyang Gao; Shuang Qiu

arXiv:2512.02486·cs.LG·March 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces DROCO, a novel offline RL algorithm that enhances robustness against dynamics shifts during both training and testing, addressing a key gap in cross-domain offline RL.

Contribution

The paper proposes a dual-robust framework with a new Bellman operator and techniques to improve test-time robustness in cross-domain offline RL.

Findings

01 DROCO outperforms baseline methods in various dynamics shift scenarios.

02 The RCB operator improves test-time robustness without sacrificing train-time stability.

03 Techniques like dynamic value penalty and Huber loss mitigate value estimation errors.
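To make the Huber loss mentioned in the findings concrete, here is a minimal NumPy sketch of the standard Huber loss applied to temporal-difference errors. The function name `huber_loss` and the threshold `kappa` are illustrative assumptions; how DROCO applies the loss in its value update is not specified on this page.

```python
import numpy as np


def huber_loss(td_error, kappa=1.0):
    """Standard Huber loss: quadratic for |x| <= kappa, linear beyond.

    `kappa` is an assumed threshold, not a value taken from the paper.
    """
    abs_err = np.abs(td_error)
    quadratic = 0.5 * td_error ** 2
    linear = kappa * (abs_err - 0.5 * kappa)
    return np.where(abs_err <= kappa, quadratic, linear)


# Hypothetical TD errors: one small, one moderate, one outlier.
td_errors = np.array([0.5, -2.0, 10.0])
losses = huber_loss(td_errors)
squared = 0.5 * td_errors ** 2  # squared-error loss, for comparison
```

Because the loss grows only linearly for large errors, a single outlier target contributes far less to the gradient than it would under a squared loss, which is the usual reason for choosing it when value targets may be badly overestimated.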

Abstract

Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics…
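As background for the robust Bellman operator named in the abstract, the following is a minimal tabular sketch of a generic worst-case Bellman backup from the robust RL literature. It is not the paper's RCB operator: the finite set of candidate transition kernels, the function names, and the toy MDP are all assumptions made for illustration.

```python
import numpy as np


def standard_backup(Q, R, P, gamma=0.9):
    """Standard Bellman optimality backup. Q, R: (S, A); P: (S, A, S)."""
    return R + gamma * (P @ Q.max(axis=1))


def robust_backup(Q, R, kernels, gamma=0.9):
    """Worst-case backup over a finite, assumed uncertainty set.

    kernels: (K, S, A, S), a stack of K candidate transition kernels.
    """
    next_vals = kernels @ Q.max(axis=1)   # (K, S, A) expected next values
    return R + gamma * next_vals.min(axis=0)


# Toy MDP (hypothetical): 2 states, 1 action; state 1 is rewarding and absorbing.
R = np.array([[0.0], [1.0]])
nominal = np.array([[[0.1, 0.9]], [[0.0, 1.0]]])    # (S, A, S)
perturbed = np.array([[[0.5, 0.5]], [[0.0, 1.0]]])  # slower progress to state 1
kernels = np.stack([nominal, perturbed])            # (K, S, A, S)

Q_std = np.zeros((2, 1))
Q_rob = np.zeros((2, 1))
for _ in range(500):
    Q_std = standard_backup(Q_std, R, nominal)
    Q_rob = robust_backup(Q_rob, R, kernels)
```

Since the nominal kernel belongs to the uncertainty set, the robust values can never exceed the standard ones. The gap between `Q_rob` and `Q_std` reflects how much value the policy is assumed to lose under the worst perturbation considered.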

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

+ The goal of achieving both train-time and test-time robustness in offline RL is well motivated.
+ The RCB operator, which separates robust and standard updates, is simple, especially given the duality result, which reduces the uncertainty set over distributions to one over states.
+ The paper gives good empirical results on RL benchmarks, showing that the approach outperforms baselines under moderate dynamics shifts.

Weaknesses

- The robust Bellman backups and the derived contraction properties are standard. As far as I understand, the main idea is to split robust and standard updates, which is conceptually incremental. The paper seems conceptually incremental for ICLR.
- The setup is restrictive as only dynamics shift is modeled. Typically, there is shift in reward, observation, state/action spaces, etc.
- The theoretical results largely follow directly from known robust RL results. For example Prop 4.1 showing the c

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The theoretical justification is solid, covering both the idealized case (Proposition 4.1) and the practical case (Proposition 4.3).
- The motivation and analysis for both train-time and test-time robustness (Propositions 4.4 and 4.5) are meaningful and potentially impactful, although their direct relevance to practitioners might be limited.
- The empirical evaluation is convincing. The chosen baselines are state-of-the-art methods for offline RL and cross-domain offline RL, yet the proposed algorithm (

Weaknesses

I do not see any major weaknesses worth highlighting.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The problem formulation is both novel and practically important. While existing cross-domain offline RL methods focus exclusively on train-time robustness, this work is the first to systematically study both train-time and test-time robustness together. The motivation is compelling, with Figure 1 clearly demonstrating that policies trained with limited target domain data are highly vulnerable to test-time dynamics perturbations. This observation reveals a critical gap in current approaches that

Weaknesses

My primary concern is the insufficient analysis of generalization. Moreover, the experiments are confined entirely to MuJoCo tasks. The authors could consider additional experiments for validation. The paper's own sensitivity analysis (Section 5.3) shows that the optimal values for β and δ vary significantly across different tasks and datasets. In a real-world offline scenario, it is nearly impossible to tune these parameters to their optimal values for a new task due to the inability to validate ag

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques