Cross-Domain Offline Policy Adaptation via Selective Transition Correction
Mengbei Yan, Jiafei Lyu, Shengjie Sun, Zhongjian Qiao, Jingwen Yang, Zichuan Lin, Deheng Ye, Xiu Li

TL;DR
This paper introduces the Selective Transition Correction (STC) algorithm for cross-domain offline reinforcement learning, effectively aligning source data with target dynamics to improve policy adaptation.
Contribution
The paper proposes a novel method that modifies source domain data using inverse policy and reward models, combined with a forward dynamics model, to better match target domain dynamics.
Findings
STC outperforms existing baselines in environments with dynamics shifts.
The approach effectively aligns source data with target dynamics, improving policy learning.
Experimental results demonstrate the robustness of STC across various environments.
Abstract
It remains a critical challenge to adapt policies across domains with mismatched dynamics in reinforcement learning (RL). In this paper, we study cross-domain offline RL, where an offline dataset from another similar source domain can be accessed to enhance policy learning upon a target domain dataset. Directly merging the two datasets may lead to suboptimal performance due to potential dynamics mismatches. Existing approaches typically mitigate this issue through source domain transition filtering or reward modification, which, however, may lead to insufficient exploitation of the valuable source domain data. Instead, we propose to modify the source domain data into the target domain data. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly achieving alignment with the target dynamics. Since limited…
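As a rough illustration of the correction idea described in the abstract, the sketch below relabels the actions and rewards of source transitions using an inverse policy model and a reward model. The selection rule (accept a correction only when a forward dynamics model predicts the observed next state better with the corrected action) is an assumption made for this example, since the abstract is truncated before it describes that step. The function name `correct_transitions`, the toy linear dynamics, and the three model callables are all hypothetical stand-ins for learned models, not the authors' implementation.

```python
import numpy as np

def correct_transitions(s, a, r, s_next, inverse_policy, reward_model, forward_model):
    """Hypothetical selective correction of source transitions (s, a, r, s_next)."""
    # Inverse policy: the action that would explain s -> s_next under target dynamics.
    a_corr = inverse_policy(s, s_next)
    # Next-state prediction error of the target forward model for each action.
    err_orig = np.linalg.norm(forward_model(s, a) - s_next, axis=1)
    err_corr = np.linalg.norm(forward_model(s, a_corr) - s_next, axis=1)
    # Assumed selection rule: keep a correction only if it lowers the error.
    accept = err_corr < err_orig
    a_out = np.where(accept[:, None], a_corr, a)
    r_out = np.where(accept, reward_model(s, a_corr), r)
    return a_out, r_out, accept

# Toy setup (assumed): source dynamics s' = s + a, target dynamics s' = s + 2a.
s = np.array([[0.0, 0.0], [1.0, 1.0]])
a = np.array([[1.0, 0.0], [0.0, 2.0]])
s_next = s + a                      # generated by the source dynamics
r = -np.sum(a ** 2, axis=1)         # rewards as logged in the source data

# Stand-ins for models that would be learned from the target dataset.
inverse_policy = lambda s, s_next: (s_next - s) / 2.0
forward_model = lambda s, a: s + 2.0 * a
reward_model = lambda s, a: -np.sum(a ** 2, axis=1)

a_out, r_out, accept = correct_transitions(
    s, a, r, s_next, inverse_policy, reward_model, forward_model
)
print(a_out, r_out, accept)
```

In this toy case every correction is accepted, because halving the action reproduces the observed next state exactly under the assumed target dynamics. With learned, imperfect models the acceptance mask would typically be mixed, and rejected transitions keep their original action and reward.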
Peer Reviews
Decision: Submitted to ICLR 2026
The paper theoretically analyzes the dynamics and value discrepancy induced by transition corrections with explicit assumptions and supporting proofs. The paper conducts extensive experiments on various task domains and compares with sufficient baselines.
The success of STC depends heavily on the quality of the inverse policy and forward dynamics models. Insufficient or low-diversity target data will likely lead to suboptimal or even detrimental corrections. The Taylor expansion in section 4.1 assumes local smoothness and is clipped for stability, but the limitations of this approximation—especially in highly nonlinear reward landscapes—have not yet been explored empirically or theoretically. Such approximations may produce inaccurate rewards fo
* STC can explicitly make the source data align with the target dynamics.
* STC achieves superior performance against existing baselines.
* The method trains an inverse policy and a reward model, which may bring more computational burden.
- The paper is well-structured, with clear introduction of the methodology and experimental setup.
- The paper proposes modifying source domain transitions to align with the target domain, as opposed to filtering data. In principle, this method is expected to improve the efficiency of data usage.
- The core idea is intuitive and reasonable, yet the proposed algorithm appears to be a forced assembly of distinct modules, lacking rigorous validation. In particular, the integration of the forward dynamics model undermines the paper’s logical coherence, while the designs of the inverse policy model and reward model also become largely meaningless.
- There is a large disconnect between the theory part and the proposed algorithm. While the THEORETICAL ANALYSIS aims to verify the validity of th
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning
