Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
Ming-Hong Chen, Kuan-Chen Pan, You-De Huang, Xi Liu, Ping-Chun Hsieh

TL;DR
This paper introduces QAvatar, a cross-domain RL method that uses cross-domain Bellman consistency and a hybrid critic to transfer knowledge between domains with different state or action spaces while guarding against negative transfer.
Contribution
The paper proposes a new framework using cross-domain Bellman consistency and hybrid critics, with an adaptive weighting scheme for effective transfer in cross-domain RL.
Findings
QAvatar achieves reliable transfer across various benchmarks.
It effectively leverages source domain Q functions for target learning.
The method demonstrates superior performance over existing approaches.
Abstract
Cross-domain reinforcement learning (CDRL) is meant to improve the data efficiency of RL by leveraging the data samples collected from a source domain to facilitate the learning in a similar target domain. Despite its potential, cross-domain transfer in RL is known to have two fundamental and intertwined challenges: (i) The source and target domains can have distinct state space or action space, and this makes direct transfer infeasible and thereby requires more sophisticated inter-domain mappings; (ii) The transferability of a source-domain model in RL is not easily identifiable a priori, and hence CDRL can be prone to negative effect during transfer. In this paper, we propose to jointly tackle these two challenges through the lens of cross-domain Bellman consistency and hybrid critic. Specifically, we first introduce the notion of cross-domain Bellman consistency as…
Peer Reviews
Decision: ICLR 2026 Poster
The paper's formal definition of cross-domain Bellman consistency, and its incorporation into the analysis of transferability, represents a strong conceptual advance over prior methods. The paper presents a detailed theoretical analysis, with proofs supporting the soundness of QAvatar's adaptive weighting scheme.
Although the hybrid critic can theoretically downweight poor source Q-functions, the effectiveness of knowledge transfer fundamentally depends on the quality of the state/action mapping functions $\phi$ and $\psi$. When mapping is highly ambiguous or misaligned, both the source-supplied and hybrid Q-values may be misleading. While this paper classifies major approaches and discusses adversarial alignment and cycle consistency, the comparison or discussion with alternative alignment strategies (
- The central concept of using a hybrid critic weighted by a measure of "transferability" is very clever. The automatic, hyperparameter-free weighting scheme, which pits the target TD error against the cross-domain Bellman error, is an elegant solution to the problem of negative transfer.
- Learning the state-action mappings by minimizing the cross-domain Bellman loss (a reward- and dynamics-aware objective) is a strong alternative to unsupervised, dynamics-agnostic methods like cycle-consistenc
- The practical implementation of QAvatar is very complex. It requires running a full SAC algorithm while also training two normalizing flow models for the state and action mappings, and also calculating two separate Bellman errors (target and cross-domain) at every step (or every $N_\alpha$ steps) to compute the weight. The paper notes this results in "about twice the training time of SAC," which is a major practical limitation and cost.
- The paper uses normalizing flows for the mappings, whic
1. Clean problem framing for CDRL with mismatched spaces; explicit transferability notion via a Bellman-style residual
2. Hybrid critic that can down-weight a poor source, limiting negative transfer
3. A formal tabular NPG analysis that separates learning progress and approximation error
4. The experimental evaluation is comprehensive and convincing, covering diverse tasks and challenging transfer scenarios
1. The theory-practice gap is substantial: the convergence analysis assumes on-policy, tabular NPG, while the implementation is off-policy, deep RL with replay buffers and twin Q-networks. The critical assumptions of the bound (e.g., on-policy error norms) do not hold in the implemented algorithm.
2. The claim of being "hyperparameter-free" is overstated: the update frequency $N_\alpha$ for the adaptive weight acts as a hidden hyperparameter, and no ablation study is provided.
3. The loss to u
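The hybrid-critic idea discussed in the reviews above (a source critic viewed through the mappings $\phi$ and $\psi$, blended with a target critic according to how the two Bellman errors compare) can be illustrated with a small tabular sketch. This is not the authors' implementation: the weighting rule in `source_weight`, the greedy (max) backup, the use of integer lookup tables for `phi` and `psi`, and all function names are assumptions made for illustration. The paper itself is described as using SAC critics and normalizing-flow mappings.

```python
import numpy as np


def target_td_error(q_tar, batch, gamma=0.99):
    """Mean absolute TD error of the target critic on target-domain transitions."""
    s, a, r, s2 = batch
    td_target = r + gamma * q_tar[s2].max(axis=1)
    return float(np.mean(np.abs(td_target - q_tar[s, a])))


def cross_domain_bellman_error(q_src, phi, psi, batch, n_target_actions, gamma=0.99):
    """Mean absolute Bellman residual of the source critic, evaluated on
    target-domain transitions through the state map phi and action map psi."""
    s, a, r, s2 = batch
    all_actions = np.arange(n_target_actions)
    # Source Q-values at the mapped next state, for every mapped target action.
    q_next = q_src[phi[s2][:, None], psi[all_actions][None, :]].max(axis=1)
    q_now = q_src[phi[s], psi[a]]
    return float(np.mean(np.abs(r + gamma * q_next - q_now)))


def source_weight(err_tar, err_cd, eps=1e-8):
    """ASSUMED rule (not taken from the paper): the weight on the source critic
    shrinks toward 0 as its cross-domain Bellman error grows relative to the
    target critic's TD error."""
    return err_tar / (err_tar + err_cd + eps)


def hybrid_q(q_tar, q_src, phi, psi, alpha):
    """Convex combination of the target critic and the mapped source critic.
    Returns an array of shape (n_target_states, n_target_actions)."""
    mapped = q_src[phi[:, None], psi[None, :]]
    return (1.0 - alpha) * q_tar + alpha * mapped
```

In this sketch, a source critic that is badly inconsistent with target-domain rewards and dynamics receives a weight near 0, so the hybrid critic falls back to the target critic. In a training loop the two errors would be recomputed periodically (the reviews mention every $N_\alpha$ steps) rather than once.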
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Adaptive Dynamic Programming Control
