Less is More: Clustered Cross-Covariance Control for Offline RL
Nan Qiao, Sheng Yue, Shuning Wang, Yongheng Deng, Ju Ren

TL;DR
This paper introduces Clustered Cross-Covariance Control for TD (C^4), a novel offline RL method that mitigates harmful cross-covariance effects in OOD data, leading to more stable learning and up to 30% higher returns.
Contribution
It proposes buffer partitioning and gradient penalties to reduce covariance bias, improving offline RL performance especially with limited or OOD-dominated datasets.
Findings
Achieves up to 30% improvement in returns over prior methods.
Enhances stability in offline RL with small datasets.
Effectively mitigates distributional shift issues.
Abstract
A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared error objective induces a harmful TD cross covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies: partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD (C^4). We also introduce an explicit gradient-based corrective penalty that cancels the covariance induced bias within each update. We prove that buffer…
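The partitioned buffer sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `PartitionedBuffer`, the helper `kmeans_labels`, the use of plain k-means as the clustering step, and the size-proportional choice of partition are all assumptions made for illustration. The feature vectors used for clustering are left to the caller (the reviews mention a gradient feature space), and the default `k=5` only mirrors the value one reviewer reports.

```python
import random
from collections import defaultdict


def kmeans_labels(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means over lists of floats; returns one cluster id per row.

    Hypothetical helper: any clustering routine could stand in here.
    """
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, x in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
            )
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [features[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


class PartitionedBuffer:
    """Replay buffer whose minibatches are drawn from a single partition."""

    def __init__(self, transitions, features, k=5, seed=0):
        self.transitions = transitions
        self.rng = random.Random(seed)
        labels = kmeans_labels(features, k, seed=seed)
        # Map each cluster id to the indices of the transitions it contains.
        self.partitions = defaultdict(list)
        for idx, label in enumerate(labels):
            self.partitions[label].append(idx)

    def sample(self, batch_size):
        # Assumption: pick a partition with probability proportional to its size,
        # then sample with replacement inside that partition only.
        ids = list(self.partitions)
        weights = [len(self.partitions[c]) for c in ids]
        cluster = self.rng.choices(ids, weights=weights, k=1)[0]
        members = self.partitions[cluster]
        picked = [self.rng.choice(members) for _ in range(batch_size)]
        return cluster, [self.transitions[i] for i in picked]
```

In a training loop, each TD update would call `buffer.sample(batch_size)` in place of uniform sampling over the whole buffer, so every minibatch stays within one localized partition. The gradient-based corrective penalty described in the abstract is not covered by this sketch.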
Peer Reviews
Decision: ICLR 2026 Poster
The paper impressively connects a theoretically grounded understanding of TD learning instability with a practical, well-performing algorithm. The C4 framework stands out for translating a nontrivial analytical insight into empirically robust improvements, showing both conceptual clarity and engineering maturity. **Strong theoretical foundation**: The paper begins from a clear theoretical diagnosis of a subtle but critical problem in temporal-difference (TD) learning — the emergence of harmful
I have only one weakness to point out. **Sensitivity to the number of clusters K**: The proposed method critically depends on the number of clusters used to partition the gradient feature space. If $K$ is too small, dissimilar gradient modes are mixed, leaving between-cluster covariance unremoved; if $K$ is too large, covariance estimation becomes noisy due to small sample counts per cluster. However, the paper fixes $K$ (typically 5) for all experiments without analyzing sensitivity or s
**[S1]** The paper presents novel and interesting analysis that identifies a cross-covariance term that acts as a harmful implicit regularizer in the standard TD loss function, and proposes algorithmic modifications to address this issue in offline RL. **[S2]** Theoretical results are provided that justify the algorithmic design choices and demonstrate their impact on other components of standard offline RL algorithms such as the policy improvement update. **[S3]** Extensive experimental resul
**[W1]** Experiments primarily focus on small datasets and do not include results on the benchmarks for the standard dataset size (e.g., Figure 6 considers a max dataset size that is 10% of the full dataset in most cases), so it is difficult to understand if there are performance tradeoffs on large datasets in order to achieve robust performance on small datasets. **[W2]** The organization and presentation of the work could be improved. In particular, the organization of Sections 5 and 6 requir
Originality: The work offers a novel theoretical perspective linking TD error variance to cross-time gradient covariance. Technical quality: The derivations are mathematically sound and clear. Theorem 1 decomposes the TD variance into beneficial (variance) and harmful (covariance) components, while Theorem 2 proves that single-cluster sampling removes between-cluster interference. The proposed loss (Eq. 12–14) is principled and compatible with existing algorithms. Clarity: The paper is clearl
Outdated Baselines: The empirical evaluation does not include recent state-of-the-art methods, particularly A2PR (Adaptive Advantage-Guided Policy Regularization, arXiv:2405.19909), which achieves significantly higher returns on the same D4RL tasks. Without comparing to such strong baselines, the claimed 30% improvement is not convincing. **Please answer the concern in the response.** Experimental Scope and Depth: Experiments are limited to low-dimensional MuJoCo tasks with 10k samples; no resu
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques
