Less is More: Clustered Cross-Covariance Control for Offline RL

Nan Qiao; Sheng Yue; Shuning Wang; Yongheng Deng; Ju Ren

arXiv:2601.20765·cs.LG·February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Clustered Cross-Covariance Control for TD (C^4), a novel offline RL method that mitigates harmful cross-covariance effects in OOD data, leading to more stable learning and up to 30% higher returns.

Contribution

It proposes buffer partitioning and gradient penalties to reduce covariance bias, improving offline RL performance especially with limited or OOD-dominated datasets.

Findings

01. Achieves up to 30% improvement in returns over prior methods.

02. Enhances stability in offline RL with small datasets.

03. Effectively mitigates distributional shift issues.

Abstract

A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared-error objective induces a harmful TD cross-covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies: partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD (C^4). We also introduce an explicit gradient-based corrective penalty that cancels the covariance-induced bias within each update. We prove that buffer…
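The partitioned buffer sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `kmeans` helper, the `PartitionedBuffer` class, and the choice to pick a partition with probability proportional to its size are all assumptions made for illustration, and the per-transition feature vectors are left to the caller (Reviewer 01 indicates the paper clusters in a gradient feature space with *$K$* typically 5). The only property shown is that every minibatch is drawn from a single partition.

```python
import numpy as np


def kmeans(features, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns one label per row."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels


class PartitionedBuffer:
    """Hypothetical replay buffer that samples each minibatch from one cluster."""

    def __init__(self, features, k=5, seed=0):
        # k=5 mirrors the value Reviewer 01 reports; it is only a default here.
        self.labels = kmeans(features, k, seed=seed)
        partitions = [np.flatnonzero(self.labels == j) for j in range(k)]
        # Drop empty partitions so sampling never selects one.
        self.partitions = [p for p in partitions if p.size > 0]
        self.rng = np.random.default_rng(seed)

    def sample(self, batch_size):
        """Return transition indices that all belong to a single partition."""
        sizes = np.array([p.size for p in self.partitions], dtype=float)
        # Assumption: choose a partition with probability proportional to its size.
        j = self.rng.choice(len(self.partitions), p=sizes / sizes.sum())
        # Sample with replacement so small partitions can still fill a batch.
        return self.rng.choice(self.partitions[j], size=batch_size, replace=True)
```

In use, the returned indices would select the transitions for one TD update, so the gradients averaged within that update come from one localized region of the buffer. The gradient-based corrective penalty mentioned in the abstract is not sketched here because the truncated text does not specify its form.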

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

The paper impressively connects a theoretically grounded understanding of TD learning instability with a practical, well-performing algorithm. The C4 framework stands out for translating a nontrivial analytical insight into empirically robust improvements, showing both conceptual clarity and engineering maturity. **Strong theoretical foundation**: The paper begins from a clear theoretical diagnosis of a subtle but critical problem in temporal-difference (TD) learning — the emergence of harmful

Weaknesses

I have only one weakness to point out. **Sensitivity to the number of clusters K**: The proposed method critically depends on the number of clusters used to partition the gradient feature space. If *$K$* is too small, dissimilar gradient modes are mixed, leaving between-cluster covariance unremoved; if *$K$* is too large, covariance estimation becomes noisy due to small sample counts per cluster. However, the paper fixes *$K$* (typically 5) for all experiments without analyzing sensitivity or s

Reviewer 02 · Rating 6 · Confidence 3

Strengths

**[S1]** The paper presents novel and interesting analysis that identifies a cross-covariance term that acts as a harmful implicit regularizer in the standard TD loss function, and proposes algorithmic modifications to address this issue in offline RL. **[S2]** Theoretical results are provided that justify the algorithmic design choices and demonstrate their impact on other components of standard offline RL algorithms such as the policy improvement update. **[S3]** Extensive experimental resul

Weaknesses

**[W1]** Experiments primarily focus on small datasets and do not include results on the benchmarks for the standard dataset size (e.g., Figure 6 considers a max dataset size that is 10% of the full dataset in most cases), so it is difficult to understand if there are performance tradeoffs on large datasets in order to achieve robust performance on small datasets. **[W2]** The organization and presentation of the work could be improved. In particular, the organization of Sections 5 and 6 requir

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Originality: The work offers a novel theoretical perspective linking TD error variance to cross-time gradient covariance. Technical quality: The derivations are mathematically sound and clear. Theorem 1 decomposes the TD variance into beneficial (variance) and harmful (covariance) components, while Theorem 2 proves that single-cluster sampling removes between-cluster interference. The proposed loss (Eq. 12–14) is principled and compatible with existing algorithms. Clarity: The paper is clearl

Weaknesses

Outdated Baselines: The empirical evaluation does not include recent state-of-the-art methods, particularly A2PR (Adaptive Advantage-Guided Policy Regularization, arXiv:2405.19909), which achieves significantly higher returns on the same D4RL tasks. Without comparing to such strong baselines, the claimed 30% improvement is not convincing. **Please answer the concern in the response.** Experimental Scope and Depth: Experiments are limited to low-dimensional MuJoCo tasks with 10k samples; no resu

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques