Temporal Difference Learning with Constrained Initial Representations
Jiafei Lyu, Jingwen Yang, Zhongjian Qiao, Runze Liu, Zeyuan Liu, Deheng Ye, Zongqing Lu, Xiu Li

TL;DR
This paper proposes a novel framework called CIR that constrains initial representations in off-policy RL using Tanh activation, leading to improved stability and sample efficiency, with strong empirical results on control tasks.
Contribution
It introduces the CIR framework with Tanh-based constraints, skip connections, and convex Q-learning, providing theoretical analysis and superior empirical performance.
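To make the "convex Q-learning" component more concrete, here is a minimal sketch of a TD target that takes a convex combination of two aggregations of target-critic values, which is how the reviews below describe the loss. The source does not say which two aggregations are combined, so the choice of min and mean here is an assumption, and the names `convex_q_target` and `lam` are hypothetical, not the authors' notation.

```python
def convex_q_target(reward, done, target_qs, lam=0.5, gamma=0.99):
    """TD target built from a convex mix of two target-critic aggregations.

    ASSUMPTION: the two aggregations are the minimum and the mean of the
    target critics; the paper's actual choice is not given in this summary.
    `lam` is a hypothetical mixing weight in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    q_min = min(target_qs)                     # pessimistic aggregation
    q_mean = sum(target_qs) / len(target_qs)   # average aggregation
    q_mix = lam * q_min + (1.0 - lam) * q_mean
    # Standard one-step bootstrapped target; no bootstrap at terminal states.
    return reward + gamma * (1.0 - done) * q_mix


# Two target critics disagree on the next-state value.
target = convex_q_target(reward=1.0, done=0.0, target_qs=[2.0, 4.0], lam=0.5, gamma=0.5)
```

With `lam=0` this sketch reduces to the mean-of-critics target, and with `lam=1` to the minimum-of-critics target, so the mixing weight interpolates between the two.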
Findings
CIR outperforms baseline methods on continuous control tasks.
Theoretical analysis confirms convergence properties of the proposed approach.
Empirical results demonstrate improved stability and sample efficiency.
Abstract
Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architecture improvements and new algorithms. Despite these advances, they overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to fulfill such a constraint. We theoretically unpack the convergence property of the temporal difference learning with the Tanh function under linear function approximation. Motivated by theoretical insights, we present our Constrained Initial Representations framework, tagged CIR, which is made up of three components: (i) the Tanh activation along with normalization methods to stabilize…
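As a rough illustration of the linear setting the abstract refers to, the sketch below runs one semi-gradient TD(0) step on features that have been passed through Tanh. It is not the CIR architecture, which uses a deep network with normalization; the function names, the raw feature values, and the step size are all hypothetical and chosen only to show that Tanh keeps every feature in [-1, 1] regardless of the input scale.

```python
import math


def tanh_features(raw_features):
    """Squash each raw feature with Tanh so its magnitude is at most 1."""
    return [math.tanh(x) for x in raw_features]


def td0_update(weights, phi_s, phi_next, reward, gamma=0.99, lr=0.1):
    """One semi-gradient TD(0) step for a linear value estimate V(s) = w . phi(s)."""
    v_s = sum(w * f for w, f in zip(weights, phi_s))
    v_next = sum(w * f for w, f in zip(weights, phi_next))
    td_error = reward + gamma * v_next - v_s
    new_weights = [w + lr * td_error * f for w, f in zip(weights, phi_s)]
    return new_weights, td_error


# Hypothetical raw features with very different scales.
raw_s = [250.0, -0.3, 1e4]
raw_next = [260.0, -0.1, 1e4]

phi_s = tanh_features(raw_s)
phi_next = tanh_features(raw_next)

weights = [0.0, 0.0, 0.0]
weights, td_error = td0_update(weights, phi_s, phi_next, reward=1.0)
```

Because the update is proportional to the feature vector, bounding the features also bounds the size of each weight change for a given TD error, which is one intuitive reading of why such a constraint could stabilize training.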
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The results are presented clearly and the experiments conducted are sound and varied.
I find that the main weakness of the paper is the limited novelty. Many of the ideas in CIR have been considered before (or are a small extension of existing ideas) and some are not motivated clearly. The paper primarily combines these techniques and there seems to be a lack of focus. In more detail:
- The convex Q-learning update is a small extension to using the mean Q-value from two critics, which is prominently used in BRO for example. It's unclear how much this extension adds compared to us
**Motivation**
* The problem of studying architectural components that improve deep reinforcement learning is important and well motivated.

**Clarity**
* The text is well written and easy to follow.

**Related Work**
* The related work section is quite extensive.

**Experimental Design and Analyses**
* The choices of benchmarks and baselines are appropriate and extensive.
**Motivation**
* There are two key issues that I see with the motivation of the method specifically.
  * It is unclear to me why previous studies on regularization would not immediately extend to early neural network layers and why specifically constraining the inputs is important. This is not clearly communicated in the text either.
  * The choice of tanh activations rather than any other function seems arbitrary and not well motivated. The properties that the tanh function brings are satis
The paper presents thorough experiments on their proposed method.
Theoretical results on representation learning within a fixed-representation linear function approximation setting seem unsuited to shed light on the effectiveness of the method. In addition, the analysis assumes that the tanh function is applied after the representation function $\phi$. However, most of the representation learning in the empirical method happens *after* the tanh function. The method analyzed here is more akin to OFN (Hussing et al.) or hyperspherical normalization (Lee et al.)
Overall, the paper is quite easy to follow. It suggests a novel neural network architecture for Q-learning based on a combination of Tanh for the early layers and a U-net for the main body of the network, and it proposes a novel TD loss based on a convex combination between two types of aggregation of target Q-networks. To confirm the algorithmic design choices, the authors perform extensive ablation studies that confirm the necessity of a regularisation layer at the beginning of the neural network
As mentioned before, while CIR achieves a clear improvement on HumanoidBench tasks, it is generally worse than the baselines on DMC tasks, as shown in Figure 4. Most of the theoretical results are not that original, considering that prior works have investigated the stability of TD learning with normalisation layers. Compare to https://arxiv.org/pdf/2407.04811 (missing from the cited literature), where they did a more nuanced analysis of TD stability, where they have decoupled the instability coming
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Domain Adaptation and Few-Shot Learning
