Decoupled Q-Chunking
Qiyang Li, Seohong Park, Sergey Levine

TL;DR
This paper introduces a novel reinforcement learning algorithm that decouples critic and policy chunk lengths, improving long-horizon task performance by combining multi-step value propagation with flexible policy action chunking.
Contribution
It proposes a method to optimize policies against a distilled critic, enabling effective use of shorter policy chunks while leveraging longer critic chunks for better value estimation.
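This summary does not say how the distilled critic is built. As an illustration only, the sketch below assumes one plausible construction: a partial action chunk is scored by the best value the full-chunk critic assigns to any completion of it. The tabular dictionary, the function names `distill_partial_critic` and `greedy_partial_chunk`, and the toy values are all hypothetical and not taken from the paper.

```python
def distill_partial_critic(full_q, partial_len):
    """Assumed distillation rule (not confirmed by the source):
    partial_q[(state, prefix)] = max over full chunks that start with prefix."""
    partial_q = {}
    for (state, chunk), value in full_q.items():
        key = (state, chunk[:partial_len])
        partial_q[key] = max(partial_q.get(key, float("-inf")), value)
    return partial_q


def greedy_partial_chunk(partial_q, state):
    """Pick the short chunk with the highest distilled value in this state."""
    candidates = {c: v for (s, c), v in partial_q.items() if s == state}
    return max(candidates, key=candidates.get)


# Hypothetical full-chunk critic: chunk length 2, binary actions, one state.
full_q = {
    ("s0", (0, 0)): 1.0,
    ("s0", (0, 1)): 3.0,
    ("s0", (1, 0)): 2.0,
    ("s0", (1, 1)): 2.5,
}

# The policy acts on chunks of length 1 while the critic was defined on length 2.
partial_q = distill_partial_critic(full_q, partial_len=1)
best = greedy_partial_chunk(partial_q, "s0")
```

In this toy case the policy commits only to the first action and can re-plan at the next step, while its score still reflects values learned over the longer chunk.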
Findings
Outperforms prior methods on long-horizon offline goal-conditioned tasks
Effectively balances multi-step value propagation with policy reactivity
Demonstrates robustness in complex, extended horizon environments
Abstract
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal for environments that require policy reactivity and also challenging to model especially when the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a…
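To make the "speeding up value backup" point concrete, here is a minimal sketch of a bootstrapped target for a critic defined over an action chunk. It assumes the standard discounted multi-step form (rewards collected during the chunk, plus the discounted value estimate at the state reached afterwards). The function name `chunked_td_target`, its arguments, and the numbers are illustrative and not taken from the paper.

```python
def chunked_td_target(rewards, gamma, bootstrap_value):
    """Discounted sum of the rewards observed while executing one action chunk,
    plus the discounted critic estimate at the state reached after the chunk."""
    h = len(rewards)  # chunk length
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted_rewards + gamma ** h * bootstrap_value


# Made-up example: a chunk of 3 actions, each yielding reward 1.0.
target = chunked_td_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=8.0)
print(target)  # 2.75
```

With chunk length h, one backup carries value information across h environment steps, so a reward T steps away is reached after roughly T/h bootstrapped backups rather than T. A chunk of length 1 recovers the usual one-step TD target.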
Peer Reviews
Decision: ICLR 2026 Poster
- Well-written and detailed theoretical investigation.
- Strong and robust results.

- Some minor grammatical errors (e.g., l. 26, 182).
- There is no discussion of the computational overhead compared to the baseline methods (e.g., from maintaining the additional distilled critic). Since the performance gap to the baselines is substantial, providing a short intuition should suffice.

1. The core idea of decoupling the policy and critic chunk sizes is novel, effectively addressing a known trade-off in multi-step Q-learning to get "the best of both worlds."
2. The paper provides deep theoretical backing for Q-learning with action chunking, formally identifying and quantifying bias, and proving the conditions under which the approach is superior.
3. It demonstrates superior, state-of-the-art results on challenging long-horizon tasks, significantly outperforming previous methods.

1. The approach still suffers from the inherent bias of open-loop value evaluation in action chunking and lacks a mechanism to actively correct it.
2. Its theoretical guarantees rely on a strong "open-loop consistency" assumption for the offline dataset, which may not hold in many real-world scenarios, limiting the generality of the claims.
3. The use of a fixed, global chunk size for both the policy and critic is a limitation, as the optimal action horizon might vary depending on the state.
4.

1. This paper derives explicit bias and near-optimality bounds, offering theoretical insights into when action-chunking methods succeed or fail.
2. The empirical performance of DQC is strong.

1. Theorem 4.6 is somewhat idealized. It depends on the open-loop consistency assumption, which is hard to satisfy in practice for realistic offline datasets, especially in long-horizon settings. In such cases $\epsilon_h$ may not be small, so the errors scale and the bound can become vacuous. Could the authors provide an empirical analysis of how large $\epsilon_h$ is in the OGBench datasets?
2. Theorem 4.6 does not seem to cover the decoupled critics. Is this bound still valid when the
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Domain Adaptation and Few-Shot Learning
