TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi

TL;DR
This paper identifies a structural policy mismatch in model-based reinforcement learning that causes value overestimation and proposes a simple regularization method to mitigate it, significantly improving performance in complex control tasks.
Contribution
It introduces a minimalist policy regularization technique to reduce out-of-distribution queries, addressing value overestimation in model-based RL without extra computation.
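As a rough illustration of what a policy regularization term of this kind could look like, the sketch below adds a log-likelihood penalty on planner-generated actions to a standard value-maximizing policy loss. This is an assumption-laden sketch, not the authors' implementation: the diagonal Gaussian policy, the function names `gaussian_log_prob` and `regularized_policy_loss`, and the weight `beta` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_log_prob(actions, mean, log_std):
    """Log-density of `actions` under a diagonal Gaussian, summed over action dims."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * ((actions - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1)


def regularized_policy_loss(q_values, mean, log_std, planner_actions, beta=1.0):
    """Hypothetical regularized policy loss (illustrative assumption only).

    q_values:        Q(s, a_pi) for actions sampled from the policy prior, shape (batch,)
    mean, log_std:   Gaussian policy parameters at each state, shape (batch, action_dim)
    planner_actions: actions the planner actually executed, shape (batch, action_dim)
    beta:            assumed weight on the regularization term
    """
    # Likelihood of the planner's actions under the learned policy prior.
    log_prob = gaussian_log_prob(planner_actions, mean, log_std)
    # Maximize value while keeping the policy close to the data-generating actions.
    return float(np.mean(-q_values - beta * log_prob))
```

With `beta = 0` the loss reduces to plain value maximization. A larger `beta` pulls the policy prior toward actions that appear in the planner-generated data, which is one way to reduce the number of out-of-distribution actions queried during value learning.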
Findings
Improved performance over TD-MPC2 in complex humanoid tasks.
Significant reduction in value overestimation.
Effective mitigation of policy mismatch issues.
Abstract
Model-based reinforcement learning algorithms that combine model-based planning with a learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often result in persistent value overestimation. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate this mismatch in a minimalist way, we propose a policy regularization term that reduces out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Simulation Techniques and Applications · Advanced Control Systems Optimization
