Intentionally-underestimated Value Function at Terminal State for Temporal-difference Learning with Mis-designed Reward
Taisuke Kobayashi

TL;DR
This paper proposes a method to intentionally underestimate the value function at episode termination in TD learning, improving policy stability and robustness across different reward designs and termination conditions.
Contribution
It introduces a novel approach to adjust value estimation at termination, addressing issues caused by traditional zero-value assumptions in TD learning.
Findings
The method stabilizes policy learning in various tasks.
It prevents overestimation caused by termination handling.
Experimental results confirm improved policy optimality.
Abstract
Robot control using reinforcement learning has become popular, but its learning process generally terminates halfway through an episode for safety and time-saving reasons. This study addresses the problem of the most popular exception handling that temporal-difference (TD) learning performs at such termination. That is, by forcibly assuming zero value after termination, unintentionally implicit underestimation or overestimation occurs, depending on the reward design in the normal states. When the episode is terminated due to task failure, the failure may be highly valued with the unintentional overestimation, and the wrong policy may be acquired. Although this problem can be avoided by paying attention to the reward design, it is essential in practical use of TD learning to review the exception handling at termination. This paper therefore proposes a method to intentionally…
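The bias described in the abstract can be illustrated with a small numeric sketch. This is not the paper's proposed method, only the standard one-step TD target with the usual zero-value convention at termination. The function names, the constant per-step reward, and the `terminal_value` parameter are assumptions introduced for illustration.

```python
def td_target(reward, next_value, done, gamma=0.99, terminal_value=0.0):
    """One-step TD target: r + gamma * V(s').

    When the episode has terminated, V(s') is replaced by `terminal_value`.
    The conventional choice is 0.0; any other value is a hypothetical knob here.
    """
    bootstrap = terminal_value if done else next_value
    return reward + gamma * bootstrap


def steady_state_value(reward, gamma=0.99):
    """Value of a state that keeps receiving the same reward forever: r / (1 - gamma)."""
    return reward / (1.0 - gamma)


if __name__ == "__main__":
    gamma = 0.99
    # Two reward designs for the normal (non-terminal) states.
    for reward in (-1.0, 1.0):
        v_normal = steady_state_value(reward, gamma)
        keep_going = td_target(reward, v_normal, done=False, gamma=gamma)
        terminated = td_target(reward, v_normal, done=True, gamma=gamma)
        print(f"reward={reward:+.0f}  continue={keep_going:.1f}  terminate={terminated:.1f}")
```

With a per-step reward of -1, continuing is worth roughly -100 while terminating is worth -1, so a failure that ends the episode looks attractive (overestimation). With a per-step reward of +1 the relation flips and termination is underestimated. Passing a `terminal_value` below the normal-state value removes the incentive to terminate in the first case; how the paper actually chooses that value is not stated in the truncated abstract.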
Taxonomy
Topics: Reinforcement Learning in Robotics · Muscle activation and electromyography studies · Robot Manipulation and Learning
