Anti-Overestimation Dialogue Policy Learning for Task-Completion Dialogue System
Chang Tian, Wenpeng Yin, Marie-Francine Moens

TL;DR
This paper introduces a dynamic partial average estimator to reduce overestimation in reinforcement learning for dialogue systems, improving stability and performance across multiple datasets.
Contribution
It proposes the dynamic partial average estimator (DPAV), a method that adaptively mitigates overestimation bias in RL-based dialogue policy learning, with theoretical convergence guarantees.
Findings
Achieves results better than or comparable to top baselines
Requires a lower computational load than existing methods
Provides theoretical proof of convergence and bias bounds
Abstract
A dialogue policy module is an essential part of task-completion dialogue systems. Recently, increasing interest has focused on reinforcement learning (RL)-based dialogue policy. Its favorable performance and wise action decisions rely on an accurate estimation of action values. The overestimation problem is a widely known issue of RL since its estimate of the maximum action value is larger than the ground truth, which results in an unstable learning process and suboptimal policy. This problem is detrimental to RL-based dialogue policy learning. To mitigate this problem, this paper proposes a dynamic partial average estimator (DPAV) of the ground truth maximum action value. DPAV calculates the partial average between the predicted maximum action value and minimum action value, where the weights are dynamically adaptive and problem-dependent. We incorporate DPAV into a deep Q-network as…
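The partial-average idea described in the abstract can be illustrated with a minimal sketch. It assumes the estimator is a convex combination of the maximum and minimum predicted action values, which is one reading of "partial average" and is not confirmed by this excerpt. The names `partial_average_value`, `td_target`, `weight`, and `gamma` are hypothetical, and the weight is passed in as a fixed number because the excerpt does not say how DPAV adapts it.

```python
def partial_average_value(q_values, weight):
    """Blend the largest and smallest predicted action values.

    ASSUMPTION: a convex combination; `weight` is a hypothetical knob in [0, 1].
    weight = 1.0 recovers the plain max used by a standard DQN target.
    """
    if not q_values:
        raise ValueError("q_values must be non-empty")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * max(q_values) + (1.0 - weight) * min(q_values)


def td_target(reward, next_q_values, weight, gamma=0.99, done=False):
    """One-step bootstrapped target using the blended value in place of the max."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    return reward + gamma * partial_average_value(next_q_values, weight)


if __name__ == "__main__":
    next_q = [1.0, 3.0, 2.0]
    print(partial_average_value(next_q, 1.0))  # plain max
    print(partial_average_value(next_q, 0.5))  # pulled toward the min
    print(td_target(1.0, next_q, 0.5, gamma=0.5))
```

Any weight below 1 pulls the target below the plain maximum, which is the direction needed to counter an estimate that is biased upward. In the paper the weights are described as dynamically adaptive and problem-dependent, so the constant used here is only a stand-in.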
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling
