Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation
Xinting Huang, Jianzhong Qi, Yu Sun, Rui Zhang

TL;DR
This paper introduces a semi-supervised reward learning method for dialogue policy optimization that models dialogue progress with a dynamics model, reducing the need for extensive expert annotations and improving performance on MultiWOZ.
Contribution
It proposes a novel semi-supervised reward learning approach using a dynamics model and action embeddings, enabling effective dialogue policy training with less supervision.
Findings
Outperforms baseline methods on the MultiWOZ dataset
Effective in learning from limited expert annotations
Improves dialogue policy generalization
Abstract
In task-oriented dialogue systems, dialogue policy optimization often receives feedback only upon task completion. This is insufficient for training intermediate dialogue turns, since supervision signals (or rewards) are provided only at the end of dialogues. To address this issue, reward learning has been introduced to learn from state-action pairs of an optimal policy and provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor-intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model as the reward function, which models dialogue progress (i.e., state-action sequences) based on expert demonstrations, either with or without annotations. The dynamics model computes rewards by predicting…
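The sparse-feedback problem described in the abstract can be illustrated with a toy sketch. This is not the paper's dynamics model: the function names, the discount factor, and the per-turn scores below are all hypothetical and chosen only to show how a reward given solely at the end of a dialogue reaches earlier turns in discounted form, whereas a turn-by-turn reward gives each turn its own signal.

```python
from typing import List


def terminal_only_rewards(num_turns: int, success: bool) -> List[float]:
    """Hypothetical sparse reward: zero on every turn except the last."""
    rewards = [0.0] * num_turns
    rewards[-1] = 1.0 if success else -1.0
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.5) -> List[float]:
    """Return-to-go per turn: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Sparse case: only the final turn of a 4-turn dialogue is rewarded.
sparse = terminal_only_rewards(num_turns=4, success=True)
sparse_returns = discounted_returns(sparse, gamma=0.5)

# Dense case: made-up per-turn scores standing in for a learned
# turn-level reward function (assumed values, not from the paper).
dense = [0.5, 0.0, 0.5, 1.0]
dense_returns = discounted_returns(dense, gamma=0.5)

print("sparse rewards:", sparse, "returns:", sparse_returns)
print("dense rewards: ", dense, "returns:", dense_returns)
```

In the sparse case the first turn's learning signal comes only from the final outcome, attenuated by the discount factor at every step. A learned turn-level reward of the kind the abstract motivates would replace the made-up `dense` list.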
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Context-Aware Activity Recognition Systems
