Constrained Policy Optimization for Controlled Self-Learning in Conversational AI Systems
Mohammad Kachuee, Sungjin Lee

TL;DR
This paper introduces a scalable constrained policy optimization framework for conversational AI that balances user satisfaction improvements with domain-specific safety constraints, using a novel meta-gradient approach.
Contribution
It proposes a new meta-gradient learning method for adaptive constraint satisfaction in domain-specific conversational AI policy optimization.
Findings
Achieves a better balance between policy value and constraint satisfaction.
Demonstrates effectiveness on real-world conversational AI data.
Outperforms existing methods in constraint adherence and user satisfaction.
Abstract
Recently, self-learning methods based on user satisfaction metrics and contextual bandits have shown promising results in enabling consistent improvements in conversational AI systems. However, directly targeting such metrics with off-policy bandit learning objectives often increases the risk of making abrupt policy changes that break the current user experience. In this study, we introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints. For example, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning approach that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta…
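The idea of per-domain constraints with adaptively adjusted violation penalties can be illustrated with a small sketch. This is not the paper's meta-gradient method, whose details are not given in the abstract above; it swaps in a plain projected dual-ascent update on the penalty weights as a simpler stand-in. The function names, the deviation-rate constraint, the budgets, and the step size are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: a dual-ascent stand-in, NOT the paper's meta-gradient method.
# All names, budgets, and rates below are hypothetical.

def update_penalties(penalties, deviation_rates, budgets, step_size=0.5):
    """Take one projected dual-ascent step on the per-domain penalty weights.

    A domain whose policy deviation rate exceeds its budget gets a larger
    penalty; a domain within budget has its penalty shrunk toward zero.
    """
    updated = {}
    for domain, budget in budgets.items():
        violation = deviation_rates[domain] - budget
        # Penalty weights are clipped at zero so they never reward deviation.
        updated[domain] = max(0.0, penalties.get(domain, 0.0) + step_size * violation)
    return updated


def penalized_objective(value_estimate, penalties, deviation_rates, budgets):
    """Estimated policy value minus the weighted constraint violations."""
    penalty = sum(
        penalties[d] * max(0.0, deviation_rates[d] - budgets[d]) for d in budgets
    )
    return value_estimate - penalty


# Hypothetical setup: a tight budget for shopping, a looser one for music.
budgets = {"shopping": 0.01, "music": 0.10}
deviation_rates = {"shopping": 0.05, "music": 0.04}  # assumed measurements
penalties = {"shopping": 0.0, "music": 0.0}

penalties = update_penalties(penalties, deviation_rates, budgets)
objective = penalized_objective(1.0, penalties, deviation_rates, budgets)
```

In this sketch the shopping penalty grows because its assumed deviation rate exceeds its budget, while the music penalty stays at zero because music is within budget. Repeating the update alongside policy training would keep raising the penalty for any domain that stays in violation.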
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Bandit Algorithms Research · Recommender Systems and Techniques
Methods: Self-Learning
