Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback
Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

TL;DR
This paper introduces a multi-reward reinforcement learning framework for spoken dialogue systems. It optimizes several aspects of conversational quality (semantics, audio naturalness, and emotion) and is designed for models that generate responses incrementally.
Contribution
It presents the first multi-reward RLAIF approach for SDS, combining semantic, audio-quality, and emotion-consistency rewards and aligning utterance-level preferences with incremental, blockwise decoding, supported by a new dataset.
Findings
Multi-reward training improves semantic quality and audio naturalness.
Single-reward training improves only the metric it targets.
The framework supports incremental, blockwise response optimization.
Abstract
Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference…
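The idea of combining several rewards into one preference signal can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` class, the score ranges, the weights, and the weighted-sum aggregation are all assumptions made for illustration, and the paper may combine its rewards differently. The sketch only shows how a multi-reward score and a single semantic score can select different chosen/rejected pairs for preference learning.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled response with hypothetical per-aspect scores in [0, 1]."""
    response_id: str
    semantic: float       # assumed semantic-coherence score
    audio_quality: float  # assumed audio-naturalness score
    emotion: float        # assumed emotion-consistency score


# Assumed weights; the source does not state how rewards are weighted.
MULTI_REWARD = {"semantic": 0.5, "audio_quality": 0.3, "emotion": 0.2}
SEMANTIC_ONLY = {"semantic": 1.0, "audio_quality": 0.0, "emotion": 0.0}


def combined_reward(c: Candidate, weights: dict) -> float:
    """Weighted sum of the per-aspect scores (an illustrative aggregation)."""
    return (
        weights["semantic"] * c.semantic
        + weights["audio_quality"] * c.audio_quality
        + weights["emotion"] * c.emotion
    )


def build_preference_pair(candidates: list, weights: dict) -> tuple:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(
        candidates, key=lambda c: combined_reward(c, weights), reverse=True
    )
    return ranked[0], ranked[-1]


candidates = [
    Candidate("a", semantic=0.9, audio_quality=0.4, emotion=0.5),
    Candidate("b", semantic=0.6, audio_quality=0.9, emotion=0.8),
    Candidate("c", semantic=0.3, audio_quality=0.5, emotion=0.4),
]

multi_chosen, multi_rejected = build_preference_pair(candidates, MULTI_REWARD)
sem_chosen, sem_rejected = build_preference_pair(candidates, SEMANTIC_ONLY)

print("multi-reward pair:", multi_chosen.response_id, multi_rejected.response_id)
print("semantic-only pair:", sem_chosen.response_id, sem_rejected.response_id)
```

With these made-up scores, the multi-reward weighting prefers candidate `b`, while the semantic-only weighting prefers candidate `a`. Pairs built this way could then feed a preference-optimization objective; how the paper applies them to blockwise decoding is not covered by this sketch.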
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Emotion and Mood Recognition
