Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback

Siddhant Arora; Jinchuan Tian; Jiatong Shi; Hayato Futami; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe

arXiv:2601.19063 · cs.CL · January 28, 2026

TL;DR

This paper introduces a multi-reward reinforcement learning framework for spoken dialogue systems that jointly optimizes several aspects of conversational quality (semantics, audio naturalness, and emotion) and is compatible with incremental, blockwise response generation.

Contribution

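It presents the first multi-reward RLAIF approach for spoken dialogue systems (SDS), combining semantic, audio-quality, and emotion-consistency rewards and using preference learning to align utterance-level preferences with incremental response generation, supported by a new dataset.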

Findings

01. Multi-reward training improves semantic quality and audio naturalness.

02. Single-reward training enhances only targeted metrics.

03. The framework supports incremental, blockwise response optimization.

Abstract

Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference…
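The multi-reward idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reward names, the equal weights, the weighted-average combination, and the best-versus-worst pairing rule are all assumptions made for illustration, and the abstract (truncated above) does not specify how the paper combines its rewards or builds its preference data. The sketch scores several candidate responses along three dimensions, merges the scores into one value, and selects a (chosen, rejected) pair of the kind a preference-learning objective would consume.

```python
from dataclasses import dataclass

# Hypothetical reward names and weights, chosen only for illustration;
# the paper's actual reward models and weighting are not given here.
WEIGHTS = {"semantic": 1.0, "audio_quality": 1.0, "emotion": 1.0}


@dataclass
class Candidate:
    response_id: str
    scores: dict  # per-reward scores, assumed to be normalized to [0, 1]


def combined_reward(candidate, weights=WEIGHTS):
    """Weighted average of the per-dimension reward scores."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * candidate.scores[name] for name in weights)
    return weighted_sum / total_weight


def build_preference_pair(candidates, weights=WEIGHTS, min_margin=0.0):
    """Return (chosen, rejected): the best and worst candidate by combined reward.

    Returns None when there are fewer than two candidates or when the
    reward gap between best and worst does not exceed min_margin.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: combined_reward(c, weights))
    rejected, chosen = ranked[0], ranked[-1]
    gap = combined_reward(chosen, weights) - combined_reward(rejected, weights)
    if gap <= min_margin:
        return None
    return chosen, rejected


# Usage with made-up scores for three candidate responses to the same turn.
candidates = [
    Candidate("a", {"semantic": 0.9, "audio_quality": 0.4, "emotion": 0.5}),
    Candidate("b", {"semantic": 0.7, "audio_quality": 0.8, "emotion": 0.9}),
    Candidate("c", {"semantic": 0.2, "audio_quality": 0.9, "emotion": 0.3}),
]
pair = build_preference_pair(candidates)
if pair is not None:
    chosen, rejected = pair
    print("chosen:", chosen.response_id, "rejected:", rejected.response_id)
```

In this toy example, candidate "a" has the highest semantic score but is not selected as the chosen response, because its audio-quality and emotion scores pull its combined reward below that of "b". A single semantic reward would have ranked "a" first, which is the kind of blind spot the abstract attributes to single-reward setups.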

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Emotion and Mood Recognition