TL;DR
SDiaReward is a reward model for spoken dialogue systems that scores full multi-turn speech episodes directly, jointly assessing modality (prosody and emotion) and colloquialness (scripted versus natural speech) in a single evaluator.
Contribution
The paper introduces SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a new collection of episode-level preference pairs, and establishes ESDR-Bench, a stratified benchmark for episode-level evaluation.
Findings
Achieves state-of-the-art preference accuracy.
Outperforms general-purpose audio LLMs.
Captures conversational expressiveness beyond superficial cues.
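For readers unfamiliar with the metric, preference accuracy is commonly computed as the fraction of preference pairs in which the reward model scores the preferred item above the rejected one. The sketch below assumes that definition; the function name, the tie handling (ties count as errors), and the example scores are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple


def preference_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) score pairs ranked correctly.

    Assumption: a pair counts as correct only when the preferred episode
    receives a strictly higher score, so ties are treated as errors.
    """
    if not score_pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical reward scores, for illustration only (not results from the paper).
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
example_accuracy = preference_accuracy(example_pairs)
```

Two of the four hypothetical pairs are ranked correctly here (the second is inverted and the fourth is a tie), so the function returns 0.5.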
Abstract
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that…
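The abstract says the model is optimized with pairwise preference supervision but, as truncated here, does not state the objective. A common choice for reward models is the Bradley-Terry loss, -log sigmoid(r_chosen - r_rejected). The sketch below assumes that loss purely for illustration; the function name and scalar-score interface are hypothetical, and the paper's actual objective may differ.

```python
import math


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: each episode in a preference pair has already been mapped
    to one scalar reward. This is a generic objective, not necessarily the
    one used to train SDiaReward.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is log(2) when both episodes receive the same score, approaches zero as the preferred episode's score pulls ahead, and grows roughly linearly with the margin when the ranking is inverted.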
