WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Shengpeng Ji; Tianle Liang; Yangzhuo Li; Jialong Zuo; Minghui Fang; Jinzheng He; Yifu Chen; Zhengqing Liu; Ziyue Jiang; Xize Cheng; Siqi Zheng; Jin Xu; Junyang Lin; Zhou Zhao

arXiv:2505.09558·eess.AS·September 24, 2025

TL;DR

WavReward introduces an audio-based reward evaluation model for spoken dialogue systems, effectively assessing conversational quality and outperforming previous models in accuracy and user preference metrics.

Contribution

The paper presents WavReward, a novel audio language model-based reward evaluator for spoken dialogue, along with a new dataset ChatReward-30K for training and evaluation.

Findings

01

WavReward achieves 91.5% accuracy, surpassing previous models.

02

In subjective A/B testing, WavReward leads competing evaluators by a margin of 83%.

03

Ablation studies confirm the importance of each component.

Abstract

End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. Well-Motivated Problem: The paper effectively highlights a critical, underexplored problem: the lack of evaluation methods that can assess the full spectrum of spoken dialogue models' capabilities, particularly the paralinguistic and emotional quotient (EQ), without converting speech to text. The motivation is clear and strong.
2. Comprehensive Solution (Model & Dataset): The work is holistic, proposing both a novel model (WavReward) and a supporting dataset (ChatReward-30K). This two-pronge

Weaknesses

1. Dataset Construction Details: While the dataset construction process is described in stages, more details would be beneficial. For example, how were the "human experts" selected and calibrated to ensure scoring consistency? The prompt templates in the appendix are helpful, but a more detailed discussion of the challenges in generating high-quality, diverse implicit dialogues would be valuable.
2. Computational Cost and Scalability: The computational cost of WavReward's training and inference

Reviewer 02 · Rating 8 · Confidence 5

Strengths

- How to scientifically, comprehensively, and efficiently evaluate speech dialogue models is a critical issue that demands urgent attention. This paper precisely addresses this pain point, making its research highly timely and significant.
- The ChatReward-30K dataset represents a significant contribution. The authors provide a detailed account of its construction process, covering multiple dimensions including content, acoustics, and implicit/explicit aspects. This fills a gap in the field by o

Weaknesses

- The ChatReward-30K dataset is primarily synthesized using TTS models. This process may introduce synthetic biases or artifacts (such as unnatural prosody), raising questions about its ability to generalize to real human conversations.
- The generated data is filtered by WER and SER. However, no systematic filtering was conducted for accents, pitch, and other factors. Consequently, samples with poor synthesis quality—such as those generated by TTS models that deviate from the specified accent

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. It is an interesting and practical idea to train a reward model that evaluates generated speech based on the dialogue context.
2. The model outperforms existing baselines and SFT results. Experiments on RealDialogue demonstrate that the trained model not only fits well to the benchmark but also generalizes effectively to real-world conversations.

Weaknesses

I think the paper is well-motivated and shows promising potential.
1. Since the thinking process lacks explicit ground truth and is learned through reinforcement learning, it is crucial to examine whether the reasoning process is meaningful and whether it provides useful insights into how the generated speech aligns with dialogue context. Adding human evaluation results to validate this aspect would greatly strengthen the paper.
2. There is a large body of previous work (e.g., [1][2]) that emp

Code & Models

Repositories

jishengpeng/wavreward · Official

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multi-Agent Systems and Negotiation

Methods: Softmax · Attention Is All You Need