GSRM: Generative Speech Reward Model for Speech RLHF

Maohao Shen; Tejas Jayashankar; Osama Hanna; Naoyuki Kanda; Yancheng Wang; Kateřina Žmolíková; Ruiming Xie; Niko Moritz; Anfeng Xu; Yashesh Gaur; Gregory Wornell; Qing He; Jilong Wu

arXiv:2602.13891·cs.SD·February 17, 2026

Open Access

TL;DR

This paper introduces GSRM, a generative reward model that evaluates speech naturalness with interpretable reasoning, outperforms existing naturalness predictors, and can be used to improve the quality of generated speech.

Contribution

The paper proposes GSRM, a reasoning-centric, interpretable reward model for speech naturalness, trained with large-scale human feedback and benchmarked against existing methods.

Findings

01

GSRM outperforms existing speech naturalness predictors.

02

GSRM achieves model-human correlation approaching inter-rater consistency.

03

GSRM improves the naturalness of speech LLM generations when used as a verifier in RLHF.
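
As an illustration of what comparing model-human correlation with inter-rater consistency could look like, here is a minimal sketch in plain Python. It is not the paper's evaluation code: the choice of Spearman rank correlation, the function names, and the ratings are all assumptions made up for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical naturalness ratings (1-5) for the same five utterances.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 1, 3, 5, 4]
model_scores = [1.2, 2.9, 2.4, 4.1, 4.8]

inter_rater = spearman(rater_a, rater_b)
model_human = spearman(model_scores, rater_a)
print(f"inter-rater: {inter_rater:.3f}, model-human: {model_human:.3f}")
```

Under these assumptions, a model-human value close to the inter-rater value would correspond to the kind of result finding 02 describes; the numbers printed here come from invented data and say nothing about GSRM itself.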

Abstract

Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality requires a reliable evaluator of speech naturalness. However, existing naturalness evaluators typically regress raw audio to scalar scores, which offers limited interpretability, and they fail to generalize to speech across different taxonomies. Inspired by recent advances in generative reward modeling, we propose the Generative Speech Reward Model (GSRM), a reasoning-centric reward model tailored for speech. GSRM is trained to decompose speech naturalness evaluation into an interpretable acoustic feature extraction stage followed by feature-grounded chain-of-thought reasoning, enabling…

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Face recognition and analysis