GSRM: Generative Speech Reward Model for Speech RLHF
Maohao Shen, Tejas Jayashankar, Osama Hanna, Naoyuki Kanda, Yancheng Wang, Kateřina Žmolíková, Ruiming Xie, Niko Moritz, Anfeng Xu, Yashesh Gaur, Gregory Wornell, Qing He, Jilong Wu

TL;DR
This paper introduces GSRM, a novel generative reward model for speech naturalness evaluation that offers interpretability and improves upon existing predictors, enhancing speech synthesis quality.
Contribution
The paper proposes GSRM, a reasoning-centric, interpretable reward model for speech naturalness, trained with large-scale human feedback and benchmarked against existing methods.
Findings
GSRM outperforms existing speech naturalness predictors.
GSRM achieves model-human correlation approaching inter-rater consistency.
GSRM improves the naturalness of speech LLM generation when used as the verifier in RLHF.
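As a rough illustration of the second finding, the sketch below compares a model-human correlation with an inter-rater consistency figure. The summary does not say which correlation statistic the paper uses or how inter-rater consistency is defined, so everything here is an assumption: Pearson correlation, mean pairwise correlation between raters, the helper names, and the toy scores are all hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def inter_rater_consistency(ratings):
    """Mean pairwise correlation across human raters (one list per rater).

    Assumed definition; the paper may use a different one.
    """
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def model_human_correlation(model_scores, ratings):
    """Correlation between model scores and the per-clip mean human rating."""
    mean_human = [sum(col) / len(col) for col in zip(*ratings)]
    return pearson(model_scores, mean_human)


# Made-up toy data: 3 hypothetical raters scoring 5 clips on a 1-5 scale.
human_ratings = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 4],
    [1, 3, 3, 4, 5],
]
# Made-up scores from a hypothetical naturalness predictor for the same clips.
model_scores = [1.2, 2.4, 2.9, 4.3, 4.6]

print("inter-rater consistency:", inter_rater_consistency(human_ratings))
print("model-human correlation:", model_human_correlation(model_scores, human_ratings))
```

Under these assumptions, the inter-rater figure serves as a reference point for how well the human raters agree with each other, and a model whose correlation with the averaged human rating comes close to that figure is performing about as consistently as an individual rater.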
Abstract
Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality requires a reliable evaluator of speech naturalness. However, existing naturalness evaluators typically regress raw audio to scalar scores, offering limited interpretability of the evaluation, and moreover fail to generalize to speech across different taxonomies. Inspired by recent advances in generative reward modeling, we propose the Generative Speech Reward Model (GSRM), a reasoning-centric reward model tailored for speech. GSRM is trained to decompose speech naturalness evaluation into an interpretable acoustic feature extraction stage followed by feature-grounded chain-of-thought reasoning, enabling…
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Face recognition and analysis
