Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability
Yong Ren, Jingbei Li, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang

TL;DR
This paper introduces Mean Continuation Log-Probability (MCLP), a metric and reward signal for evaluating and improving expressive role-play TTS with Large Audio Language Models, targeting stylistic consistency and closer adherence to role-play instructions.
Contribution
We propose MCLP as an objective metric and reinforcement learning reward to enhance stylistic consistency in LALM-based Role-Play TTS, validated on a new annotated dataset.
Findings
MCLP effectively quantifies stylistic consistency.
Reinforcement learning with MCLP improves style alignment.
Our method outperforms baseline models on objective and subjective metrics.
Abstract
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Critically, we leverage the In-Context Learning capability of pre-trained LALMs to formulate MCLP via a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech.…
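The abstract describes MCLP only at a high level, so the following is a minimal sketch of the averaging step the name suggests, not the authors' implementation. It assumes a scoring model that returns, for each position of the ground-truth speech token sequence, a probability distribution over a token vocabulary, given a context that already contains the generated speech. The function names, the toy distributions, and the token-level formulation are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def continuation_log_probs(
    distributions: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> List[float]:
    """Return the log-probability assigned to each target token.

    ASSUMPTION: distributions[t] is the probability vector a (hypothetical)
    LALM produces at step t of the continuation, conditioned on the context
    (for example, role-play instructions plus the generated speech) and on
    the earlier ground-truth tokens.
    """
    if len(distributions) != len(target_ids):
        raise ValueError("need one distribution per target token")
    return [math.log(dist[tok]) for dist, tok in zip(distributions, target_ids)]


def mean_continuation_log_prob(token_log_probs: Sequence[float]) -> float:
    """Average the per-token log-probabilities of the continuation.

    Dividing by the length keeps scores comparable across continuations
    of different lengths.
    """
    if not token_log_probs:
        raise ValueError("continuation must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)


# Toy example with a two-token vocabulary (values are made up).
toy_distributions = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
toy_score = mean_continuation_log_prob(
    continuation_log_probs(toy_distributions, toy_targets)
)
```

Under this reading, a higher (less negative) score means the ground-truth speech is a more likely continuation of the generated speech, which is how the abstract frames stylistic consistency. The same scalar could serve as a reward, but how it is scaled or combined with other terms is not stated in this excerpt.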
Taxonomy
Topics: Face recognition and analysis · Authorship Attribution and Profiling · Speech and dialogue systems
