StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
Haishu Zhao, Aokai Hao, Yuan Ge, Zhenqiang Hong, Tong Xiao, Jingbo Zhu

TL;DR
StyleBench is a comprehensive benchmark designed to evaluate speech language models' ability to control speaking style intensity across multiple dimensions in conversational settings, highlighting current performance gaps.
Contribution
This paper introduces StyleBench, the first systematic benchmark for assessing style intensity control in speech language models during multi-turn dialogues.
Findings
Results reveal significant gaps in style intensity control between leading SLMs and omni language models (OLMs).
Performance varies across emotion, speed, volume, and pitch dimensions.
Analysis suggests potential directions for improving style control in future models.
Abstract
Speech language models (SLMs) have significantly extended the interactive capability of text-based large language models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs can interpret and control speaking style intensity from user prompts during a dialogue. However, there remains a lack of systematic benchmarks that quantify and evaluate style intensity control in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating style intensity control across four dimensions: emotion, speed, volume, and pitch. Our results reveal performance gaps between leading SLMs and omni language models (OLMs), and suggest underlying reasons and promising approaches for future exploration.
Taxonomy
Topics: Emotion and Mood Recognition · Mental Health via Writing · Speech and dialogue systems
