VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

Jiacheng Xu; Heting Gao; Liufei Xie; Zhenchuan Yang; Lijiang Li; Yiting Chen; Bin Zhang; Meng Chen; Chaoyu Fu; Weifeng Zhao; Wenjiang Zhou

arXiv:2605.06765·cs.CL·May 11, 2026

TL;DR

VITA-QinYu is an expressive end-to-end spoken language model capable of role-playing and singing. Trained on 15.8K hours of synthesized data, it outperforms existing models in expressiveness and conversational fluency.

Contribution

The paper introduces VITA-QinYu, presented as the first end-to-end (E2E) spoken language model (SLM) to support both role-playing and singing. It combines a hybrid speech-text approach with a large synthetic dataset for training.

Findings

01. Outperforms peer SLMs by 7% on role-playing benchmarks.

02. Achieves 0.13 MOS points higher on singing quality.

03. Surpasses prior models in conversational accuracy and fluency.

Abstract

Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7…
