TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

Xi Wang; Jie Wang; Xingchen Song; Baijun Song; Jingran Xie; Jiahe Shao; Zijian Lin; Di Wu; Meng Meng; Jian Luan; Zhiyong Wu

arXiv:2604.22225·cs.CL·April 27, 2026

TL;DR

TTS-PRISM is an interpretable, multi-dimensional diagnostic framework for Mandarin TTS that diagnoses fine-grained acoustic artifacts and explains perceptual issues, outperforming generalist models.

Contribution

It introduces a 12-dimensional schema, a targeted synthesis pipeline, and schema-driven instruction tuning for detailed TTS diagnostics.

Findings

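1. On a 1,600-sample Gold Test Set, TTS-PRISM aligns more closely with human judgments than generalist models.

2. Profiling six TTS paradigms yields diagnostic flags that reveal fine-grained capability differences between them.

3. Code and checkpoints are open-source and available on GitHub.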

Abstract

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and…
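The abstract reports human alignment on the Gold Test Set but, as excerpted here, does not name the metric. As a minimal sketch only, the block below assumes alignment is measured per schema dimension as Spearman rank correlation between model scores and human ratings. The function names and the dimension labels in the usage note are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def alignment_by_dimension(model_scores, human_scores):
    """Map each dimension name to the model-vs-human Spearman correlation.

    Both arguments are dicts of dimension name -> list of per-sample scores,
    with samples in the same order (an assumed data layout).
    """
    return {
        dim: spearman(model_scores[dim], human_scores[dim])
        for dim in human_scores
    }
```

For example, `alignment_by_dimension({"stability": [1, 2, 3, 4]}, {"stability": [2, 3, 4, 5]})` gives a correlation of 1 for the hypothetical `stability` dimension, since the two score lists rank the samples identically. Reporting one value per dimension, rather than a single pooled number, matches the abstract's point that monolithic metrics hide fine-grained differences.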

Code & Models

Repositories

xiaomi-research/tts-prism
GitHub

Models

xiaomi-research/TTS-PRISM-7B
Hugging Face model · 66 downloads
