TL;DR
TTS-PRISM is an interpretable, multi-dimensional diagnostic framework for Mandarin TTS that diagnoses fine-grained acoustic artifacts and explains perceptual failures. On a 1,600-sample Gold Test Set, it aligns with human judgments better than generalist models.
Contribution
It introduces a 12-dimensional schema, a targeted synthesis pipeline, and schema-driven instruction tuning for detailed TTS diagnostics.
Findings
On a 1,600-sample Gold Test Set, TTS-PRISM outperforms generalist models in human alignment.
Profiling six TTS paradigms yields intuitive diagnostic flags that reveal fine-grained capability differences.
The code and checkpoints are open-source and available on GitHub.
Abstract
While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and…
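To make "human alignment" concrete, the sketch below compares model scores with human ratings one dimension at a time. It is an illustration only: the abstract does not say which agreement statistic the authors use, so Spearman rank correlation is an assumption here, and the dimension names and scores are hypothetical toy values, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def per_dimension_alignment(human, model):
    """Map each schema dimension to the human/model rank correlation."""
    return {dim: spearman(human[dim], model[dim]) for dim in human}


# Hypothetical toy scores for two placeholder dimensions (not from the paper).
human = {"dimension_a": [1, 2, 3, 4, 5], "dimension_b": [5, 3, 4, 1, 2]}
model = {"dimension_a": [1, 2, 3, 4, 5], "dimension_b": [1, 2, 3, 4, 5]}

alignment = per_dimension_alignment(human, model)
for dim, rho in alignment.items():
    print(f"{dim}: {rho:.2f}")
```

Reporting agreement per dimension rather than as one pooled number mirrors the motivation in the abstract: a single monolithic score can hide which dimension a model gets wrong.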
