Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

Junchuan Zhao; Minh Duc Vu; Ye Wang

arXiv:2603.05373 · cs.SD · April 14, 2026

TL;DR

This paper introduces MSpoof-TTS, a training-free hierarchical decoding framework that uses multi-resolution spoof detection to improve zero-shot discrete speech synthesis quality and robustness.

Contribution

The paper proposes a multi-resolution spoof detection method and integrates it into hierarchical decoding, improving speech synthesis without retraining or preference optimization.

Findings

01. Improved perceptual realism in zero-shot speech synthesis.

02. Effective detection of unnatural patterns at multiple temporal scales.

03. Enhanced robustness against token-level artifacts.

Abstract

Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and…
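The pruning and re-ranking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the window sizes, the per-stage keep counts, the additive score combination, and the toy `repeat_ratio` detector are all assumptions made for illustration. In the paper the detectors are learned models over codec tokens, and the abstract does not say how their scores are combined.

```python
from typing import Callable, Dict, List, Sequence

Tokens = Sequence[int]
# Hypothetical detector interface: maps a token window to a spoof score
# in [0, 1], where higher means "more spoof-like".
Detector = Callable[[Tokens], float]


def repeat_ratio(window: Tokens) -> float:
    """Toy stand-in for a learned detector: fraction of adjacent repeated tokens."""
    if len(window) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(window, window[1:]))
    return repeats / (len(window) - 1)


def sequence_spoof_score(tokens: Tokens, window: int, detector: Detector) -> float:
    """Mean detector score over non-overlapping windows at one resolution."""
    scores = [detector(tokens[i:i + window]) for i in range(0, len(tokens), window)]
    return sum(scores) / len(scores) if scores else 0.0


def hierarchical_select(
    candidates: List[Tokens],
    detectors: Dict[int, Detector],   # assumed mapping: window size -> detector
    keep_per_stage: List[int],        # assumed: survivors kept after each stage
) -> List[Tokens]:
    """Prune candidates from the finest to the coarsest resolution.

    Scores accumulate across stages, so the final survivors come out
    re-ranked by their summed spoof score (lowest first).
    """
    totals = [0.0] * len(candidates)
    alive = list(range(len(candidates)))
    for window, keep in zip(sorted(detectors), keep_per_stage):
        for i in alive:
            totals[i] += sequence_spoof_score(candidates[i], window, detectors[window])
        alive = sorted(alive, key=lambda i: totals[i])[:keep]
    return [candidates[i] for i in alive]


# Three hypothetical candidate token sequences for the same utterance.
clean = [1, 2, 3, 4, 5, 6, 7, 8]
stuck = [1, 1, 1, 1, 2, 2, 2, 2]    # heavy repetition, caught at the finest scale
subtle = [1, 2, 2, 3, 4, 5, 5, 6]   # repeats that straddle the 2-token windows

detectors = {2: repeat_ratio, 4: repeat_ratio, 8: repeat_ratio}
ranked = hierarchical_select([clean, stuck, subtle], detectors, keep_per_stage=[2, 2, 2])
print(ranked)
```

In this toy setup the `subtle` candidate scores zero at the 2-token resolution because its repeats fall across window boundaries, and only the coarser windows penalize it. That is the kind of gap that scoring at several temporal granularities is meant to cover.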

Code & Models

Repositories

https://danny-nus.github.io/MSpoofTTS.github.io