TL;DR
This paper introduces MSpoof-TTS, a training-free hierarchical decoding framework that uses multi-resolution spoof detection to improve zero-shot discrete speech synthesis quality and robustness.
Contribution
It proposes a multi-resolution, token-based spoof detection method and integrates it into hierarchical decoding, improving zero-shot speech synthesis without retraining, preference optimization, or any change to model parameters.
Findings
Improved perceptual realism in zero-shot speech synthesis.
Effective detection of unnatural patterns at multiple temporal scales.
Enhanced robustness against token-level artifacts.
Abstract
Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and…
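The pruning and re-ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the detector architecture, the number of resolutions, the scoring rule, or the pruning schedule, so everything below is an assumption made for illustration: `window_score`, `hierarchical_rerank`, and `toy_detector` are hypothetical names, the detector is a stand-in heuristic rather than a trained spoof classifier, and candidates are plain lists of integer codec tokens.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = Sequence[int]
# Hypothetical detector interface: maps a token chunk to a score,
# where a higher score means "more natural". Not the paper's model.
Detector = Callable[[Tokens], float]


def window_score(tokens: Tokens, window: int, detector: Detector) -> float:
    """Average the detector score over non-overlapping chunks of `window` tokens."""
    chunks = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    if not chunks:
        return 0.0
    return sum(detector(chunk) for chunk in chunks) / len(chunks)


def hierarchical_rerank(
    candidates: List[Tokens],
    detector: Detector,
    windows: Sequence[int],
    keep: Sequence[int],
) -> List[Tuple[float, List[int]]]:
    """Score candidates one resolution at a time, pruning after each stage.

    `windows[i]` is the chunk length used at stage i and `keep[i]` is how many
    candidates survive that stage (both are assumed hyperparameters).
    Returns the survivors sorted by accumulated score, best first.
    """
    scored = [(0.0, list(c)) for c in candidates]
    for window, k in zip(windows, keep):
        # Add this resolution's score to each candidate's running total.
        scored = [(total + window_score(c, window, detector), c) for total, c in scored]
        # Re-rank by accumulated score, then prune to the top k.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        scored = scored[:k]
    return scored


def toy_detector(chunk: Tokens) -> float:
    """Stand-in heuristic: penalize immediately repeated tokens in a chunk."""
    if len(chunk) < 2:
        return 1.0
    repeats = sum(a == b for a, b in zip(chunk, chunk[1:]))
    return 1.0 - repeats / (len(chunk) - 1)


candidates = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # degenerate repetition
    [1, 2, 3, 4, 5, 6, 7, 8],  # no repeats
    [1, 2, 2, 3, 4, 4, 5, 6],  # occasional repeats
]
# Stage 1 scores 2-token chunks and keeps 2 candidates;
# stage 2 scores 4-token chunks and keeps 1.
best = hierarchical_rerank(candidates, toy_detector, windows=(2, 4), keep=(2, 1))
print(best[0][1])
```

In this sketch the language model itself is never touched: the detector only filters and orders hypotheses that were already sampled, which is the sense in which such guidance is training-free. A real system would replace `toy_detector` with trained spoof detectors, possibly one per resolution.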
