Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs
Ryan Langman, Ante Jukić, Kunal Dhawan, Nithin Rao Koluguri, Jason Li

TL;DR
This paper introduces a spectral codec that uses Finite Scalar Quantization (FSQ) to compress the mel-spectrogram. The resulting speech representation improves the performance of non-autoregressive speech synthesis models while maintaining high audio quality.
Contribution
The paper proposes a novel spectral codec that applies FSQ to mel-spectrogram compression, improving both the performance of non-autoregressive TTS models and their perceptual audio quality.
Findings
The spectral codec achieves perceptual quality comparable to existing audio codecs.
FSQ improves the efficiency of spectrogram compression.
Spectral speech representations improve TTS model performance.
Abstract
Historically, most speech models in machine learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternative speech representation for speech synthesis tasks such as text-to-speech (TTS). However, the data distribution produced by such codecs is too complex for some TTS models to predict, typically requiring large autoregressive models to get good quality. Most existing audio codecs use Residual Vector Quantization (RVQ) to compress and reconstruct the time-domain audio signal. We propose a spectral codec which uses Finite Scalar Quantization (FSQ) to compress the mel-spectrogram and reconstruct the time-domain audio signal. A study of objective audio quality metrics and subjective listening tests suggests that our spectral codec has comparable perceptual quality to equivalent audio codecs.…
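To make the quantization step mentioned in the abstract more concrete, the following is a minimal sketch of the general idea behind Finite Scalar Quantization: each latent dimension is bounded and then rounded to one of a small, fixed number of levels, so the codebook is implicit rather than learned. This is not the authors' implementation. The function names, the level configuration `[8, 5, 5, 5]`, and the simplified bounding (a plain `tanh` rescaled to the level range, without the offset handling used in some FSQ formulations) are all assumptions made for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent value to an integer bin in [0, levels[i] - 1].

    Hypothetical helper: bounds each value with tanh, then rescales
    the result from (-1, 1) to the bin range before rounding.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    indices = []
    for value, num_levels in zip(z, levels):
        bounded = math.tanh(value)  # squash into [-1, 1]
        indices.append(round((bounded + 1.0) / 2.0 * (num_levels - 1)))
    return indices


def fsq_dequantize(indices, levels):
    """Map integer bins back to evenly spaced values in [-1, 1]."""
    return [2.0 * idx / (n - 1) - 1.0 for idx, n in zip(indices, levels)]


def to_code(indices, levels):
    """Combine per-dimension bins into one token id (mixed-radix number)."""
    code = 0
    for idx, n in zip(indices, levels):
        code = code * n + idx
    return code


def codebook_size(levels):
    """Number of distinct tokens implied by the level configuration."""
    return math.prod(levels)


# Assumed example configuration: 4 latent dimensions, 8*5*5*5 = 1000 codes.
levels = [8, 5, 5, 5]
latent = [10.0, -10.0, 0.0, 0.3]
bins = fsq_quantize(latent, levels)
token = to_code(bins, levels)
reconstructed = fsq_dequantize(bins, levels)
```

In this sketch the token id is just the mixed-radix encoding of the per-dimension bins, which is why no codebook lookup or nearest-neighbour search is needed. A trainable version would also need a straight-through gradient estimator around the rounding step, which is omitted here.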
Taxonomy
Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Speech Recognition and Synthesis
