Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs

Ryan Langman; Ante Jukić; Kunal Dhawan; Nithin Rao Koluguri; Jason Li

arXiv:2406.05298·eess.AS·June 5, 2025

Open Access · 4 Models

TL;DR

This paper introduces a spectral codec that uses Finite Scalar Quantization (FSQ) to compress the mel-spectrogram and reconstruct the time-domain audio signal. The codec is reported to have perceptual quality comparable to equivalent audio codecs, and it is aimed at improving non-autoregressive speech synthesis.

Contribution

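The paper proposes a spectral codec that applies Finite Scalar Quantization (FSQ) to the mel-spectrogram, in contrast to most existing audio codecs, which apply Residual Vector Quantization (RVQ) to the time-domain signal. The goal is to improve the performance and perceptual audio quality of non-autoregressive TTS models.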

Findings

01. The spectral codec achieves perceptual quality comparable to that of existing audio codecs.

02. FSQ improves the efficiency of spectrogram compression.

03. Spectral speech representations improve the performance of non-autoregressive TTS models.

Abstract

Historically, most speech models in machine learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternative speech representation for speech synthesis tasks such as text-to-speech (TTS). However, the data distribution produced by such codecs is too complex for some TTS models to predict, typically requiring large autoregressive models to achieve good quality. Most existing audio codecs use Residual Vector Quantization (RVQ) to compress and reconstruct the time-domain audio signal. We propose a spectral codec which uses Finite Scalar Quantization (FSQ) to compress the mel-spectrogram and reconstruct the time-domain audio signal. A study of objective audio quality metrics and subjective listening tests suggests that our spectral codec has comparable perceptual quality to equivalent audio codecs.…
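
To make the FSQ step in the abstract concrete, the following is a minimal NumPy sketch of finite scalar quantization as it is generally described: each latent dimension is bounded, rounded to a small fixed set of integer levels, and the per-dimension levels are combined into a single token index. This is not the authors' implementation. The function name `fsq_quantize`, the level counts `[5, 5, 3]`, and the restriction to odd level counts (which avoids the half-step offset needed for even counts) are all assumptions made for illustration, and the straight-through gradient used during training is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize latents with a simplified FSQ (hypothetical helper).

    z:      float array of shape (..., d), e.g. encoder outputs per frame.
    levels: d odd integers, the number of quantization levels per dimension.
    Returns (codes, indices): integer codes per dimension, and one token
    index per vector in the implicit codebook of size prod(levels).
    """
    levels = np.asarray(levels, dtype=np.int64)
    # Assumption: odd level counts only, so the levels are symmetric around 0.
    assert np.all(levels % 2 == 1), "this sketch supports odd level counts only"
    half = (levels - 1) // 2

    # Bound each dimension to [-half, half], then round to the nearest integer.
    bounded = np.tanh(z) * half
    codes = np.round(bounded).astype(np.int64)

    # Shift codes to [0, L-1] and read them as digits of a mixed-radix number.
    digits = codes + half
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    indices = (digits * basis).sum(axis=-1)
    return codes, indices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(4, 3))      # 4 frames, 3 latent dimensions
    codes, indices = fsq_quantize(latents, [5, 5, 3])
    print(codes)
    print(indices)                          # each index lies in [0, 75)
```

Because the codebook is implicit in the rounding grid, there are no learned codebook vectors to look up, which is one commonly cited difference from RVQ.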

Taxonomy

Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Speech Recognition and Synthesis