Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee

TL;DR
This paper introduces the Low Frame-rate Speech Codec (LFSC), a neural speech codec that runs at 21.5 frames per second and 1.89 kbps. The low frame rate makes inference of LLM-based text-to-speech models around three times faster while keeping audio quality comparable to previous models.
Contribution
The paper presents a low frame-rate speech codec that speeds up speech LLM inference without sacrificing audio quality, using finite scalar quantization and adversarial training.
Findings
Achieves 1.89 kbps bitrate at 21.5 fps.
Makes inference of LLM-based text-to-speech models around three times faster.
Improves intelligibility while keeping audio quality comparable to previous models.
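To make the figures above concrete, the short sketch below works out what 1.89 kbps at 21.5 fps means per frame, and how many frames an autoregressive model has to step through for a given clip. The 75 fps comparison codec is a hypothetical value chosen for illustration, not a number from the paper, and the frame-count ratio it produces is not the measured speedup; it only shows why a lower frame rate shortens the token sequence.

```python
import math

# Figures stated in the findings above.
BITRATE_BPS = 1890.0    # 1.89 kbps
FRAME_RATE_FPS = 21.5   # codec frames per second

# Hypothetical higher frame-rate codec, assumed only for comparison.
BASELINE_FPS = 75.0


def bits_per_frame(bitrate_bps: float, frame_rate_fps: float) -> float:
    """Average number of bits the codec spends on each frame."""
    return bitrate_bps / frame_rate_fps


def frames_for(duration_s: float, frame_rate_fps: float) -> int:
    """Number of codec frames needed to cover a clip of the given length."""
    return math.ceil(duration_s * frame_rate_fps)


clip_seconds = 10.0
lfsc_frames = frames_for(clip_seconds, FRAME_RATE_FPS)
baseline_frames = frames_for(clip_seconds, BASELINE_FPS)

print(f"bits per frame: {bits_per_frame(BITRATE_BPS, FRAME_RATE_FPS):.1f}")
print(f"frames for {clip_seconds:.0f} s at 21.5 fps: {lfsc_frames}")
print(f"frames for {clip_seconds:.0f} s at 75 fps (assumed): {baseline_frames}")
print(f"frame-count ratio: {baseline_frames / lfsc_frames:.2f}")
```

Each frame corresponds to one autoregressive decoding step, so fewer frames per second of audio means fewer steps per utterance.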
Abstract
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
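The abstract names finite scalar quantization (FSQ) without describing it. As a rough, simplified sketch of the general idea: each latent dimension is squashed into a bounded range, rounded to one of a small number of integer levels, and the per-dimension levels are combined into a single token index. The level counts below are an assumption chosen for illustration and are not stated in this summary; the bounding function is a simplified variant, not necessarily the one the authors use.

```python
import math

# Illustrative level counts per latent dimension (an assumption).
LEVELS = (8, 7, 6, 6)


def fsq_digits(z, levels=LEVELS):
    """Round each latent value to an integer level in [0, levels[i] - 1]."""
    digits = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        # Even level counts need a half-step offset so rounding yields
        # exactly num_levels distinct integers.
        offset = 0.0 if num_levels % 2 == 1 else 0.5
        bounded = math.tanh(value) * half - offset
        digits.append(round(bounded) + num_levels // 2)
    return digits


def fsq_index(digits, levels=LEVELS):
    """Combine per-dimension digits into one token index (mixed radix)."""
    index, base = 0, 1
    for digit, num_levels in zip(digits, levels):
        index += digit * base
        base *= num_levels
    return index


codebook_size = math.prod(LEVELS)
latent = [0.3, -1.2, 2.0, -0.1]  # hypothetical encoder output for one frame
digits = fsq_digits(latent)
print(f"digits: {digits}, token index: {fsq_index(digits)}")
print(f"implicit codebook size: {codebook_size} "
      f"({math.log2(codebook_size):.2f} bits per code)")
```

Because the codebook is implied by the rounding grid rather than learned, there are no codebook vectors to train, which is the main practical difference from vector quantization.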
Code & Models
- 🤗 nvidia/low-frame-rate-speech-codec-22khz (model · 172 downloads · 19 likes)
- 🤗 nvidia/nemo-nano-codec-22khz-1.78kbps-12.5fps (model · 2.4k downloads · 10 likes)
- 🤗 nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps (model · 3.5k downloads · 10 likes)
- 🤗 nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps (model · 3.2k downloads · 16 likes)
Taxonomy
Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques
