NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg

TL;DR
NanoCodec is a novel low frame-rate audio codec that enables high-quality, ultra-fast speech LLM inference by significantly reducing autoregressive steps, outperforming existing codecs in quality and efficiency.
Contribution
We introduce NanoCodec, a new audio codec operating at 12.5 FPS that improves speech compression quality and efficiency for LLM applications, setting a new benchmark.
Findings
NanoCodec achieves high-quality compression at 12.5 FPS.
NanoCodec outperforms related codecs across various bitrates.
NanoCodec enables faster and more efficient speech LLM inference.
Abstract
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new…
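The link between frame rate and autoregressive steps described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes one autoregressive step per codec frame, and the 75 FPS baseline is a hypothetical comparison point, not a figure from the paper. Only the 12.5 FPS value comes from the text above.

```python
import math


def autoregressive_steps(duration_s: float, frame_rate_fps: float) -> int:
    """Frames needed to cover `duration_s` seconds of audio.

    Assumption: the speech LLM emits exactly one codec frame per
    autoregressive step, so frames and steps are the same count.
    """
    return math.ceil(duration_s * frame_rate_fps)


if __name__ == "__main__":
    duration_s = 10.0
    # 12.5 FPS is NanoCodec's frame rate per the abstract;
    # 75.0 FPS is a hypothetical high-frame-rate baseline for contrast.
    for fps in (75.0, 12.5):
        steps = autoregressive_steps(duration_s, fps)
        print(f"{fps:5.1f} FPS -> {steps} steps for {duration_s:.0f} s of audio")
```

Under these assumptions, step count scales linearly with frame rate, so a lower frame rate directly shortens the generation loop for the same audio duration.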
Code & Models
- 🤗 nvidia/nemo-nano-codec-22khz-1.78kbps-12.5fps · model · 2.4k dl · ♡ 10
- 🤗 nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps · model · 3.5k dl · ♡ 10
- 🤗 nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps · model · 3.2k dl · ♡ 16
- 🤗 Knehm/nemo-nano-codec-22khz-0.6kbps-12.5fps-ONNX · model · 4 dl
- 🤗 Knehm/nemo-nano-codec-22khz-1.78kbps-12.5fps-ONNX · model · 2 dl
- 🤗 Knehm/nemo-nano-codec-22khz-1.89kbps-21.5fps-ONNX · model · 27 dl
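The checkpoint names above appear to encode sample rate, bitrate, and frame rate. As a rough sketch, dividing bitrate by frame rate gives the bit budget each frame carries. This assumes the `kbps` and `fps` tokens in the names mean what they suggest and that 1 kbps is 1000 bits per second. The helper `bits_per_frame` and the pattern `NAME_PATTERN` are hypothetical names introduced here for illustration, not part of any NeMo API.

```python
import re

# Hypothetical pattern for names such as "...-22khz-1.78kbps-12.5fps".
NAME_PATTERN = re.compile(
    r"(?P<khz>[\d.]+)khz-(?P<kbps>[\d.]+)kbps-(?P<fps>[\d.]+)fps"
)


def bits_per_frame(model_name: str) -> float:
    """Bits carried by one codec frame, inferred from the model name.

    Assumption: 1 kbps = 1000 bits per second.
    """
    match = NAME_PATTERN.search(model_name)
    if match is None:
        raise ValueError(f"unrecognized model name: {model_name!r}")
    kbps = float(match.group("kbps"))
    fps = float(match.group("fps"))
    return kbps * 1000.0 / fps


if __name__ == "__main__":
    for name in (
        "nvidia/nemo-nano-codec-22khz-1.78kbps-12.5fps",
        "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps",
        "nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps",
    ):
        print(f"{name}: {bits_per_frame(name):.1f} bits/frame")
```

How those bits are split across codebooks is not stated on this page, so the sketch stops at the per-frame total.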
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
