ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

Rui-Chen Zheng; Hui-Peng Du; Xiao-Hang Jiang; Yang Ai; Zhen-Hua Ling

arXiv:2410.12359 · eess.AS · June 12, 2025

Open Access

TL;DR

This paper introduces ERVQ, a novel enhancement for neural audio codecs that mitigates codebook collapse through intra- and inter-codebook optimization, significantly improving audio quality and generalization.

Contribution

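ERVQ addresses codebook collapse in RVQ-based neural audio codecs by combining intra-codebook optimization (online clustering plus a code balancing loss) with inter-codebook optimization (reducing the similarity between successive quantizations).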

Findings

01. Achieves 100% codebook utilization in advanced neural codecs.

02. Significantly improves audio quality across models, rates, and sampling frequencies.

03. Enhances downstream speech synthesis and TTS performance.

Abstract

Current neural audio codecs typically use residual vector quantization (RVQ) to discretize speech signals. However, they often experience codebook collapse, which reduces the effective codebook size and leads to suboptimal performance. To address this problem, we introduce ERVQ, Enhanced Residual Vector Quantization, a novel enhancement strategy for the RVQ framework in neural audio codecs. ERVQ mitigates codebook collapse and boosts codec performance through both intra- and inter-codebook optimization. Intra-codebook optimization incorporates an online clustering strategy and a code balancing loss to ensure balanced and efficient codebook utilization. Inter-codebook optimization improves the diversity of quantized features by minimizing the similarity between successive quantizations. Our experiments show that ERVQ significantly enhances audio codec performance across different models,…
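
To make the quantities in the abstract concrete, here is a minimal NumPy sketch of plain residual vector quantization, together with two diagnostics: per-codebook utilization and the mean cosine similarity between the outputs of adjacent quantizer stages. It is an illustration only, not the paper's implementation. The function names, the random codebooks, and the use of cosine similarity as the similarity measure are all assumptions; the abstract does not give the exact form of the online clustering strategy, the code balancing loss, or the inter-codebook objective.

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-neighbour lookup: return (indices, quantized vectors)."""
    # Squared Euclidean distance between every input vector and every code.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def rvq(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous one."""
    residual = x
    indices, outputs = [], []
    for cb in codebooks:
        idx, q = quantize(residual, cb)
        indices.append(idx)
        outputs.append(q)
        residual = residual - q
    return indices, outputs

def utilization(idx, codebook_size):
    """Fraction of codebook entries selected at least once (1.0 = no collapse)."""
    return np.unique(idx).size / codebook_size

def successive_similarity(outputs, eps=1e-8):
    """Mean cosine similarity between outputs of adjacent stages.

    Assumption: cosine similarity stands in for the similarity measure;
    the abstract does not specify which one the paper uses.
    """
    sims = []
    for a, b in zip(outputs[:-1], outputs[1:]):
        num = (a * b).sum(axis=-1)
        den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
        sims.append((num / den).mean())
    return float(np.mean(sims))

# Hypothetical toy data: 256 feature vectors of dimension 8, three codebooks of 16 codes.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]

indices, outputs = rvq(x, codebooks)
usage = [utilization(i, 16) for i in indices]
sim = successive_similarity(outputs)
print("utilization per codebook:", usage)
print("mean similarity between successive quantizations:", sim)
```

In this framing, codebook collapse shows up as utilization well below 1.0 for one or more stages, and an inter-codebook objective of the kind the abstract describes would push the similarity value down during training.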

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing