ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech
Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharhi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe

TL;DR
ESPnet-Codec is an open-source platform for training and evaluating neural codecs across audio, music, and speech, providing comprehensive metrics and supporting diverse applications to advance audio generation research.
Contribution
The paper introduces ESPnet-Codec, a toolkit built on ESPnet for training and evaluating neural codecs, together with VERSA, a standalone evaluation toolkit. The pair is intended to enable fair comparisons of codecs across a broad range of downstream applications.
Findings
ESPnet-Codec can be integrated into six ESPnet tasks.
VERSA evaluates codec performance with more than 20 audio evaluation metrics.
A shared training and evaluation platform facilitates fair comparison across diverse applications.
Abstract
Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate…
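To make the idea of a fair, metric-based comparison concrete, here is a minimal sketch in plain Python. It is not VERSA's API and does not reproduce any metric implementation from the paper. The two metrics (SNR and scale-invariant SNR), the `toy_quantize` stand-in for a codec, and the `evaluate` helper are all assumptions chosen for illustration. The point is only that every system is scored against the same reference with the same set of metrics.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference and its reconstruction."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)


def si_snr_db(reference, estimate):
    """Scale-invariant SNR in dB (no mean removal, to keep the sketch short)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    energy = sum(r * r for r in reference)
    scale = dot / energy  # optimal gain that projects the estimate onto the reference
    target = [scale * r for r in reference]
    noise = sum((e - t) ** 2 for e, t in zip(estimate, target))
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(sum(t * t for t in target) / noise)


def toy_quantize(signal, levels):
    """Hypothetical stand-in for a codec: uniform quantization over [-1, 1]."""
    step = 2.0 / (levels - 1)
    return [round((x + 1.0) / step) * step - 1.0 for x in signal]


def evaluate(systems, reference, metrics):
    """Score every system with every metric against the same reference."""
    return {
        name: {metric: fn(reference, estimate) for metric, fn in metrics.items()}
        for name, estimate in systems.items()
    }


# One 440 Hz tone at 16 kHz, 0.1 s long, used as the shared test signal.
reference = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
systems = {
    "coarse": toy_quantize(reference, 5),
    "fine": toy_quantize(reference, 257),
}
metrics = {"snr_db": snr_db, "si_snr_db": si_snr_db}

report = evaluate(systems, reference, metrics)
for name, scores in report.items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
```

Keeping the metric table separate from the systems under test means a new metric or a new codec can be added without touching the other side, which is the property that makes cross-system numbers comparable.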
Taxonomy
Topics: Neural Networks and Applications
Methods: Dilated Convolution · Hierarchical Feature Fusion · Kaiming Initialization · Pointwise Convolution · Convolution · Efficient Spatial Pyramid · 1x1 Convolution · Parameterized ReLU · ESPNet
