Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations
Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper introduces a codec-based ASR pipeline trained on discrete speech representations. The pipeline outperforms Encodec at a similar bit-rate and surpasses strong self-supervised models on the 143-language ML-SUPERB benchmark, and the study also examines techniques for training efficiency and noise robustness.
Contribution
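The work presents a comprehensive analysis of building ASR systems on discrete speech codes, covering codec training choices (quantization schemes, time-domain vs. spectral feature encodings) and ASR training techniques. It proposes a codec ASR pipeline that surpasses state-of-the-art results with a smaller model and less data.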
Findings
Outperforms Encodec at a similar bit-rate.
Surpasses the state of the art on the 143-language ML-SUPERB benchmark.
Enhances noise robustness and training efficiency.
Abstract
Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis on building ASR systems with discrete codes. We investigate different methods for codec training such as quantization schemes and time-domain vs spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143 languages ML-SUPERB benchmark despite being smaller in size and…
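Because the abstract compares codecs "at similar bit-rate", it may help to see how the bit-rate of a discrete codec is usually computed: frames per second, times codebooks per frame, times bits per codebook index. The sketch below is illustrative only. The function name is hypothetical, and the example values (75 frames per second, 8 codebooks of 1024 entries) are the commonly cited Encodec 24 kHz configuration at 6 kbps, not figures taken from this page. The settings of the paper's own codec are not stated here.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bit-rate of a codec that emits `num_codebooks` indices per frame.

    Each index selects one of `codebook_size` entries, so it costs
    log2(codebook_size) bits.
    """
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


# Assumed example values (not from this page): 75 frames/s,
# 8 codebooks, 1024 entries per codebook.
print(codec_bitrate_bps(75, 8, 1024))  # 6000.0 bits per second
```

Under this formula, two codecs can reach a similar bit-rate with different trade-offs, for example fewer codebooks at a higher frame rate, which is why the comparison is stated at a matched bit-rate rather than at a matched number of codebooks.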
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Softmax · Attention Is All You Need
