Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Kunal Dhawan; Nithin Rao Koluguri; Ante Jukić; Ryan Langman; Jagadeesh Balam; Boris Ginsburg

arXiv:2407.03495 · eess.AS · September 26, 2024

Open Access

TL;DR

This paper introduces a codec-based ASR pipeline trained on discrete speech representations. The pipeline outperforms Encodec at a similar bit-rate and surpasses strong self-supervised models on the 143-language ML-SUPERB benchmark, and the study also examines techniques for improving training efficiency and noise robustness.

Contribution

The work presents a comprehensive analysis of discrete speech representations for ASR, proposing a new codec ASR pipeline that surpasses state-of-the-art results with less data and smaller models.

Findings

01. Outperforms Encodec at a similar bit-rate.

02. Surpasses the state of the art on the 143-language ML-SUPERB benchmark.

03. Enhances noise robustness and training efficiency.

Abstract

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis on building ASR systems with discrete codes. We investigate different methods for codec training such as quantization schemes and time-domain vs spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143 languages ML-SUPERB benchmark despite being smaller in size and…
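Since the abstract compares codecs "at similar bit-rate", it may help to see how the bit-rate of a discrete codec is usually computed: frames per second, times the number of code indices per frame, times the bits needed for one index. The sketch below is a minimal illustration only; the function name `codec_bitrate_bps` and the example configuration are assumptions chosen for the arithmetic and are not taken from the paper.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a codec that emits `num_codebooks` indices per frame.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    # Each index selects one of `codebook_size` entries, so it costs log2(size) bits.
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


# Assumed example configuration: 75 frames/s, 8 codebooks, 1024 entries each.
example_bps = codec_bitrate_bps(frame_rate_hz=75, num_codebooks=8, codebook_size=1024)
print(f"{example_bps / 1000:.1f} kbps")  # 75 * 8 * 10 bits = 6000 bps
```

Under this formula, two codecs with different frame rates or codebook layouts can be matched in bit-rate by adjusting the number of codebooks, which is one way a "similar bit-rate" comparison can be set up.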

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Softmax · Attention Is All You Need