Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

Shih-heng Wang; Jiatong Shi; Chien-yu Huang; Shinji Watanabe; Hung-yi Lee

arXiv:2411.18107·cs.SD·November 28, 2024

TL;DR

This paper introduces a novel fusion mechanism for discrete SSL speech representations and explores self-augmented discrete representations, significantly improving multilingual ASR performance while reducing computational costs.

Contribution

It presents a new fusion approach for discrete SSL representations and introduces self-augmented discrete representations, enhancing ASR accuracy and efficiency.

Findings

01 Up to 19% relative character error rate (CER) improvement on LibriSpeech.

02 Up to 24% relative CER improvement on ML-SUPERB.

03 Effective fusion of discrete representations boosts ASR performance.
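
For readers unfamiliar with the metric, a relative improvement is the error reduction divided by the baseline error. The short sketch below shows the arithmetic only; the function name and the CER values in it are hypothetical placeholders chosen for illustration, not results reported in the paper.

```python
def relative_improvement(baseline_cer: float, new_cer: float) -> float:
    """Return the relative error reduction of new_cer with respect to baseline_cer."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical values, for illustration only (not taken from the paper):
# a baseline CER of 10.0 reduced to 8.1 is a relative improvement of about 19%.
baseline, fused = 10.0, 8.1
print(f"{relative_improvement(baseline, fused):.0%}")
```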

Abstract

Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage demands. Discrete SSL representations, despite some performance degradation, reduce transmission and storage costs and shorten the input sequence through de-duplication and subword modeling. To boost the performance of discrete representations for ASR, we introduce a novel fusion mechanism that integrates two discrete representations. The fusion mechanism preserves all the benefits of discrete representations while enhancing the model's performance by integrating complementary information. Additionally, we explore "self-augmented" discrete representations, which apply transformations to a single continuous SSL representation, eliminating the fusion mechanism's…
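
The de-duplication step mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming that discrete units are integer cluster IDs emitted one per frame and that de-duplication means collapsing consecutive repeats; the function name and the example IDs are hypothetical, and this is not the authors' implementation.

```python
from itertools import groupby


def deduplicate(units: list[int]) -> list[int]:
    """Collapse each run of consecutive identical unit IDs into a single ID."""
    return [unit for unit, _run in groupby(units)]


# Hypothetical frame-level unit IDs, for illustration only.
frame_units = [12, 12, 12, 7, 7, 31, 12, 12]
print(deduplicate(frame_units))  # [12, 7, 31, 12]
```

Only adjacent repeats are merged, so a unit that recurs later in the sequence (12 in the example) is kept. The shortened sequence is what reduces input length for the downstream ASR model.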

Taxonomy

Topics: Speech Recognition and Synthesis