Cross-modal Alignment with Optimal Transport for CTC-based ASR

Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

arXiv:2309.13650·eess.AS·September 26, 2023

TL;DR

This paper introduces a novel optimal transport-based cross-modal alignment method to transfer linguistic knowledge from pretrained language models to CTC-based speech recognition, significantly improving accuracy.

Contribution

It proposes a new cross-modal alignment algorithm using optimal transport to transfer linguistic knowledge from PLMs to acoustic models in CTC-based ASR.

Findings

1. Achieved 3.96% CER on the AISHELL-1 dev set.

2. Improved CER by approximately 28% over the baseline.

3. Demonstrated effective cross-modal knowledge transfer.

Abstract

Connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, because of the token independence assumption in decoding, an external language model (LM) is required, which destroys its fast parallel decoding property. Several studies have proposed transferring linguistic knowledge from a pretrained LM (PLM) to CTC-based ASR. Because the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required to transfer the context-dependent linguistic knowledge from the PLM to the acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT, which is then utilized to transform a latent acoustic representation for…
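The alignment step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine-distance cost, the uniform marginals, the entropy-regularized Sinkhorn solver, the barycentric projection, and all function names and hyperparameters (`epsilon`, `n_iters`) are assumptions chosen for illustration. The sketch computes a coupling matrix between acoustic frames and text tokens, then uses it to map the acoustic sequence onto the token positions.

```python
import numpy as np


def cosine_cost(acoustic, text):
    """Pairwise cosine distance between acoustic frames (n, d) and text tokens (m, d)."""
    a = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return 1.0 - a @ t.T


def sinkhorn_coupling(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT coupling via Sinkhorn iterations.

    Assumption: uniform mass on both sides; the paper may weight them differently.
    """
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)  # mass on acoustic frames
    nu = np.full(m, 1.0 / m)  # mass on text tokens
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = mu / (kernel @ v)    # rescale rows toward mu
        v = nu / (kernel.T @ u)  # rescale columns toward nu
    return u[:, None] * kernel * v[None, :]


def transport_to_text_length(acoustic, coupling):
    """Barycentric projection: average acoustic frames into each token position."""
    column_mass = coupling.sum(axis=0, keepdims=True)  # shape (1, m)
    return (coupling / column_mass).T @ acoustic        # shape (m, d)


# Toy data standing in for encoder outputs and PLM token embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(12, 8))  # 12 acoustic frames, 8-dim features
text = rng.normal(size=(5, 8))       # 5 text tokens, 8-dim features

coupling = sinkhorn_coupling(cosine_cost(acoustic, text))
aligned = transport_to_text_length(acoustic, coupling)
```

Here `aligned` has one row per text token, so it can be compared position by position with the PLM features, for example in a distance-based training loss. A smaller `epsilon` gives a sharper coupling but can underflow in `np.exp`; a log-domain solver would be the usual remedy.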

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing