Cross-modal Alignment with Optimal Transport for CTC-based ASR
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

TL;DR
This paper introduces an optimal transport-based cross-modal alignment method that transfers linguistic knowledge from pretrained language models (PLMs) to CTC-based speech recognition, reaching 3.96% CER on the AISHELL-1 dev set.
Contribution
It proposes a new cross-modal alignment algorithm using optimal transport to transfer linguistic knowledge from PLMs to acoustic models in CTC-based ASR.
Findings
Achieved 3.96% CER on AISHELL-1 dev set.
Improved CER by approximately 28% over baseline.
Demonstrated effective cross-modal knowledge transfer.
Abstract
Connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required, which destroys its fast parallel decoding property. Several studies have proposed transferring linguistic knowledge from a pretrained LM (PLM) to CTC-based ASR. Since the PLM is built from text while the acoustic model is trained with speech, a cross-modal alignment is required in order to transfer the context-dependent linguistic knowledge from the PLM to the acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT, which is then utilized to transform a latent acoustic representation for…
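The alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine cost, the uniform marginals, the entropic (Sinkhorn) solver, the regularization value eps, and all function and variable names are assumptions chosen for illustration, and the random arrays merely stand in for acoustic encoder outputs and PLM token embeddings.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (T x d) and y (N x d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

def sinkhorn_coupling(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT coupling (assumed solver, not from the paper)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns toward marginal b
        u = a / (K @ v)              # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

# Hypothetical stand-ins: 50 acoustic frames and 8 text tokens, both 16-dim.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(50, 16))
linguistic = rng.normal(size=(8, 16))

# Uniform mass on frames and tokens (an assumption).
a = np.full(acoustic.shape[0], 1.0 / acoustic.shape[0])
b = np.full(linguistic.shape[0], 1.0 / linguistic.shape[0])

cost = cosine_cost(acoustic, linguistic)
coupling = sinkhorn_coupling(cost, a, b)      # shape (50, 8)

# Use the coupling to map acoustic frames onto token positions:
# each token receives a weighted average of the frames coupled to it.
weights = coupling / coupling.sum(axis=0, keepdims=True)
aligned = weights.T @ acoustic                # shape (8, 16)
```

Here `aligned` has one row per token, so it can be compared position by position with the PLM embeddings, for example through a distance-based matching loss. How the paper defines that loss is not stated in the truncated abstract.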
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
