Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

TL;DR
This paper introduces a hierarchical cross-modality knowledge transfer framework that uses Sinkhorn attention for CTC-based ASR. By transferring linguistic knowledge from a pretrained language model into the acoustic encoder, it reduces the character error rate by about 34% relative to the baseline.
Contribution
The paper proposes a cross-modality knowledge transfer method that uses Sinkhorn attention to align acoustic and linguistic representations, so that the acoustic model encodes more linguistic knowledge for speech recognition.
Findings
Achieved state-of-the-art CER of 3.64% on AISHELL-1
Demonstrated 34% relative CER reduction over baseline
Validated effectiveness of Sinkhorn attention in cross-modality alignment
Abstract
Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a connectionist temporal classification (CTC) based ASR system, in which hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process; standard transformer attention is a special case of this Sinkhorn attention process. The CMKT learning is intended to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved…
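The relationship between Sinkhorn attention and standard transformer attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sinkhorn_attention`, the `n_iters` parameter, and the choice of alternating column and row normalization on the exponentiated scores are assumptions made for illustration. With `n_iters=0` the sketch reduces to ordinary scaled dot-product softmax attention, which is one way to read the "special case" statement in the abstract.

```python
import numpy as np

def sinkhorn_attention(q, k, v, n_iters=5):
    """Illustrative Sinkhorn-normalized attention (hypothetical helper).

    q: (n, d) queries, k: (m, d) keys, v: (m, d_v) values.
    Returns the attended output (n, d_v) and the attention matrix (n, m).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)

    # Exponentiate with a global shift for numerical stability.
    a = np.exp(scores - scores.max())

    # Row normalization alone is exactly softmax attention.
    a = a / a.sum(axis=1, keepdims=True)

    # Each Sinkhorn iteration normalizes columns, then rows again.
    # Rows always sum to 1 on exit; column sums approach n / m.
    for _ in range(n_iters):
        a = a / a.sum(axis=0, keepdims=True)
        a = a / a.sum(axis=1, keepdims=True)

    return a @ v, a


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))   # e.g. linguistic token queries (assumed)
    k = rng.normal(size=(4, 8))   # e.g. acoustic frame keys (assumed)
    v = rng.normal(size=(4, 3))
    out, attn = sinkhorn_attention(q, k, v, n_iters=20)
    print(attn.sum(axis=1))  # row sums
    print(attn.sum(axis=0))  # column sums
```

For a square attention matrix, increasing `n_iters` pushes the matrix toward being doubly stochastic, so that no single key absorbs most of the attention mass. Softmax attention constrains only the rows.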
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
