Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

arXiv:2309.16093 · eess.AS · September 29, 2023

Open Access

TL;DR

This paper introduces a hierarchical cross-modality knowledge transfer framework using Sinkhorn attention for CTC-based ASR. By transferring linguistic knowledge from a pretrained language model into the acoustic encoder, it significantly reduces the character error rate (CER).

Contribution

It proposes a novel cross-modality knowledge transfer method with Sinkhorn attention, enhancing linguistic knowledge encoding in acoustic models for speech recognition.

Findings

01. Achieved state-of-the-art CER of 3.64% on AISHELL-1

02. Demonstrated 34% relative CER reduction over baseline

03. Validated effectiveness of Sinkhorn attention in cross-modality alignment
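
To make the idea of Sinkhorn attention more concrete, here is a minimal sketch, not the authors' implementation. It assumes the common reading in which ordinary transformer attention is a single row-wise softmax, and Sinkhorn attention continues from there by alternating column and row normalization. The function names, the `n_iters` parameter, and the toy score matrix are all hypothetical and chosen only for illustration.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: the usual transformer attention normalization."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sinkhorn_attention(scores, n_iters=5):
    """Alternate column and row normalization after an initial softmax.

    n_iters is a hypothetical parameter; n_iters=0 returns plain softmax,
    which is the sense in which softmax attention is a special case here.
    """
    attn = softmax_rows(scores)
    for _ in range(n_iters):
        # Column normalization: the total weight on each key becomes 1.
        col_sums = [sum(col) for col in zip(*attn)]
        attn = [[v / c for v, c in zip(row, col_sums)] for row in attn]
        # Row normalization: the weights of each query sum to 1 again.
        normalized = []
        for row in attn:
            total = sum(row)
            normalized.append([v / total for v in row])
        attn = normalized
    return attn

# Toy square score matrix (assumed values, e.g. acoustic vs. text tokens).
scores = [[2.0, 0.5, 0.1],
          [0.3, 1.5, 0.2],
          [0.1, 0.4, 1.8]]
plain = sinkhorn_attention(scores, n_iters=0)      # ordinary softmax attention
balanced = sinkhorn_attention(scores, n_iters=50)  # close to doubly stochastic
```

For a square matrix with positive entries, the iterations drive both row sums and column sums toward 1, so no single key can absorb most of the attention mass. For a rectangular matrix both constraints cannot hold at once with unit sums, and a real implementation would need different target marginals.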

Abstract

Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework for a connectionist temporal classification (CTC) based ASR system, in which hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process, of which transformer attention is a special case. The CMKT learning is intended to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved…
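
The CTC greedy decoding mentioned in the abstract can be illustrated with a short sketch. This is the generic textbook procedure, not code from the paper: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The function name, the blank index of 0, and the toy frame scores are assumptions made for the example.

```python
from itertools import groupby

def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding over per-frame label scores.

    frame_scores: list of frames, each a list of scores over the vocabulary.
    blank: index of the CTC blank label (assumed to be 0 here).
    """
    # Step 1: pick the highest-scoring label for every frame.
    best = [max(range(len(frame)), key=frame.__getitem__)
            for frame in frame_scores]
    # Step 2: merge consecutive repeats. Step 3: remove blanks.
    return [label for label, _ in groupby(best) if label != blank]

# Toy example: the per-frame argmax sequence is 0 1 1 0 1 2 2 0.
frames = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # → [1, 1, 2]
```

The blank between the two runs of label 1 is what keeps them as two separate output tokens. No language model or beam search is involved, which is why results obtained with this decoding reflect what the acoustic encoder itself has learned.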

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing