A context-aware knowledge transferring strategy for CTC-based ASR
Ke-Han Lu, Kuan-Yu Chen

TL;DR
This paper introduces a context-aware knowledge transfer approach for CTC-based ASR that draws linguistic information from a pre-trained language model to mitigate the limitation imposed by CTC's conditional independence assumption, improving recognition performance.
Contribution
The paper proposes a knowledge transferring module and a context-aware training strategy that together enhance CTC-based ASR by integrating linguistic context from a pre-trained language model.
Findings
Improved accuracy on AISHELL datasets
Effective mitigation of token independence assumption
Enhanced performance with knowledge injection
Abstract
Non-autoregressive automatic speech recognition (ASR) modeling has received increasing attention recently because of its fast decoding speed and superior performance. Among representatives, methods based on the connectionist temporal classification (CTC) are still a dominating stream. However, the theoretically inherent flaw, the assumption of independence between tokens, creates a performance barrier for the school of works. To mitigate the challenge, we propose a context-aware knowledge transferring strategy, consisting of a knowledge transferring module and a context-aware training strategy, for CTC-based ASR. The former is designed to distill linguistic information from a pre-trained language model, and the latter is framed to modulate the limitations caused by the conditional independence assumption. As a result, a knowledge-injected context-aware CTC-based ASR built upon the…
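The independence assumption mentioned in the abstract can be made concrete with a toy example. The sketch below is not the paper's method; it only illustrates standard CTC scoring, where the probability of a frame-level path is the product of per-frame probabilities and each frame ignores what was emitted at the others. The function names, the `_` blank symbol, and the toy `frame_probs` values are all hypothetical, and the brute-force enumeration stands in for the dynamic-programming forward algorithm that real implementations use.

```python
from itertools import product

BLANK = "_"  # hypothetical blank symbol chosen for this illustration


def collapse(path):
    """Apply the CTC collapse rule: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def path_prob(path, frame_probs):
    """Score one frame-level path.

    This product is the conditional independence assumption: the factor for
    frame t depends only on that frame's distribution, never on the symbols
    chosen at other frames.
    """
    prob = 1.0
    for t, symbol in enumerate(path):
        prob *= frame_probs[t][symbol]
    return prob


def sequence_prob(target, frame_probs):
    """Sum the probabilities of every path that collapses to `target`.

    Brute-force enumeration, exponential in the number of frames; only
    suitable for toy inputs like the one below.
    """
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            total += path_prob(path, frame_probs)
    return total


# Toy per-frame output distributions over {a, b, blank} for three frames.
frame_probs = [
    {"a": 0.6, "b": 0.3, BLANK: 0.1},
    {"a": 0.2, "b": 0.5, BLANK: 0.3},
    {"a": 0.1, "b": 0.2, BLANK: 0.7},
]

if __name__ == "__main__":
    print(collapse("aa_ab"))
    print(sequence_prob("ab", frame_probs))
```

Because every factor in `path_prob` is computed per frame, nothing in the scoring itself ties one output token to its neighbours; this is the gap that linguistic context from a language model is meant to help fill.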
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
