Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim

TL;DR
This paper proposes a cross-modal distillation approach to transfer knowledge from pre-trained language models to speech understanding modules, improving performance on spoken language understanding tasks with limited data.
Contribution
The paper introduces a method for distilling knowledge from a Transformer-based text LM into an end-to-end speech module, aimed at improving SLU performance when labeled speech data is scarce.
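To make the idea of distilling from a text LM into a speech module concrete, here is a minimal sketch of a generic teacher-student distillation loss. It is an illustration under stated assumptions, not the authors' exact objective: it assumes a text teacher and a speech student that both emit logits over the same set of intent classes, and it uses the common temperature-scaled KL formulation. The names `distillation_loss`, `alpha`, and `temperature` are hypothetical and do not come from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Hard-label loss: negative log-probability of the true intent."""
    return -math.log(softmax(student_logits)[label])

def kl_distillation(teacher_logits, student_logits, temperature=2.0):
    """Soft-label loss: KL(teacher || student) on temperature-softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of the hard-label and soft-label terms.

    Assumption: teacher_logits come from a text LM reading the transcript,
    student_logits from a speech model reading the audio of the same
    utterance. alpha and temperature are illustrative hyperparameters.
    """
    hard = cross_entropy(student_logits, label)
    soft = kl_distillation(teacher_logits, student_logits, temperature)
    # The temperature**2 factor keeps the soft term's gradient scale
    # comparable to the hard term, as in standard distillation practice.
    return alpha * hard + (1.0 - alpha) * soft * temperature ** 2

# Hypothetical logits over three intent classes for one utterance.
teacher = [4.0, 1.0, 0.5]   # text LM, confident in class 0
student = [1.5, 1.2, 0.3]   # speech model, less certain
loss = distillation_loss(teacher, student, label=0)
```

Only the student would be updated with this loss. The teacher's logits act as fixed soft targets, which is what lets the speech module borrow information from the text model when labeled audio is limited.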
Findings
Improved SLU accuracy on the Fluent Speech Commands benchmark.
Effective knowledge transfer from text LMs to speech modules.
Validation of the hypothesis that semantic information can be shared across modalities.
Abstract
Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized end-to-end structures that preserve the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of…
