Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification
Bidisha Sharma, Maulik Madhavi, Haizhou Li

TL;DR
This paper introduces an intent classification framework that combines acoustic features from a pretrained speech recognition model with linguistic features from a pretrained language model, using knowledge distillation and cross-attention fusion. It reports 90.86% accuracy on ATIS and 99.07% on the Fluent speech corpus.
Contribution
It proposes a new method that integrates acoustic and linguistic embeddings for intent classification, leveraging pretrained models and knowledge distillation for improved performance.
Findings
Achieved 90.86% accuracy on the ATIS dataset.
Achieved 99.07% accuracy on the Fluent speech corpus.
Demonstrated the effectiveness of combining acoustic and linguistic features.
Abstract
Intent classification is a task in spoken language understanding. An intent classification system is usually implemented as a pipeline process, with a speech recognition module followed by text processing that classifies the intents. There are also studies of end-to-end systems that take acoustic features as input and classify the intents directly. Such systems don't take advantage of relevant linguistic information, and suffer from limited training data. In this work, we propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model. We use a knowledge distillation technique to map the acoustic embeddings towards linguistic embeddings. We perform fusion of both acoustic and linguistic embeddings through a cross-attention approach to classify intents. With…
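The two mechanisms named in the abstract, distillation of acoustic embeddings towards linguistic embeddings and cross-attention fusion, can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the mean-squared-error distillation loss, the mean pooling over frames and tokens, the choice of linguistic embeddings as queries and acoustic embeddings as keys and values, and the toy 2-dimensional vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(rows):
    """Average a sequence of embedding vectors into one vector (assumed pooling)."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def distillation_loss(student, teacher):
    """Mean squared error between two equal-length vectors (assumed loss form)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value rows."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused


# Hypothetical toy embeddings: 3 acoustic frames and 2 linguistic tokens, dimension 2.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
linguistic = [[0.5, 0.5], [1.0, 0.0]]

# Distillation term: pull the pooled acoustic embedding towards the linguistic one.
loss = distillation_loss(mean_pool(acoustic), mean_pool(linguistic))

# Fusion: linguistic tokens (queries) attend over acoustic frames (keys and values).
fused = cross_attention(linguistic, acoustic, acoustic)
```

In a trained system the fused rows would feed an intent classifier and the distillation term would be minimized alongside the classification loss; the sketch only shows the shape of the two computations.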
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Knowledge Distillation
