Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Kuo, Samuel Thomas, Edmilson Morais

TL;DR
This paper introduces a flexible end-to-end spoken language understanding system capable of predicting intents from speech, transcripts, or both, leveraging pre-trained models and cross-modal training to improve performance and address data scarcity.
Contribution
The paper presents a novel multi-input SLU system that combines speech and text modalities using pre-trained models and cross-modal training, enhancing robustness and performance.
Findings
Achieves strong intent classification performance on SLU datasets.
Combines speech and transcript inputs for improved accuracy.
Utilizes pre-trained acoustic and text models for better generalization.
Abstract
A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcripts are accessible. Second, intent-labeled speech data is scarce. To address the first challenge, we propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, through system combination, we achieve better results than using a single input modality. To address the second challenge, we leverage a semantically robust…
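The flexible-input idea in the abstract (speech only, ASR transcript only, or both) can be illustrated with a small late-fusion sketch. This is only an assumption about how such a combination might look: the abstract does not specify the paper's combination method, and the function names `softmax` and `predict_intent`, the `speech_weight` parameter, and the weighted averaging of posteriors are all hypothetical choices made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw intent scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(
    speech_logits: Optional[Sequence[float]] = None,
    text_logits: Optional[Sequence[float]] = None,
    speech_weight: float = 0.5,
) -> Tuple[int, List[float]]:
    """Return (intent index, posterior) from whichever inputs are present.

    Hypothetical late fusion: when both modalities are available, their
    posteriors are mixed with a fixed weight. This is an illustrative
    assumption, not the combination scheme used in the paper.
    """
    if speech_logits is None and text_logits is None:
        raise ValueError("at least one of speech_logits or text_logits is required")

    if speech_logits is None:
        probs = softmax(text_logits)        # transcript-only path
    elif text_logits is None:
        probs = softmax(speech_logits)      # speech-only path
    else:
        if len(speech_logits) != len(text_logits):
            raise ValueError("both models must score the same intent set")
        p_speech = softmax(speech_logits)
        p_text = softmax(text_logits)
        probs = [
            speech_weight * ps + (1.0 - speech_weight) * pt
            for ps, pt in zip(p_speech, p_text)
        ]

    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Calling `predict_intent(speech_logits=..., text_logits=...)` uses both models, while passing only one argument falls back to the single available modality, which mirrors the privacy scenario where only ASR transcripts are accessible.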
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Softmax · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece · Dropout · Layer Normalization
