Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems
Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik Audhkhasi, Brian Kingsbury, Ron Hoory, Michael Picheny

TL;DR
This paper explores leveraging unpaired text data and transfer learning techniques to improve end-to-end speech-to-intent systems, reducing the need for large amounts of labeled speech data while maintaining high accuracy.
Contribution
It introduces two ways to bring text resources into speech-to-intent models, transfer learning with BERT embeddings and data augmentation via text-to-speech, recovering most of the accuracy lost when intent-labeled speech is limited.
Findings
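The CTC-based S2I system matches a state-of-the-art cascaded SLU system
With a tenth of the original data, intent classification accuracy drops by 7.6% absolute
Transfer learning from text improves intent classification accuracy
Data augmentation recovers 80% of the performance lost to limited data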
Abstract
Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for…
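To make the two figures quoted on this page concrete, the short sketch below converts "80% of the performance loss" into absolute accuracy points, using the 7.6% absolute drop from the abstract. It is a back-of-the-envelope calculation only: the variable names are illustrative, and the baseline accuracy itself is not stated here, so no final accuracy is computed.

```python
# Figures quoted on this page; the baseline accuracy is not given here.
drop_abs = 7.6             # absolute accuracy loss (points) with a tenth of the data
recovered_fraction = 0.80  # share of that loss reported as recovered

# Points of accuracy won back, and the gap that still remains.
recovered_abs = drop_abs * recovered_fraction
remaining_gap = drop_abs - recovered_abs

print(f"recovered: {recovered_abs:.2f} points, remaining gap: {remaining_gap:.2f} points")
```

Under these numbers, roughly 6.1 of the 7.6 lost points are regained, leaving a gap of about 1.5 points relative to training on the full data.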
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Linear Layer · Adam · Dense Connections · WordPiece · Multi-Head Attention · Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dropout
