Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

Yinghui Huang; Hong-Kwang Kuo; Samuel Thomas; Zvi Kons; Kartik Audhkhasi; Brian Kingsbury; Ron Hoory; Michael Picheny

arXiv:2010.04284 · cs.CL · October 12, 2020

Open Access

TL;DR

This paper explores leveraging unpaired text data and transfer learning techniques to improve end-to-end speech-to-intent systems, reducing the need for large amounts of labeled speech data while maintaining high accuracy.

Contribution

It introduces methods to incorporate text resources into speech-to-intent models, including transfer learning with BERT embeddings and data augmentation via text-to-speech, achieving significant performance recovery.
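
One way to picture transfer learning from text embeddings is as an extra loss term that pulls the speech model's utterance embedding toward the text model's embedding of the same sentence. The sketch below is only an illustration under stated assumptions: the function names, the mean-squared-error tying term, and the `tie_weight` parameter are hypothetical choices, not details confirmed on this page, and the embeddings are plain Python lists standing in for real acoustic and BERT vectors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the correct intent class."""
    return -math.log(softmax(logits)[label])


def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tied_loss(intent_logits, label, acoustic_emb, text_emb, tie_weight=1.0):
    """Intent classification loss plus a term tying the acoustic embedding
    to the text embedding. Assumption: MSE is used for tying here purely
    for illustration."""
    classification = cross_entropy(intent_logits, label)
    tying = mse(acoustic_emb, text_emb)
    return classification + tie_weight * tying


# Toy usage: three intent classes, four-dimensional embeddings (made-up values).
logits = [2.0, 0.5, -1.0]
acoustic = [0.1, 0.4, -0.2, 0.3]
text = [0.0, 0.5, -0.1, 0.3]
print(tied_loss(logits, 0, acoustic, text, tie_weight=0.5))
```

With `tie_weight=0.0` the loss reduces to ordinary intent classification, so the weight controls how strongly the speech side is pushed toward the text representation.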

Findings

01 A CTC-based S2I system matches the performance of a state-of-the-art cascaded SLU system

02 Transfer learning improves intent classification accuracy

03 Data augmentation recovers 80% of performance loss

Abstract

Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time-consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for…
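
As a rough reading of the numbers on this page, the snippet below works out what an 80% recovery would mean in absolute accuracy points. It assumes that the 80% figure listed under Findings applies to the 7.6% absolute drop quoted in the abstract; the truncated abstract shown here does not confirm that pairing, so treat the result as illustrative.

```python
# Assumption: the 80% recovery figure applies to the 7.6-point absolute drop.
drop_points = 7.6          # absolute accuracy loss with a tenth of the data (abstract)
recovered_fraction = 0.80  # recovery figure from the Findings list

recovered_points = drop_points * recovered_fraction
remaining_gap = drop_points - recovered_points

print(f"recovered: {recovered_points:.2f} points, remaining gap: {remaining_gap:.2f} points")
```

Under that assumption, roughly 6.1 of the 7.6 lost points come back, leaving a gap of about 1.5 points relative to training on the full data.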

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Linear Layer · Adam · Dense Connections · WordPiece · Multi-Head Attention · Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dropout