Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis

Jianqiao Lu; Wenyong Huang; Nianzu Zheng; Xingshan Zeng; Yu Ting Yeung; Xiao Chen

arXiv:2310.05374·cs.CL·October 25, 2023

TL;DR

This paper introduces Latent Synthesis (LaSyn), a framework that efficiently utilizes textual data by converting it into pseudo acoustic representations, significantly improving end-to-end speech processing tasks like ASR and SLU, especially in low-resource settings.

Contribution

LaSyn is a novel framework that leverages a latent synthesizer to augment speech training data with pseudo acoustic representations derived from text, enhancing model performance.

Findings

01. LaSyn reduces word error rate by over 22.3% (relative) in ASR, against an E2E baseline trained on LibriSpeech train-clean-100.

02. LaSyn improves intent classification accuracy by 4.1% in SLU.

03. LaSyn achieves competitive results with fewer parameters.
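The abstract states the ASR gain as a relative word error rate reduction, which is easy to confuse with an absolute one. The sketch below shows the difference between the two. The function names and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Plain difference in WER, in percentage points."""
    return baseline_wer - new_wer


# Hypothetical WER values in percent, NOT taken from the paper.
baseline, improved = 10.0, 7.5

print(f"relative: {relative_reduction(baseline, improved):.1%}")      # relative: 25.0%
print(f"absolute: {absolute_reduction(baseline, improved):.1f} pts")  # absolute: 2.5 pts
```

With these made-up numbers, the same improvement reads as 25.0% relative but only 2.5 points absolute, so the type of reduction matters when comparing reported gains.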

Abstract

Training a high-performance end-to-end (E2E) speech processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive to collect than textual data. We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. We train a latent synthesizer to convert textual data into an intermediate latent representation of a pre-trained speech model. These pseudo acoustic representations of textual data augment acoustic data for model training. We evaluate LaSyn on low-resource automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100, with relative word error rate reductions over 22.3% on different…
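The data flow described in the abstract can be pictured with a toy sketch: latents computed from real speech and pseudo latents synthesized from text land in one shared training pool. Everything here is an assumption made for illustration. The function names, the latent dimension, and the arithmetic inside the two stand-in functions are hypothetical, and the paper's latent synthesizer is a trained model, not the character-hashing placeholder used below.

```python
import random

LATENT_DIM = 4  # hypothetical size of the intermediate latent vectors


def speech_to_latent(audio_frames):
    """Stand-in for the part of a pre-trained speech model that produces
    the intermediate latent representation (toy arithmetic, not a real model)."""
    return [[sum(frame) / len(frame)] * LATENT_DIM for frame in audio_frames]


def latent_synthesizer(text):
    """Stand-in for the trained text-to-latent synthesizer: emits one
    pseudo acoustic vector per character (toy arithmetic, not a real model)."""
    return [[(ord(ch) % 32) / 32.0] * LATENT_DIM for ch in text]


def build_training_pool(paired_speech, text_only):
    """Merge real latents (from audio) with pseudo latents (from text)."""
    pool = []
    for audio_frames, transcript in paired_speech:
        pool.append({"latent": speech_to_latent(audio_frames),
                     "target": transcript, "source": "speech"})
    for sentence in text_only:
        pool.append({"latent": latent_synthesizer(sentence),
                     "target": sentence, "source": "text"})
    return pool


# Made-up inputs: one paired utterance and two text-only sentences.
paired = [([[0.1, 0.3], [0.2, 0.4]], "hi")]
texts = ["turn on the light", "play some jazz"]

pool = build_training_pool(paired, texts)
random.Random(0).shuffle(pool)  # mix both sources before batching
print(len(pool), sorted(item["source"] for item in pool))
```

The point of the sketch is only that both kinds of example share one representation space, so the downstream model can train on either without needing audio for the text-only sentences.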

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques