TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Tiantian Feng; Xuan Shi; Rahul Gupta; Shrikanth S. Narayanan

arXiv:2404.17983 · cs.SD · April 30, 2024

TL;DR

This paper introduces TI-ASU, a method that uses text-to-speech imputation to enable robust automatic speech understanding when large portions of speech data are missing, enhancing model resilience and privacy.

Contribution

We propose TI-ASU, a novel approach that uses a pre-trained TTS model to impute missing speech, improving ASU performance when a large share of the speech data is missing.
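The imputation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sample` record, the `mask_speech` and `impute_speech` helpers, and the `dummy_tts` stand-in are all hypothetical names introduced here, and a real setup would replace `dummy_tts` with an actual pre-trained TTS model that returns a waveform.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    text: str                      # transcript (assumed always available)
    speech: Optional[List[float]]  # waveform, or None when the speech is missing
    label: str
    imputed: bool = False          # True when speech was synthesized from text


def mask_speech(samples: List[Sample], missing_ratio: float, seed: int = 0) -> List[Sample]:
    """Simulate missing speech by dropping the audio of a random subset."""
    rng = random.Random(seed)
    n_missing = round(len(samples) * missing_ratio)
    missing_idx = set(rng.sample(range(len(samples)), n_missing))
    return [
        Sample(s.text, None if i in missing_idx else s.speech, s.label)
        for i, s in enumerate(samples)
    ]


def impute_speech(samples: List[Sample], tts: Callable[[str], List[float]]) -> List[Sample]:
    """Fill every missing speech entry with audio synthesized from its transcript."""
    return [
        Sample(s.text, tts(s.text), s.label, imputed=True) if s.speech is None else s
        for s in samples
    ]


def dummy_tts(text: str) -> List[float]:
    """Hypothetical placeholder for a pre-trained TTS model: one zero per character."""
    return [0.0] * len(text)
```

After `impute_speech`, every sample carries a speech signal again, so a downstream ASU model can be trained on the mix of real and synthesized audio.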

Findings

01 TI-ASU improves ASU accuracy even when up to 95% of the training speech is missing.

02 TI-ASU enhances robustness in dropout training scenarios.

03 Experiments demonstrate effectiveness across multi- and single-modality settings.
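This page does not say what form the dropout training takes. One common reading is modality dropout, where the speech input is randomly blanked during training so that a multimodal model learns to cope when speech is absent. The sketch below illustrates that reading only; the `modality_dropout` function, its parameters, and the zero-vector and concatenation choices are assumptions, not details from the paper.

```python
import random
from typing import List, Sequence


def modality_dropout(
    speech_emb: Sequence[float],
    text_emb: Sequence[float],
    p_drop: float,
    rng: random.Random,
) -> List[float]:
    """Return a fused feature vector, zeroing the speech part with probability p_drop.

    Assumption: fusion is plain concatenation and a dropped modality is
    represented by zeros; real systems may use other fusion or masking schemes.
    """
    if rng.random() < p_drop:
        speech_part = [0.0] * len(speech_emb)  # speech modality dropped for this step
    else:
        speech_part = list(speech_emb)
    return speech_part + list(text_emb)
```

In a training loop this would be called once per example (or per batch), with `p_drop` set to 0 at evaluation time unless missing speech is being simulated there as well.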

Abstract

Automatic Speech Understanding (ASU) aims at human-like speech interpretation, providing nuanced intent, emotion, sentiment, and content understanding from speech and language (text) content conveyed in speech. Typically, training a robust ASU model relies heavily on acquiring large-scale, high-quality speech and associated transcriptions. However, it is often challenging to collect or use speech data for training ASU due to concerns such as privacy. To approach this setting of enabling ASU when speech (audio) modality is missing, we propose TI-ASU, using a pre-trained text-to-speech model to impute the missing speech. We report extensive experiments evaluating TI-ASU on various missing scales, both multi- and single-modality settings, and the use of LLMs. Our findings show that TI-ASU yields substantial benefits to improve ASU in scenarios where even up to 95% of training speech is…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Amplifying Sine Unit: An Oscillatory Activation Function for Deep Neural Networks to Recover Nonlinear Oscillations Efficiently · Dropout