Multimodal Audio-textual Architecture for Robust Spoken Language Understanding
Anderson R. Avila, Mehdi Rezagholizadeh, Chao Xing

TL;DR
This paper introduces a multimodal audio-textual architecture that enhances spoken language understanding by reducing errors propagated from automatic speech recognition, leveraging self-supervised features from both modalities.
Contribution
The work proposes a novel multimodal language understanding (MLU) module that combines an audio encoder and a text encoder through late fusion, improving robustness against ASR errors in SLU tasks.
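To make the late-fusion idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this page does not specify the fusion details, so the mean pooling of audio frames, the concatenation of the two embeddings, the single linear classification layer, and all function names are assumptions chosen for illustration. In the paper the embeddings come from Wav2Vec and BERT/RoBERTa; toy vectors stand in for them here.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average frame-level audio embeddings into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Compute W x + b, with one row of `weights` per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Vector) -> Vector:
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_classify(
    audio_frames: List[Vector],
    text_embedding: Vector,
    weights: List[Vector],
    bias: Vector,
) -> Vector:
    """Hypothetical late fusion: each modality is encoded separately and the
    two utterance-level embeddings are only combined right before the classifier."""
    audio_embedding = mean_pool(audio_frames)
    fused = audio_embedding + text_embedding  # list concatenation (assumed fusion op)
    return softmax(linear(fused, weights, bias))


# Toy stand-ins for encoder outputs: two audio frames and one text vector.
audio_frames = [[0.2, 0.4], [0.6, 0.0]]
text_embedding = [1.0, -1.0]

# Toy classifier over two intent classes; the fused vector has 4 dimensions.
weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]

probs = late_fusion_classify(audio_frames, text_embedding, weights, bias)
print(probs)
```

Because the text branch only contributes one input to the classifier, a noisy ASR transcript degrades one half of the fused vector while the audio half is unaffected, which is the intuition behind the robustness claim.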
Findings
The MLU outperforms PLM-only baselines on multiple SLU datasets.
The approach is robust to poor-quality ASR transcripts.
It surpasses existing models in accuracy across various ASR engines.
Abstract
Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors present in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an encoder network to embed the audio signal and a text encoder to process…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Weight Decay · Residual Connection · Softmax · Adam
