Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Anderson R. Avila; Mehdi Rezagholizadeh; Chao Xing

arXiv:2306.06819·cs.CL·June 14, 2023·1 cites

Open Access

TL;DR

This paper introduces a multimodal audio-textual architecture that enhances spoken language understanding by reducing errors propagated from automatic speech recognition, leveraging self-supervised features from both modalities.

Contribution

The work proposes a novel multimodal language understanding module combining audio and text encoders with late fusion, improving robustness against ASR errors in SLU tasks.

Findings

01

The multimodal language understanding (MLU) module outperforms text-only pre-trained language models (PLMs) on multiple SLU datasets.

02

The approach is robust to poor-quality ASR transcripts.

03

It surpasses existing models in accuracy across various ASR engines.

Abstract

Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an encoder network to embed the audio signal and a text encoder to process…
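The late-fusion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each modality has already been reduced to a single fixed-size embedding (stand-ins for pooled Wav2Vec and BERT/RoBERTa outputs), that fusion is plain concatenation, and that a single linear layer with softmax predicts the intent. The dimensions, random weights, and the function name `late_fusion_classify` are all hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, where W is given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_classify(audio_emb, text_emb, weights, bias):
    """Concatenate the two modality embeddings, then classify.

    Assumption: fusion by concatenation followed by one linear layer.
    The source only states that audio and text encoders are combined
    with late fusion; the exact fusion operator is not given here.
    """
    fused = audio_emb + text_emb  # list concatenation, not element-wise sum
    return softmax(linear(fused, weights, bias))


# Hypothetical sizes; real encoder outputs are much larger.
AUDIO_DIM, TEXT_DIM, NUM_INTENTS = 4, 3, 5

rng = random.Random(0)
audio_emb = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]  # stand-in for a pooled audio embedding
text_emb = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]    # stand-in for a pooled text embedding
W = [[rng.uniform(-0.1, 0.1) for _ in range(AUDIO_DIM + TEXT_DIM)]
     for _ in range(NUM_INTENTS)]
b = [0.0] * NUM_INTENTS

probs = late_fusion_classify(audio_emb, text_emb, W, b)
predicted_intent = max(range(NUM_INTENTS), key=probs.__getitem__)
```

Because the classifier sees the audio embedding alongside the text embedding, a transcript corrupted by ASR errors is not its only source of evidence, which is the intuition behind the robustness claim.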

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Weight Decay · Residual Connection · Softmax · Adam