Building Robust Spoken Language Understanding by Cross Attention between   Phoneme Sequence and ASR Hypothesis

Zexun Wang; Yuquan Le; Yi Zhu; Yuming Zhao; Mingchao Feng; Meng Chen; Xiaodong He

arXiv:2203.12067·cs.CL·March 24, 2022

Open Access

TL;DR

This paper introduces CASLU, a cross-attention-based model that combines phoneme sequences with ASR hypotheses to make spoken language understanding more robust to recognition errors, evaluated on three datasets.

Contribution

The paper presents a novel cross-attention model that captures phonetic and semantic features jointly, enhancing SLU robustness to ASR errors and demonstrating its universality and complementarity.

Findings

01. CASLU outperforms baseline models on three datasets.

02. The model effectively captures phonetic and semantic interactions.

03. CASLU complements existing robust SLU techniques.

Abstract

Building Spoken Language Understanding (SLU) robust to Automatic Speech Recognition (ASR) errors is an essential issue for various voice-enabled virtual assistants. Considering that most ASR errors are caused by phonetic confusion between similar-sounding expressions, intuitively, leveraging the phoneme sequence of speech can complement ASR hypothesis and enhance the robustness of SLU. This paper proposes a novel model with Cross Attention for SLU (denoted as CASLU). The cross attention block is devised to catch the fine-grained interactions between phoneme and word embeddings in order to make the joint representations catch the phonetic and semantic features of input simultaneously and for overcoming the ASR errors in downstream natural language understanding (NLU) tasks. Extensive experiments are conducted on three datasets, showing the effectiveness and competitiveness of our…
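The cross attention idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the block's internals, so the code below assumes plain scaled dot-product attention without learned projections, random stand-in embeddings, and mean pooling followed by concatenation to form a joint representation. The names `cross_attention`, `word_emb`, and `phoneme_emb` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention of one sequence over another.

    Assumption: no learned query/key/value projections, for brevity.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values, weights

rng = np.random.default_rng(0)
dim = 8
word_emb = rng.normal(size=(4, dim))      # stand-in for 4 ASR hypothesis tokens
phoneme_emb = rng.normal(size=(11, dim))  # stand-in for 11 phonemes

# Each word attends over all phonemes, and each phoneme over all words.
word_ctx, w2p = cross_attention(word_emb, phoneme_emb)
phon_ctx, p2w = cross_attention(phoneme_emb, word_emb)

# Assumed pooling: average each attended sequence, then concatenate.
joint = np.concatenate([word_ctx.mean(axis=0), phon_ctx.mean(axis=0)])
```

In this sketch `joint` would be passed to a downstream NLU classifier; a trained model would also learn the embeddings and the attention projections.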

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems