Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

Esaú Villatoro-Tello; Srikanth Madikeri; Juan Zuluaga-Gomez; Bidisha Sharma; Seyyed Saeed Sarfjoo; Iuliia Nigmatulina; Petr Motlicek; Alexei V. Ivanov; Aravind Ganapathiraju

arXiv:2212.08489·cs.CL·February 2, 2024

TL;DR

This paper evaluates text-based, lattice-based, and multimodal representations for intent classification in Spoken Language Understanding (SLU), and highlights the benefits of richer ASR outputs and crossmodal approaches.

Contribution

The paper introduces a comprehensive benchmark of SLU systems built on different representations, including a novel multimodal approach, and analyzes their performance under different conditions, e.g., automatically vs. manually generated transcripts.

Findings

1. Richer ASR outputs (word consensus networks) yield a 5.5% relative improvement over the 1-best setup.

2. Crossmodal learning achieves a 17.8% relative improvement over 1-best transcripts.

3. Multimodal approaches match oracle performance, overcoming transcript limitations.
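
The findings above are stated as relative improvements. As a minimal sketch of how such a figure is usually computed, assuming a higher-is-better metric such as accuracy (the listing does not say which metric underlies the percentages), the function below divides the gain by the baseline score. The name `relative_improvement` and the two scores are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Return the gain of `system` over `baseline` as a percentage of `baseline`.

    Assumes a higher-is-better metric (e.g., accuracy); this is an
    illustrative helper, not code from the paper's repository.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (system - baseline) / baseline


# Hypothetical scores, NOT taken from the paper.
baseline_acc = 0.80  # e.g., a system fed 1-best transcripts
system_acc = 0.84    # e.g., a system fed a richer representation

print(round(relative_improvement(baseline_acc, system_acc), 1))
```

A relative figure depends on the baseline: the same absolute gain of four points is a larger relative improvement when the baseline is lower.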

Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from…
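
To make the contrast between the 1-best setup and word consensus networks concrete, here is a minimal sketch, assuming the common view of a consensus (confusion) network as a sequence of bins, each holding alternative words with posterior probabilities. The data layout, the `<eps>` skip token, the function name `one_best`, and the toy utterance are all assumptions made for illustration; they do not reflect the format used in the paper or in idiap/slu_representations.

```python
from typing import List, Tuple

# One bin = competing word hypotheses for the same time span,
# each paired with a posterior probability (assumed layout).
Bin = List[Tuple[str, float]]


def one_best(network: List[Bin], eps: str = "<eps>") -> List[str]:
    """Collapse a consensus network to its 1-best word sequence.

    Keeps the highest-posterior word in every bin and drops the
    epsilon (skip) token, discarding all competing alternatives.
    """
    words = []
    for alternatives in network:
        word, _ = max(alternatives, key=lambda wp: wp[1])
        if word != eps:
            words.append(word)
    return words


# Toy network, invented for illustration only.
toy_network = [
    [("play", 0.6), ("pay", 0.4)],
    [("<eps>", 0.7), ("the", 0.3)],
    [("jazz", 0.55), ("chess", 0.45)],
]

print(one_best(toy_network))  # ['play', 'jazz']
```

A 1-best system sees only `['play', 'jazz']`, while a lattice-based system can also use the near-tied alternative `chess` and its posterior, which is the kind of extra information the abstract refers to as a richer ASR output.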

Code & Models

Repositories

idiap/slu_representations (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques