A Study of Different Ways to Use The Conformer Model For Spoken Language Understanding

Nick J.C. Wang; Shaojun Wang; Jing Xiao

arXiv:2204.03879·cs.CL·April 11, 2022

Open Access

TL;DR

This paper explores various configurations of the Conformer model for spoken language understanding, proposing a novel sequence summarization technique that enhances accuracy and efficiency in end-to-end SLU systems.

Contribution

The paper introduces a new connectionist temporal summarization (CTS) method and compares different Conformer-based approaches to SLU, concluding that careful optimization of each component matters regardless of the architecture.

Findings

01. CTS improves accuracy and speed of end-to-end SLU models.

02. End-to-end models can match two-stage systems in intent recognition.

03. Optimizing each component remains essential for best performance.

Abstract

Spoken language understanding (SLU) combines automatic speech recognition (ASR) and natural language understanding (NLU) capabilities to accomplish speech-to-intent understanding. In this paper, we compare different ways to combine ASR and NLU, in particular using a single Conformer model with different ways to use its components, to better understand the strengths and weaknesses of each approach. We find that the choice between two-stage decoding and end-to-end systems does not by itself determine the best system for research or application. System optimization still entails carefully improving the performance of each component. It is difficult to prove that one direction is conclusively better than the other. In this paper, we also propose a novel connectionist temporal summarization (CTS) method to reduce the length of acoustic encoding sequences while improving the accuracy and processing speed of end-to-end models. This method achieves the same intent accuracy as the…
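
The abstract says CTS shortens the acoustic encoding sequence but, as excerpted here, does not say how. The sketch below shows one plausible way to shorten an encoder sequence using frame-level CTC predictions: blank frames are dropped and each run of consecutive frames with the same label is averaged into one vector. This is an assumption made for illustration, not the authors' confirmed procedure. The names `summarize_frames`, `frames`, `labels`, and the blank index `0` are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def summarize_frames(
    frames: Sequence[Sequence[float]],
    labels: Sequence[int],
    blank: int = BLANK,
) -> List[List[float]]:
    """Average each run of consecutive non-blank frames that share a label.

    `frames` holds one encoder output vector per time step and `labels`
    holds the greedy per-frame CTC prediction for that step. Blank frames
    are discarded and also end the current run.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    summary: List[List[float]] = []
    run_sum: List[float] = []
    run_len = 0
    prev = blank

    def flush() -> None:
        # Emit the mean of the current run, if any, and reset the run.
        nonlocal run_sum, run_len
        if run_len:
            summary.append([v / run_len for v in run_sum])
        run_sum, run_len = [], 0

    for vec, lab in zip(frames, labels):
        if lab == blank:
            flush()
        elif lab == prev and run_len:
            # Same label as the previous frame: extend the current run.
            run_sum = [s + v for s, v in zip(run_sum, vec)]
            run_len += 1
        else:
            # New non-blank label: close the old run and start a new one.
            flush()
            run_sum = list(vec)
            run_len = 1
        prev = lab
    flush()
    return summary


if __name__ == "__main__":
    # Eight one-dimensional frames collapse to three summary vectors.
    toy_frames = [[0.0], [2.0], [4.0], [9.0], [6.0], [1.0], [2.0], [3.0]]
    toy_labels = [0, 3, 3, 0, 3, 5, 5, 5]
    print(summarize_frames(toy_frames, toy_labels))
```

In this toy case the downstream intent classifier would see three vectors instead of eight, which is the kind of length reduction the abstract credits for faster processing.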

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings