A Study of Different Ways to Use The Conformer Model For Spoken Language Understanding
Nick J.C. Wang, Shaojun Wang, Jing Xiao

TL;DR
This paper explores various configurations of the Conformer model for spoken language understanding, proposing a novel sequence summarization technique that enhances accuracy and efficiency in end-to-end SLU systems.
Contribution
The paper introduces a new connectionist temporal summarization (CTS) method and compares several Conformer-based approaches to SLU, finding that careful optimization of each component matters regardless of the architecture.
Findings
CTS improves accuracy and speed of end-to-end SLU models.
End-to-end models can match two-stage systems in intent recognition.
Optimizing each component remains essential for best performance.
Abstract
SLU combines ASR and NLU capabilities to accomplish speech-to-intent understanding. In this paper, we compare different ways to combine ASR and NLU, in particular using a single Conformer model with different ways to use its components, to better understand the strengths and weaknesses of each approach. We find that it is not necessarily a choice between two-stage decoding and end-to-end systems which determines the best system for research or application. System optimization still entails carefully improving the performance of each component. It is difficult to prove that one direction is conclusively better than the other. In this paper, we also propose a novel connectionist temporal summarization (CTS) method to reduce the length of acoustic encoding sequences while improving the accuracy and processing speed of end-to-end models. This method achieves the same intent accuracy as the…
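The abstract does not describe how CTS shortens the acoustic encoding sequence, so the following is only an illustrative sketch of one way such length reduction could work, not the authors' method. It assumes each encoder frame carries a CTC-style label, drops frames labeled blank, and averages each run of consecutive frames that share a non-blank label. The function name `summarize_frames`, the `blank` parameter, and the averaging rule are all hypothetical choices made for this example.

```python
from itertools import groupby


def summarize_frames(frames, labels, blank=0):
    """Shorten a frame sequence using per-frame labels (hypothetical sketch).

    frames: list of equal-length float vectors, one per encoder frame.
    labels: list of int labels, one per frame (assumed CTC-style argmax).
    blank:  label id treated as "no output" and dropped (assumption).

    Returns one averaged vector per run of identical non-blank labels.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    summary = []
    start = 0
    for label, run in groupby(labels):
        run_len = len(list(run))
        if label != blank:
            segment = frames[start:start + run_len]
            dim = len(segment[0])
            # Average the run element-wise into a single summary vector.
            summary.append(
                [sum(vec[d] for vec in segment) / run_len for d in range(dim)]
            )
        start += run_len
    return summary


frames = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [4.0, 4.0]]
labels = [0, 7, 7, 0, 2]
short = summarize_frames(frames, labels)
print(len(frames), "->", len(short))  # 5 -> 2
```

In this toy input, five frames collapse to two vectors. A shorter sequence means less work for any downstream intent classifier, which is consistent with the speed gain the abstract reports, although the actual mechanism in the paper may differ.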
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Music and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
