Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

He Huang; Jagadeesh Balam; Boris Ginsburg

arXiv:2307.07057·cs.CL·July 17, 2023

TL;DR

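This paper shows that initializing an end-to-end speech intent classification and slot filling (SICSF) model with an ASR-pretrained encoder achieves state-of-the-art results on the SLURP dataset. The approach is more effective and more parameter-efficient than self-supervised learning (SSL) pretraining, and it outperforms cascaded (ASR+NLU) models unless an oracle ASR model is used.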

Contribution

The paper introduces an end-to-end Conformer-Transformer model initialized with an ASR-pretrained encoder, and shows that it is more accurate and more parameter-efficient than SSL-pretrained encoders and traditional cascaded approaches.

Findings

01. ASR pretraining outperforms SSL pretraining for SICSF.

02. Parameter efficiency is achieved by freezing the ASR-pretrained encoder and adding adapters.

03. E2E models outperform cascaded models unless an oracle ASR model is used.

Abstract

We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, we freeze the encoder and add Adapter modules, and show that parameter efficiency is only achievable with an ASR-pretrained encoder, while the SSL encoder needs full finetuning to achieve comparable results. In addition, we provide an in-depth comparison on end-to-end models versus cascading models (ASR+NLU), and show that E2E models are better than cascaded models unless an oracle ASR model is…
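The abstract mentions freezing the encoder and adding Adapter modules but does not describe the adapter design. As a rough illustration only, the sketch below assumes the common residual bottleneck form (down-projection, ReLU, up-projection, skip connection) in plain Python. The names `BottleneckAdapter` and `frozen_encoder_layer`, the dimensions, and the initialization are hypothetical and are not taken from the paper.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector (lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + W_up * relu(W_down * x)."""

    def __init__(self, d_model, bottleneck, seed=0):
        rng = random.Random(seed)
        # Small random down-projection: d_model -> bottleneck.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Zero-initialized up-projection (an assumption): the adapter
        # starts as the identity, so the frozen encoder's output is
        # unchanged before any training step.
        self.w_up = [[0.0] * bottleneck for _ in range(d_model)]
        self.b_up = [0.0] * d_model

    def __call__(self, x):
        hidden = [max(0.0, h) for h in linear(x, self.w_down, self.b_down)]
        update = linear(hidden, self.w_up, self.b_up)
        return [xi + ui for xi, ui in zip(x, update)]

    def num_params(self):
        """Count the trainable parameters: two weight matrices, two biases."""
        d_model, bottleneck = len(self.w_up), len(self.w_down)
        return 2 * d_model * bottleneck + d_model + bottleneck


def frozen_encoder_layer(x):
    """Stand-in for a pretrained encoder layer whose weights stay fixed."""
    return [2.0 * xi for xi in x]


adapter = BottleneckAdapter(d_model=4, bottleneck=2)
features = frozen_encoder_layer([0.5, -1.0, 0.25, 2.0])
adapted = adapter(features)
print(adapted, adapter.num_params())
```

In this setup only the adapter's parameters would be trained, and their count grows with `d_model * bottleneck` rather than with the size of the encoder, which is the sense in which a frozen encoder plus adapters is parameter-efficient.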

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Adapter