Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling
He Huang, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper shows that initializing an end-to-end model with an ASR-pretrained encoder significantly improves speech intent classification and slot filling. The approach achieves state-of-the-art results on SLURP, is more parameter-efficient than SSL pretraining, and compares favorably with cascaded (ASR+NLU) models.
Contribution
The paper introduces an end-to-end Conformer-Transformer model initialized with an ASR-pretrained encoder, and shows better performance and efficiency than SSL pretraining and traditional cascaded approaches.
Findings
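ASR pretraining outperforms SSL pretraining for speech intent classification and slot filling (SICSF).
Parameter efficiency is achieved by freezing the ASR-pretrained encoder and adding Adapter modules; an SSL encoder needs full finetuning to reach comparable results.
E2E models outperform cascaded models unless an oracle ASR model is used.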
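To make the parameter-efficiency finding concrete, the sketch below counts how many weights are trained when the encoder is frozen and only adapters plus the decoder are updated. It assumes a common bottleneck adapter design (one down-projection and one up-projection with biases, one adapter per encoder layer); the paper's exact adapter layout is not given here, and the names `AdapterConfig`, `adapter_params`, and `trainable_fraction` are hypothetical, as are any sizes passed to them.

```python
from dataclasses import dataclass


@dataclass
class AdapterConfig:
    """Hypothetical sizes for illustration; not taken from the paper."""
    d_model: int     # encoder hidden size
    bottleneck: int  # adapter bottleneck width
    num_layers: int  # encoder layers, assuming one adapter per layer


def adapter_params(cfg: AdapterConfig) -> int:
    """Parameters added by all adapters (assumed: linear down + linear up, with biases)."""
    down = cfg.d_model * cfg.bottleneck + cfg.bottleneck  # weights + bias
    up = cfg.bottleneck * cfg.d_model + cfg.d_model       # weights + bias
    return cfg.num_layers * (down + up)


def trainable_fraction(encoder_params: int, decoder_params: int,
                       cfg: AdapterConfig) -> float:
    """Share of all weights that are trained when the encoder is frozen.

    Trainable: adapters + decoder. Frozen: the pretrained encoder.
    """
    added = adapter_params(cfg)
    trainable = added + decoder_params
    total = encoder_params + decoder_params + added
    return trainable / total


if __name__ == "__main__":
    # Toy numbers chosen only to show the arithmetic.
    cfg = AdapterConfig(d_model=4, bottleneck=2, num_layers=3)
    print(adapter_params(cfg))
    print(round(trainable_fraction(1000, 100, cfg), 4))
```

With realistic sizes the encoder dominates the total, so the trainable share stays small as long as the bottleneck width is much smaller than the hidden size.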
Abstract
We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, we freeze the encoder and add Adapter modules, and show that parameter efficiency is only achievable with an ASR-pretrained encoder, while the SSL encoder needs full finetuning to achieve comparable results. In addition, we provide an in-depth comparison on end-to-end models versus cascading models (ASR+NLU), and show that E2E models are better than cascaded models unless an oracle ASR model is…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Adapter
