Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model
Kai-Wei Chang, Ming-Hsin Chen, Yun-Ping Lin, Jing Neng Hsu, Paul Kuo-Ming Huang, Chien-yu Huang, Shang-Wen Li, Hung-yi Lee

TL;DR
This paper shows that prompting and adapter tuning on Wav2Seq, a self-supervised encoder-decoder speech model, improve sequence generation and low-resource multilingual ASR, and outperform conventional fine-tuning when the number of trainable parameters is limited.
Contribution
It applies prompting and adapter tuning to a self-supervised encoder-decoder speech model (Wav2Seq) and shows their effectiveness on sequence generation and cross-lingual ASR.
Findings
Prompting is competitive with fine-tuning in low-resource scenarios.
Prompting on Wav2Seq achieves a 53% relative WER improvement in ASR over previous speech prompting work.
Prompting outperforms adapter tuning in low-resource settings.
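The "relative" figure in the findings above is a ratio against a baseline, not an absolute drop in percentage points. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT results from the paper:
# going from 30.0 to 15.0 is a 15-point absolute drop and a 50% relative reduction.
example_reduction = relative_wer_reduction(baseline_wer=30.0, new_wer=15.0)
print(f"relative WER reduction: {example_reduction:.0%}")
```

The same absolute drop yields a different relative figure depending on the baseline, so relative numbers are only comparable when the baseline system is stated.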
Abstract
Prompting and adapter tuning have emerged as efficient alternatives to fine-tuning (FT) methods. However, existing studies on speech prompting focused on classification tasks and failed on more complex sequence generation tasks. Besides, adapter tuning has been applied primarily to encoder-only self-supervised models. Our experiments show that prompting on Wav2Seq, a self-supervised encoder-decoder model, surpasses previous works in sequence generation tasks. It achieves a remarkable 53% relative improvement in word error rate for ASR and a 27% improvement in F1 score for slot filling. Additionally, prompting competes with the FT method in the low-resource scenario. Moreover, we show the transferability of prompting and adapter tuning on Wav2Seq in cross-lingual ASR. When limited trainable parameters are involved, prompting and adapter tuning consistently outperform conventional FT across 7…
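To give a rough sense of what "limited trainable parameters" can mean, the sketch below counts the parameters added by two common parameter-efficient setups. It assumes a generic bottleneck adapter (a down-projection and an up-projection, each with a bias) inserted once per layer, and a prompt made of trainable vectors prepended at the input only. The dimensions, layer count, and function names are hypothetical, and the paper's actual prompt and adapter configurations may differ.

```python
def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameters in one generic bottleneck adapter (assumed design)."""
    down_projection = d_model * bottleneck + bottleneck  # weights + bias
    up_projection = bottleneck * d_model + d_model       # weights + bias
    return down_projection + up_projection


def prompt_params(prompt_length: int, d_model: int) -> int:
    """Parameters in a block of trainable prompt vectors (input-level only)."""
    return prompt_length * d_model


# Hypothetical sizes for illustration; not the paper's configuration.
D_MODEL = 768
BOTTLENECK = 32
NUM_LAYERS = 12
PROMPT_LENGTH = 100

adapter_total = NUM_LAYERS * bottleneck_adapter_params(D_MODEL, BOTTLENECK)
prompt_total = prompt_params(PROMPT_LENGTH, D_MODEL)

print(f"adapters: {adapter_total:,} trainable parameters")
print(f"prompts:  {prompt_total:,} trainable parameters")
```

In both setups the pre-trained model weights stay frozen, so only these added parameters are updated per task or language.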
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Adapter · Focus
