Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
Keqi Deng, Guangzhi Sun, Philip C. Woodland

TL;DR
Wav2Prompt introduces an end-to-end speech prompt generation method that enables zero- and few-shot learning with large language models, improving performance on spoken language tasks without extensive task-specific data.
Contribution
It presents a novel training approach that learns speech representations aligned with LLM prompts, avoiding overfitting and preserving LLM capabilities, and demonstrates competitive zero-shot and few-shot performance.
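As a rough illustration of what "aligned with LLM prompts" can mean in practice, the sketch below scores predicted prompt vectors against the LLM's own token embeddings. It assumes a mean-squared-error objective, which this summary does not state; the function name `embedding_mse` and the toy `embedding_table` are hypothetical and only stand in for a real LLM embedding layer.

```python
from typing import Dict, List


def embedding_mse(
    predicted: List[List[float]],
    token_ids: List[int],
    embedding_table: Dict[int, List[float]],
) -> float:
    """Mean squared error between predicted vectors and target token embeddings.

    Assumption: one predicted vector per transcript token, i.e. the speech
    side has already been aligned to the token sequence.
    """
    # Look up the (frozen) LLM embedding of each transcript token.
    targets = [embedding_table[t] for t in token_ids]
    if len(predicted) != len(targets):
        raise ValueError("need exactly one predicted vector per token")

    total = 0.0
    count = 0
    for pred_vec, target_vec in zip(predicted, targets):
        for p, t in zip(pred_vec, target_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy 2-dimensional embedding table with two tokens (illustrative values only).
toy_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
loss = embedding_mse([[1.0, 0.0], [0.0, 0.0]], [0, 1], toy_table)
```

Because the targets are the LLM's own input embeddings rather than task labels, a model trained this way is pushed toward producing inputs the LLM already knows how to read, which is one plausible reading of how the approach avoids task over-fitting.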
Findings
Performs comparably to ASR-LLM cascades in zero-shot tasks.
Achieves significant improvements in few-shot scenarios, e.g., a gain of 8.5 BLEU points over an ASR-LLM cascade in speech translation.
Effective for multiple spoken language understanding tasks.
Abstract
Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in prior work and preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), speech understanding (SLU), speech question answering (SQA) and spoken-query-based QA (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs…
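The continuous integrate-and-fire (CIF) mechanism mentioned in the abstract can be sketched as follows. This is a minimal, generic version of CIF, not the paper's implementation: the function name `cif`, the fixed threshold of 1.0, and the assumption that every per-frame weight is below the threshold (so a frame triggers at most one firing) are choices made here for illustration. In a real model the weights would be predicted by the network.

```python
from typing import List


def cif(
    frames: List[List[float]],
    alphas: List[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Accumulate weighted frames and emit one vector each time the
    accumulated weight reaches `threshold`.

    Assumption: each alpha is in [0, threshold), so a single frame can
    complete at most one output vector.
    """
    dim = len(frames[0]) if frames else 0
    outputs: List[List[float]] = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim

    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # No boundary yet: keep integrating this frame.
            acc_weight += alpha
            acc_vec = [a + alpha * f for a, f in zip(acc_vec, frame)]
        else:
            # Boundary reached: spend just enough weight to hit the threshold,
            # fire the integrated vector, and carry the remainder forward.
            used = threshold - acc_weight
            outputs.append([a + used * f for a, f in zip(acc_vec, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * f for f in frame]

    # Any leftover weight below the threshold is dropped in this sketch.
    return outputs


# Three 1-dimensional frames whose weights sum to 2.0, so two vectors fire.
prompt_vectors = cif([[4.0], [8.0], [2.0]], [0.75, 0.75, 0.5])
```

Each fired vector corresponds to one output position, which is what makes the speech-to-text alignment explicit: the number of emitted vectors is governed by the total weight, so it can be matched to the number of text tokens.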
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
