SpeechMapper: Speech-to-text Embedding Projector for LLMs
Biswesh Mohapatra, Marcely Zanon Boito, Ioan Calapodescu

TL;DR
SpeechMapper introduces a cost-efficient, robust method for integrating speech models with large language models, reducing overfitting and training costs while maintaining high performance across speech tasks.
Contribution
It proposes a two-stage training approach: the speech-to-embedding projector is first pretrained without the LLM on inexpensive hardware, then attached to the target LLM through a brief instruction-tuning stage, improving generalization and reducing computational requirements.
Findings
SpeechMapper rivals top speech LLMs without task-specific training.
It outperforms existing models in task-specific settings with less data and compute.
The approach is versatile across speech translation and question answering tasks.
Abstract
Current speech LLMs bridge speech foundation models to LLMs using projection layers, training all of these components on speech instruction data. This strategy is computationally intensive and susceptible to task and prompt overfitting. We present SpeechMapper, a cost-efficient speech-to-LLM-embedding training approach that mitigates overfitting, enabling more robust and generalizable models. Our model is first pretrained without the LLM on inexpensive hardware, and then efficiently attached to the target LLM via a brief 1K-step instruction tuning (IT) stage. Through experiments on speech translation and spoken question answering, we demonstrate the versatility of SpeechMapper's pretrained block, presenting results for both task-agnostic IT, an ASR-based adaptation strategy that does not train on the target task, and task-specific IT. In task-agnostic settings, SpeechMapper rivals the…
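The two-stage idea in the abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes a plain linear projector, a mean-squared-error objective against frozen LLM token embeddings, and speech frames already aligned one-to-one with transcript tokens. None of these details are stated above, and every name (`project`, `pretrain_step`, `embedding_table`) and size is hypothetical.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
D_SPEECH, D_LLM, VOCAB = 4, 3, 5

def project(frames, weights):
    """Linear projector: one LLM-sized vector per speech frame."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]

def mse(pred, target):
    """Mean squared error over all vector entries."""
    total = sum((p - t) ** 2
                for pv, tv in zip(pred, target)
                for p, t in zip(pv, tv))
    return total / (len(pred) * len(pred[0]))

def pretrain_step(frames, targets, weights, lr=0.1):
    """One gradient-descent step on the MSE; only the projector is updated.

    Returns the loss measured before the update.
    """
    pred = project(frames, weights)
    scale = 2.0 / (len(frames) * len(weights))
    for j, row in enumerate(weights):
        for i in range(len(row)):
            grad = scale * sum((pred[n][j] - targets[n][j]) * frames[n][i]
                               for n in range(len(frames)))
            row[i] -= lr * grad
    return mse(pred, targets)

rng = random.Random(0)

# Stage 1 (assumed form): no LLM forward pass, only a frozen embedding table.
embedding_table = [[rng.uniform(-1, 1) for _ in range(D_LLM)]
                   for _ in range(VOCAB)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(D_SPEECH)]
           for _ in range(D_LLM)]

# Toy batch: three speech frames assumed pre-aligned with three tokens.
frames = [[rng.uniform(-1, 1) for _ in range(D_SPEECH)] for _ in range(3)]
token_ids = [1, 4, 2]
targets = [embedding_table[t] for t in token_ids]

losses = [pretrain_step(frames, targets, weights) for _ in range(50)]

# Stage 2 (not trained here): projected speech vectors are placed next to
# prompt embeddings to form the sequence the LLM would consume.
prompt_embeds = [embedding_table[0]]
llm_input = prompt_embeds + project(frames, weights)
```

In this sketch only `weights` changes during stage 1, which is what would let that stage run without loading the full LLM. The brief instruction-tuning stage described in the abstract would then start from the pretrained projector.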
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
