Contrastive Learning for Task-Independent SpeechLLM-Pretraining
Maike Züfle, Jan Niehues

TL;DR
This paper introduces a scalable two-stage training approach for speech LLMs: a task-independent pretraining stage that uses contrastive learning to align speech and text representations, followed by task-specific fine-tuning that needs only minimal data.
Contribution
It presents a novel contrastive learning-based pretraining approach for speech that outperforms traditional ASR pretraining and reduces the amount of task-specific data needed for fine-tuning.
Findings
Outperforms traditional ASR pretraining methods.
Surpasses models specialized in speech translation and question answering while being trained on only 10% of the task-specific data.
Enables effective task adaptation with minimal fine-tuning.
Abstract
Large language models (LLMs) excel in natural language processing but adapting these LLMs to speech processing tasks efficiently is not straightforward. Direct task-specific fine-tuning is limited by overfitting risks, data requirements, and computational costs. To address these challenges, we propose a scalable, two-stage training approach: (1) A task-independent speech pretraining stage using contrastive learning to align text and speech representations over all layers, followed by (2) a task-specific fine-tuning stage requiring minimal data. This approach outperforms traditional ASR pretraining and enables the model to surpass models specialized on speech translation and question answering while being trained on only 10% of the task-specific data.
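The alignment idea in stage (1) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one pooled embedding per utterance and per layer, a symmetric InfoNCE-style loss with cosine similarity, and a simple average over layers. The names `info_nce`, `all_layer_loss`, and the `temperature` value are hypothetical and chosen only for illustration.

```python
import numpy as np


def info_nce(speech, text, temperature=0.1):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    speech, text: arrays of shape (batch, dim); row i of each is a matched pair.
    Assumption: one pooled vector per utterance (not taken from the paper).
    """
    # L2-normalise so that the dot product equals cosine similarity.
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The matched pair sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the speech-to-text and text-to-speech directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def all_layer_loss(speech_layers, text_layers, temperature=0.1):
    """Average the contrastive loss over per-layer embeddings (an assumption)."""
    losses = [info_nce(s, t, temperature) for s, t in zip(speech_layers, text_layers)]
    return float(np.mean(losses))
```

In this sketch, speech embeddings that sit close to their paired text embeddings at every layer yield a low loss, while mismatched pairs yield a high one. How the paper pools, weights, or measures similarity across layers may differ.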
Taxonomy
Topics: Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning · Speech Recognition and Synthesis
Methods: Contrastive Learning · ALIGN
