Contrastive Learning for Task-Independent SpeechLLM-Pretraining

Maike Züfle; Jan Niehues

arXiv:2412.15712·cs.CL·June 2, 2025

TL;DR

This paper introduces a scalable two-stage training method for speech LLMs: a task-independent pretraining stage that uses contrastive learning to align speech and text representations, followed by task-specific fine-tuning that needs only a small amount of data.

Contribution

The paper presents a contrastive-learning-based speech pretraining approach that transfers across tasks and needs less task-specific data than traditional ASR pretraining.

Findings

01 Outperforms traditional ASR pretraining.

02 Surpasses models specialized in speech translation and question answering while using only 10% of the task-specific data.

03 Enables effective task adaptation with minimal fine-tuning.

Abstract

Large language models (LLMs) excel in natural language processing but adapting these LLMs to speech processing tasks efficiently is not straightforward. Direct task-specific fine-tuning is limited by overfitting risks, data requirements, and computational costs. To address these challenges, we propose a scalable, two-stage training approach: (1) A task-independent speech pretraining stage using contrastive learning to align text and speech representations over all layers, followed by (2) a task-specific fine-tuning stage requiring minimal data. This approach outperforms traditional ASR pretraining and enables the model to surpass models specialized on speech translation and question answering while being trained on only 10% of the task-specific data.
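
The abstract's first stage, contrastive alignment of speech and text representations over all layers, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a generic InfoNCE-style loss over cosine similarities, one pooled vector per utterance per layer, a temperature of 0.1, and a plain mean over layers. The function names (`cosine`, `info_nce`, `all_layer_loss`) are hypothetical, and none of these choices are stated on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(speech, text, temperature=0.1):
    """InfoNCE-style loss for one layer (assumed loss, not taken from the paper).

    speech[i] and text[i] are pooled embeddings of the same utterance;
    every other text[j] in the batch serves as a negative for speech[i].
    """
    n = len(speech)
    total = 0.0
    for i in range(n):
        logits = [cosine(speech[i], text[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching text.
        total += log_z - logits[i]
    return total / n


def all_layer_loss(speech_layers, text_layers, temperature=0.1):
    """Average the per-layer loss over all layers (averaging is an assumption)."""
    per_layer = [
        info_nce(s, t, temperature) for s, t in zip(speech_layers, text_layers)
    ]
    return sum(per_layer) / len(per_layer)


if __name__ == "__main__":
    # Toy batch: two utterances, two layers, 2-dimensional embeddings.
    text = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]
    swapped = aligned[::-1]  # speech paired with the wrong transcripts
    print("aligned loss:", all_layer_loss([aligned, aligned], [text, text]))
    print("swapped loss:", all_layer_loss([swapped, swapped], [text, text]))
```

In this toy batch, correctly paired speech and text embeddings give a much smaller loss than swapped pairs, which is the signal a contrastive pretraining stage would minimize before any task-specific fine-tuning.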

Code & Models

Repositories

maikezuefle/contr-pretraining
PyTorch · Official

Videos

Contrastive Learning for Task-Independent SpeechLLM-Pretraining · Underline

Taxonomy

Topics: Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning · Speech Recognition and Synthesis

Methods: Contrastive Learning · ALIGN