SpeechVerse: A Large-scale Generalizable Audio Language Model

Nilaksh Das; Saket Dingliwal; Srikanth Ronanki; Rohit Paturi; Zhaocheng Huang; Prashant Mathur; Jie Yuan; Dhanush Bekal; Xing Niu; Sai Muralidhar Jayanthi; Xilai Li; Karel Mundnich; Monica Sunkara; Sravan Bodapati; Sundararajan Srinivasan; Kyu J Han; Katrin Kirchhoff

arXiv:2405.08295 · cs.CL · March 26, 2025 · 3 cites

Open Access

TL;DR

SpeechVerse is a large-scale, multi-task audio language model that leverages pre-trained speech and text models with a curriculum learning approach, achieving superior zero-shot performance across diverse speech tasks.

Contribution

It introduces a novel framework that combines frozen pre-trained models with learnable parameters for multi-task learning and instruction tuning in speech processing.

Findings

01. Outperforms traditional baselines on 9 of 11 tasks

02. Demonstrates strong zero-shot generalization to out-of-domain data

03. Effective multi-task training with minimal additional parameters

Abstract

Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform…
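
The combination described in the abstract (frozen speech and text models joined by a small learnable component) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the dimensions, the random weights, the frame-stacking linear adapter, and the choice to place audio embeddings before the instruction embeddings are hypothetical stand-ins, not the module or layout used in the paper.

```python
import random

random.seed(0)

AUDIO_DIM, LLM_DIM, STACK = 4, 6, 2  # hypothetical toy sizes


def make_matrix(rows, cols):
    """Random matrix as a list of rows (stand-in for real weights)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Frozen components: stand-ins for the pre-trained speech and text models.
# Nothing in this sketch ever updates them.
frozen_audio_encoder = make_matrix(AUDIO_DIM, 3)  # 3-dim raw frame -> AUDIO_DIM
frozen_text_embeddings = {
    token: [random.uniform(-0.1, 0.1) for _ in range(LLM_DIM)]
    for token in ["transcribe", "the", "audio"]
}

# Learnable component: the only weights a training loop would update.
adapter = make_matrix(LLM_DIM, AUDIO_DIM * STACK)


def encode_audio(frames):
    """Continuous latent features from the frozen speech encoder."""
    return [matvec(frozen_audio_encoder, frame) for frame in frames]


def adapt(features):
    """Stack STACK consecutive frames and project them to the LLM width."""
    projected = []
    for start in range(0, len(features) - STACK + 1, STACK):
        stacked = [x for frame in features[start:start + STACK] for x in frame]
        projected.append(matvec(adapter, stacked))
    return projected


def build_llm_input(frames, instruction):
    """Concatenate adapted audio embeddings with instruction embeddings."""
    audio_embeds = adapt(encode_audio(frames))
    text_embeds = [frozen_text_embeddings[tok] for tok in instruction.split()]
    return audio_embeds + text_embeds


frames = [[random.random() for _ in range(3)] for _ in range(8)]
llm_input = build_llm_input(frames, "transcribe the audio")
trainable_params = sum(len(row) for row in adapter)
```

In this toy setup, eight audio frames are reduced to four embeddings by the adapter and joined with three instruction-token embeddings, so the sequence handed to the language model has seven vectors of width `LLM_DIM`. Only `adapter` would receive gradient updates; the encoder and the text embeddings stay fixed.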

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training