SpeechVerse: A Large-scale Generalizable Audio Language Model
Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sravan Bodapati, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

TL;DR
SpeechVerse is a large-scale, multi-task audio language model that leverages pre-trained speech and text models with a curriculum learning approach, achieving superior zero-shot performance across diverse speech tasks.
Contribution
It introduces a novel framework that combines frozen pre-trained models with learnable parameters for multi-task learning and instruction tuning in speech processing.
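The idea of keeping pre-trained modules frozen while training only a small set of parameters can be illustrated with a minimal, framework-free sketch. Everything here is a hypothetical toy: the `Module` class, the `sgd_step` and `trainable_fraction` helpers, and the module sizes are assumptions made for illustration, not the paper's architecture or parameter counts.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """A named group of weights; trainable=False marks it as frozen."""
    name: str
    weights: list
    trainable: bool = False


def sgd_step(modules, grads, lr=0.1):
    """Apply one SGD update, skipping every frozen module."""
    for m in modules:
        if not m.trainable:
            continue  # frozen: pre-trained weights stay untouched
        g = grads[m.name]
        m.weights = [w - lr * gi for w, gi in zip(m.weights, g)]


def trainable_fraction(modules):
    """Share of all weights that receive gradient updates."""
    total = sum(len(m.weights) for m in modules)
    trainable = sum(len(m.weights) for m in modules if m.trainable)
    return trainable / total


# Hypothetical toy sizes (assumption): two frozen models and a small adapter.
speech_encoder = Module("speech_encoder", [0.5] * 40)
adapter = Module("adapter", [0.1] * 4, trainable=True)
llm = Module("llm", [0.2] * 56)
modules = [speech_encoder, adapter, llm]

# Pretend every weight received a gradient of 1.0.
grads = {m.name: [1.0] * len(m.weights) for m in modules}
sgd_step(modules, grads)

print(trainable_fraction(modules))
```

In this toy setup only the adapter's weights change after the update, while the speech encoder and language model keep their original values. This mirrors the general pattern of joining frozen foundation models through a small learnable component.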
Findings
Outperforms traditional baselines on 9 of 11 tasks
Demonstrates strong zero-shot generalization to out-of-domain data
Effective multi-task training with minimal additional parameters
Abstract
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Sparse Evolutionary Training
