DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

Ke-Han Lu; Zhehuai Chen; Szu-Wei Fu; Chao-Han Huck Yang; Jagadeesh Balam; Boris Ginsburg; Yu-Chiang Frank Wang; Hung-yi Lee

arXiv:2409.20007 · eess.AS · July 30, 2025

TL;DR

DeSTA2 introduces a method to enhance speech language models with paralinguistic understanding and instruction-following abilities without requiring speech instruction-tuning data, reducing annotation efforts and preserving language skills.

Contribution

The paper presents an automatic data creation process that injects speech understanding into SLMs while preserving their language capabilities, so the resulting model can handle speech tasks without speech instruction-tuning data.
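To make the idea of automatically created speech-text pairs concrete, here is a minimal sketch of one plausible reading. This page does not describe the actual pipeline, so everything in the block is an assumption made for illustration: the record fields (`audio_path`, `transcript`, `attributes`), the prompt wording, the description format, and the `echo_llm` stand-in that replaces a real text LLM. The sketch turns a clip's transcript and paralinguistic labels into a text description, asks a text generator to respond to it, and stores that response as the training target for the clip.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpeechTextPair:
    audio_path: str  # speech clip the model will hear
    prompt: str      # text prompt paired with the clip
    target: str      # text the speech language model is trained to produce


def describe(transcript: str, attributes: Dict[str, str]) -> str:
    """Render a transcript plus paralinguistic labels as one text description.

    The bracketed format is hypothetical, chosen only for readability.
    """
    labels = ", ".join(f"{key}: {value}" for key, value in sorted(attributes.items()))
    return f'[{labels}] "{transcript}"' if labels else f'"{transcript}"'


def build_pairs(
    records: List[dict],
    generate: Callable[[str], str],
    prompt: str = "What can you hear from the audio?",  # assumed wording
) -> List[SpeechTextPair]:
    """Create one speech-text pair per record without human-written answers."""
    pairs = []
    for record in records:
        description = describe(record["transcript"], record.get("attributes", {}))
        # The text generator sees only the description, never the audio.
        target = generate(f"{description}\n{prompt}")
        pairs.append(SpeechTextPair(record["audio_path"], prompt, target))
    return pairs


def echo_llm(text: str) -> str:
    """Stand-in for a real text LLM so the sketch runs offline."""
    return "I hear: " + text.splitlines()[0]
```

A usage example under the same assumptions:

```python
records = [
    {
        "audio_path": "clip_001.wav",
        "transcript": "hello",
        "attributes": {"emotion": "happy", "gender": "female"},
    }
]
pairs = build_pairs(records, echo_llm)
print(pairs[0].target)
```

In a real setting `generate` would wrap a text LLM, and no manually written speech instruction data would be needed to produce the targets.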

Findings

01 Achieves strong performance on the Dynamic-SUPERB and AIR-Bench-Chat benchmarks.

02 Demonstrates the ability to follow complex instructions and perform reasoning tasks.

03 Reduces reliance on extensive annotated datasets.

Abstract

Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the…

Code & Models

Repositories

kehanlu/DeSTA2
pytorch

Models

DeSTA-ntu/DeSTA2-8B-beta
model · 77 dl · ♡ 8

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis