VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang,, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper introduces VoiceTextBlender, a single-stage joint speech-text supervised fine-tuning method for large language models, enhancing speech capabilities while maintaining text-only performance and enabling multi-turn, mixed-modal interactions.
Contribution
It proposes a novel joint speech-text fine-tuning approach using LoRA, improving speech task performance and emergent multi-modal abilities without multi-stage training.
Findings
Outperforms previous SpeechLMs on various benchmarks.
Maintains original text-only task performance.
Handles unseen prompts and multi-turn, mixed-modal inputs effectively.
Abstract
Recent studies have augmented large language models (LLMs) with speech capabilities, leading to the development of speech language models (SpeechLMs). Earlier SpeechLMs focused on single-turn speech-based question answering (QA), where user input comprised a speech context and a text question. More recent studies have extended this to multi-turn conversations, though they often require complex, multi-stage supervised fine-tuning (SFT) with diverse data. Another critical challenge with SpeechLMs is catastrophic forgetting, where models optimized for speech tasks suffer significant degradation in text-only performance. To mitigate these issues, we propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the LLM backbone. Our joint SFT combines text-only SFT data with three types of speech-related data: speech recognition and translation,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsSpeech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
MethodsShrink and Fine-Tune
