VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Yifan Peng; Krishna C. Puvvada; Zhehuai Chen; Piotr Zelasko; He Huang; Kunal Dhawan; Ke Hu; Shinji Watanabe; Jagadeesh Balam; Boris Ginsburg

arXiv:2410.17485·cs.CL·February 10, 2025

TL;DR

This paper introduces VoiceTextBlender, a single-stage joint speech-text supervised fine-tuning method for large language models, enhancing speech capabilities while maintaining text-only performance and enabling multi-turn, mixed-modal interactions.

Contribution

The paper proposes a single-stage joint speech-text SFT approach applied to LoRA adapters on the LLM backbone. It improves speech-task performance and yields emergent mixed-modal abilities without multi-stage training.
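To make the LoRA idea concrete, here is a minimal sketch of a single linear layer with a low-rank update, written in plain Python. It shows the standard LoRA formulation, y = W x + (alpha / r) B (A x), where W stays frozen and only A and B are trained. The function names, shapes, and the `alpha` value are illustrative assumptions and are not taken from the paper or its repository.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Apply a frozen linear layer plus a scaled low-rank update.

    Assumed shapes (hypothetical, for illustration only):
      w_frozen: d_out x d_in   (never updated during fine-tuning)
      lora_a:   r x d_in       (trainable)
      lora_b:   d_out x r      (trainable, usually zero-initialized)
    """
    rank = len(lora_a)
    scale = alpha / rank
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


# With lora_b set to zeros, the layer reproduces the frozen model exactly.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]            # rank r = 1
b_zero = [[0.0], [0.0]]
print(lora_forward([2.0, 3.0], w, a, b_zero, alpha=2.0))
```

Because the update starts at zero and the base weights never change, removing or disabling the adapter restores the original model's behavior, which is one reason adapter-based tuning is a natural fit when text-only performance must be preserved.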

Findings

01. Outperforms previous SpeechLMs on various benchmarks.

02. Maintains original text-only task performance.

03. Handles unseen prompts and multi-turn, mixed-modal inputs effectively.

Abstract

Recent studies have augmented large language models (LLMs) with speech capabilities, leading to the development of speech language models (SpeechLMs). Earlier SpeechLMs focused on single-turn speech-based question answering (QA), where user input comprised a speech context and a text question. More recent studies have extended this to multi-turn conversations, though they often require complex, multi-stage supervised fine-tuning (SFT) with diverse data. Another critical challenge with SpeechLMs is catastrophic forgetting, where models optimized for speech tasks suffer significant degradation in text-only performance. To mitigate these issues, we propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the LLM backbone. Our joint SFT combines text-only SFT data with three types of speech-related data: speech recognition and translation,…
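The single-stage joint SFT described above can be pictured as drawing every training batch from a mixture of data sources rather than training on them in separate stages. The sketch below is one possible way to build such a schedule. The source names and the sampling weights are hypothetical placeholders: the abstract is truncated here, so the remaining speech-related data types and the actual mixing ratios are not shown, and nothing below should be read as the authors' implementation.

```python
import random


def build_mixed_schedule(weights, num_batches, seed=0):
    """Pick a data source for each batch by weighted sampling.

    weights: dict mapping a source name to a non-negative sampling weight.
    Returns a list of source names, one per training batch.
    """
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    names = list(weights)
    probs = [weights[name] for name in names]
    return [rng.choices(names, weights=probs, k=1)[0] for _ in range(num_batches)]


# Hypothetical mixture: names and weights are assumptions for illustration.
mixture = {
    "text_only_sft": 0.5,
    "speech_recognition_and_translation": 0.5,
    # other speech-related data types from the paper would be added here
}
schedule = build_mixed_schedule(mixture, num_batches=8)
print(schedule)
```

Sampling per batch keeps text-only and speech examples interleaved throughout training, which is the property a single-stage recipe relies on to avoid drifting away from text-only behavior.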

Code & Models

Repositories

pyf98/NeMo_VoiceTextBlender
jax · Official

Videos

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning · underline

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Shrink and Fine-Tune