Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Maohao Shen; Shun Zhang; Jilong Wu; Zhiping Xiu; Ehab AlBadawy; Yiting Lu; Mike Seltzer; Qing He

arXiv:2410.20336·cs.CL·October 29, 2024

Open Access

TL;DR

This paper introduces MoLE-Llama, a text-and-speech multimodal LLM built through late-fusion fine-tuning on top of TTS-Llama, a Llama-based text-to-speech system with state-of-the-art synthesis quality. MoLE-Llama performs competitively on both text-only question answering and speech synthesis, and is further explored on text-in-speech-out question answering.

Contribution

The work presents MoLE-Llama, a multimodal LLM that combines speech and text tasks using late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture, advancing speech generation and multimodal dialogue.

Findings

01

MoLE-Llama achieves competitive speech synthesis performance.

02

It effectively handles question-answering in text and speech modalities.

03

It mitigates catastrophic forgetting across modalities.

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating catastrophic forgetting in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA