Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
Maohao Shen, Shun Zhang, Jilong Wu, Zhiping Xiu, Ehab AlBadawy, Yiting Lu, Mike Seltzer, Qing He

TL;DR
This paper introduces MoLE-Llama, a multimodal LLM that integrates speech and text tasks through late-fusion parameter-efficient fine-tuning. It builds on TTS-Llama, a fine-tuned Llama text-to-speech system with state-of-the-art speech synthesis, and remains competitive on both text-only question answering and TTS.
Contribution
The work presents MoLE-Llama, a multimodal LLM that combines speech and text tasks using parameter-efficient late-fusion fine-tuning and a mixture-of-expert architecture, advancing speech generation and multimodal dialogue.
Findings
MoLE-Llama achieves competitive speech synthesis performance.
It effectively handles question-answering in text and speech modalities.
It mitigates catastrophic forgetting across modalities.
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating…
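The abstract names a mixture-of-expert architecture trained with PEFT but does not say how the experts are built or routed. As a rough illustration only, the sketch below shows one generic way such a layer can look: a frozen linear layer plus a softly gated sum of low-rank (LoRA-style) adapters. The class name `MixtureOfLoRAExperts`, the soft softmax router, and all dimensions are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRAExperts:
    """Hypothetical layer: frozen base weight plus gated low-rank experts."""

    def __init__(self, d_in, d_out, rank, n_experts, seed=0):
        rng = random.Random(seed)
        self.w = rand_matrix(d_out, d_in, rng)           # frozen pretrained weight
        self.router = rand_matrix(n_experts, d_in, rng)  # trainable gate (assumed)
        # Each expert is a pair (A, B). B starts at zero, as in LoRA, so the
        # layer initially reproduces the frozen base output exactly.
        self.experts = [
            (rand_matrix(rank, d_in, rng), [[0.0] * rank for _ in range(d_out)])
            for _ in range(n_experts)
        ]

    def gates(self, x):
        """Soft routing weights over experts for input x."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        """Compute W x + sum_k g_k(x) * B_k (A_k x)."""
        out = matvec(self.w, x)
        for g, (a, b) in zip(self.gates(x), self.experts):
            delta = matvec(b, matvec(a, x))
            out = [o + g * d for o, d in zip(out, delta)]
        return out
```

Because every `B` matrix starts at zero, the adapted layer matches the frozen base layer before any training. That property is one common reason adapter-style fine-tuning is associated with reduced forgetting, although the paper's own mechanism may differ.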
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: LLaMA
