LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Qingkai Fang; Shoutao Guo; Yan Zhou; Zhengrui Ma; Shaolei Zhang; Yang Feng

arXiv:2409.06666·cs.CL·March 4, 2025·2 cites

TL;DR

LLaMA-Omni is a low-latency, open-source speech interaction model that generates text and speech responses directly from speech input, with no intermediate transcription step.

Contribution

The paper presents LLaMA-Omni, an architecture that combines a pretrained speech encoder, a speech adaptor, an open-source LLM, and a streaming speech decoder so that spoken instructions are answered directly in text and speech.

Findings

01. Outperforms previous models in response quality and style.
02. Achieves response latency as low as 226ms.
03. Training completes in less than 3 days on 4 GPUs.

Abstract

Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which…
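The four components named in the abstract can be read as a simple dataflow. The sketch below is illustrative only: the class, the callables, and the toy stand-ins are hypothetical names, and the choice to feed the decoder a per-step state from the LLM is an assumption made for the example, not something the abstract specifies. It shows only that text and speech output can be yielded together without a transcription step in between.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple

Frames = List[List[float]]  # a sequence of feature vectors


@dataclass
class SpeechInteractionPipeline:
    """Hypothetical wiring of the four stages named in the abstract."""
    encode_speech: Callable[[Sequence[float]], Frames]               # pretrained speech encoder
    adapt: Callable[[Frames], Frames]                                # speech adaptor
    generate: Callable[[Frames], Iterator[Tuple[str, List[float]]]]  # LLM
    decode_speech: Callable[[List[float]], List[int]]                # streaming speech decoder

    def respond(self, waveform: Sequence[float]) -> Iterator[Tuple[str, List[int]]]:
        # Speech goes straight to LLM inputs; no transcript is produced.
        llm_inputs = self.adapt(self.encode_speech(waveform))
        # Assumption: the LLM exposes a per-step state the decoder can consume,
        # so text and speech output are emitted side by side.
        for token, state in self.generate(llm_inputs):
            yield token, self.decode_speech(state)


# Toy stand-ins so the sketch runs; real models would replace all four.
def toy_encoder(waveform: Sequence[float]) -> Frames:
    return [list(waveform[i:i + 4]) for i in range(0, len(waveform), 4)]


def toy_adaptor(frames: Frames) -> Frames:
    return [[sum(f) / len(f)] for f in frames]


def toy_llm(frames: Frames) -> Iterator[Tuple[str, List[float]]]:
    for i, frame in enumerate(frames):
        yield f"tok{i}", frame


def toy_decoder(state: List[float]) -> List[int]:
    return [int(value * 2) for value in state]


pipeline = SpeechInteractionPipeline(toy_encoder, toy_adaptor, toy_llm, toy_decoder)
outputs = list(pipeline.respond([1, 2, 3, 4, 5, 6, 7, 8]))
```

Because `respond` is a generator, a caller can start consuming speech output after the first step instead of waiting for the full response, which is the property the low-latency claim depends on.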

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 5

Strengths

This is probably the first public work that connects a speech-to-text Speech-LLM with a streaming text-to-speech model so as to enable low latency speech outputs.

Weaknesses

1. The speech-to-text component of this model is not new (relevant works like SALMONN and Qwen2-audio) and the main addition is the text-to-speech component. The latter has a clear connection with the streaming TTS model research but the paper does not spend enough content to acknowledge and compare to this line of research. To elaborate on this, the paper should compare the proposed method with prior research in this space and motivate this unique CTC based design. The paper should also replace

Reviewer 02·Rating 6·Confidence 4

Strengths

1. **Conversational Data with InstructS2S-200K:** The proposed InstructS2S-200K dataset addresses a notable gap in conversational, oral-style speech instruction data. By providing a large, purpose-built dataset tailored to natural, interactive speech patterns, this work supports more effective alignment of LLM responses with human conversational norms.
2. **Cost-Effective Training Process:** The two-stage training process adopted in LLaMA-Omni demonstrates a practical approach to reducing train

Weaknesses

1. **Lack of Novelty:** While the authors suggest that they proposed a "novel model architecture," it appears to primarily build on a combination of existing methods without a clear breakthrough in architecture:
   1) The connection of a speech encoder to an LLM via a speech adaptor has been widely explored in prior research.
   2) The streaming TTS module seems to be directly based on previous designs, specifically following the approach outlined in [1].

   In summary, while LLaMA-Omni

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper is well-written and presents a promising approach to seamless speech interaction with LLMs. The reported response latency of 226ms is impressive and demonstrates the value for real-world applications.
2. The authors have made a valuable contribution by creating the InstructS2S-200K dataset, which is tailored to speech interaction scenarios. The core motivation for building this dataset is that "in speech interactions, concise yet informative responses are typically preferred", whic

Weaknesses

1. Although this work presents a seamless spoken dialogue model, the choice of connecting an LLM with a CTC-based streaming TTS module is not well-motivated.
   - Autoregressive TTS models naturally support streaming decoding, but this paper does not discuss or compare with the traditional autoregressive decoding.
   - The CTC-based streaming TTS module incorporates ideas from StreamSpeech [1] and achieves streaming output by segmenting generated units into chunks. However, it is not evident w
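For readers unfamiliar with the mechanism the reviewers discuss, the following is a minimal sketch of standard greedy CTC collapsing (merge repeated labels, drop blanks) combined with fixed-size chunking of the resulting units. It is not the paper's decoder: the blank index, the chunk size, and the function names are all assumptions chosen for illustration, and in a real system each chunk would be handed to some downstream synthesizer.

```python
from typing import Iterable, Iterator, List

BLANK = 0  # assumed index of the CTC blank label


def ctc_collapse(frame_labels: Iterable[int], blank: int = BLANK) -> List[int]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    units: List[int] = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            units.append(label)
        prev = label
    return units


def stream_chunks(
    frame_labels: Iterable[int], chunk_size: int, blank: int = BLANK
) -> Iterator[List[int]]:
    """Emit collapsed units in chunks of `chunk_size` as soon as each fills up."""
    buffer: List[int] = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            buffer.append(label)
            if len(buffer) == chunk_size:
                yield buffer  # a chunk is ready before the input has ended
                buffer = []
        prev = label
    if buffer:
        yield buffer  # flush the final, possibly shorter, chunk


# Hypothetical per-frame argmax labels; a blank between the two 7s keeps both.
frames = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5, 5, 2]
collapsed = ctc_collapse(frames)
chunks = list(stream_chunks(frames, chunk_size=2))
```

A smaller `chunk_size` lets the first chunk leave earlier, at the cost of more, shorter synthesis calls; that trade-off is the kind of design choice the reviewers ask the authors to motivate.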

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: ALIGN