Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Shaolei Zhang; Shoutao Guo; Qingkai Fang; Yan Zhou; Yang Feng

arXiv:2506.13642·cs.AI·June 24, 2025

TL;DR

Stream-Omni is a large language-vision-speech model that aligns text, vision, and speech efficiently, supporting interaction under various modality combinations while relying less on large-scale training data.

Contribution

The paper proposes a layer-dimension mapping approach to modality alignment, which reduces data requirements and broadens the supported multimodal interactions.

Findings

01 Strong performance on visual understanding tasks

02 Effective speech interaction and vision-grounded speech tasks

03 Supports intermediate text outputs during speech interaction

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate the representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is a straightforward way to integrate modalities, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the…
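As a rough illustration of the baseline the abstract contrasts against, sequence-dimension concatenation can be sketched as joining each modality's token embeddings end to end before they reach the backbone. This is a toy sketch, not Stream-Omni's method or code: the helper name `concat_along_sequence`, the variable names, and the tiny hidden size are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def concat_along_sequence(*modalities: List[Vector]) -> List[Vector]:
    """Join per-modality token embeddings into one sequence (hypothetical helper)."""
    # Every embedding must share the backbone's hidden size to be concatenated.
    hidden_sizes = {len(vec) for seq in modalities for vec in seq}
    if len(hidden_sizes) > 1:
        raise ValueError("all embeddings must share the same hidden size")
    merged: List[Vector] = []
    for seq in modalities:
        merged.extend(seq)  # append along the sequence axis, order preserved
    return merged


# Toy embeddings with hidden size 2 (illustrative values only).
vision_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 image patches
speech_tokens = [[0.7, 0.8], [0.9, 1.0]]              # 2 speech frames
text_tokens = [[1.1, 1.2]]                            # 1 text token

backbone_input = concat_along_sequence(vision_tokens, speech_tokens, text_tokens)
print(len(backbone_input))  # sequence length is 3 + 2 + 1
```

In this scheme the backbone sees one long sequence and has to learn from data which positions in different modalities correspond to each other, which is consistent with the abstract's remark that the approach leans on large-scale data.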

Code & Models

Models

ICTNLP/stream-omni-8b (model, 8 downloads, 49 likes)

Datasets

echodict/StreamSpeech (dataset, 1.1k downloads)
