VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu; Haojia Lin; Xiong Wang; Yi-Fan Zhang; Yunhang Shen; Xiaoyu Liu; Haoyu Cao; Zuwei Long; Heting Gao; Ke Li; Long Ma; Xiawu Zheng; Rongrong Ji; Xing Sun; Caifeng Shan; Ran He

arXiv:2501.01957 · cs.CV · October 27, 2025 · 3 cites

TL;DR

VITA-1.5 introduces a multi-stage training approach for multimodal large language models, enabling near real-time vision and speech interaction without separate speech modules, advancing multimodal dialogue systems.

Contribution

It presents a novel multi-stage training methodology that integrates visual and speech understanding in LLMs, achieving efficient real-time multimodal interaction.

Findings

01. Outperforms state-of-the-art on image, video, and speech benchmarks.
02. Enables speech-to-speech dialogue without separate ASR and TTS modules.
03. Achieves near real-time vision and speech interaction.

Abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating end-to-end multimodal response speed. By comparing our method against…

Code & Models

Repositories

VITA-MLLM/VITA
PyTorch · Official

Models

🤗 VITA-MLLM/VITA-1.5
model · 71 dl · ♡ 50

Taxonomy

Topics: Neural Networks and Applications · Advanced Data Compression Techniques · Image Retrieval and Classification Techniques