OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

Run Luo; Ting-En Lin; Haonan Zhang; Yuchuan Wu; Xiong Liu; Min Yang; Yongbin Li; Longze Chen; Jiaming Li; Lei Zhang; Xiaobo Xia; Hamid Alinejad-Rokny; Fei Huang

arXiv:2501.04561 · cs.CL · September 24, 2025

TL;DR

OpenOmni introduces a novel two-stage training framework for open-source omnimodal large language models, achieving state-of-the-art performance in multimodal understanding and real-time emotional speech synthesis with fewer resources.

Contribution

The paper presents a training approach that enables near-zero-shot generalization from vision to speech and real-time emotional speech synthesis in open-source models.

Findings

01 Outperforms state-of-the-art models on multiple benchmarks.

02 Achieves a 4-point improvement on OmniBench with fewer training samples.

03 Enables real-time speech synthesis with under 1 s latency and 5x faster inference.

Abstract

Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model undergoes further training on text-image tasks, enabling (near) zero-shot generalization from vision to speech, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference…

Code & Models

Repositories

rainbowluocs/openomni (PyTorch, official)

Models

Tongyi-ConvAI/OpenOmni (Hugging Face model)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling