HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

Yicheng Xiao; Lin Song; Rui Yang; Cheng Cheng; Zunnan Xu; Zhaoyang Zhang; Yixiao Ge; Xiu Li; Ying Shan

arXiv:2506.02975 · cs.CV · June 4, 2025

Open Access · 1 model

TL;DR

HaploOmni is a single unified transformer for multimodal image and video understanding and generation. It combines a multimodal warmup strategy with feature pre-scaling and multimodal AdaLN, and achieves competitive results at limited training cost.

Contribution

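The paper introduces HaploOmni, a single multimodal transformer trained with a multimodal warmup strategy that uses prior knowledge to extend its capabilities. Feature pre-scaling and multimodal AdaLN address cross-modal compatibility, enabling efficient training across understanding and generation tasks.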

Findings

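01 Achieves competitive performance on multiple image and video understanding and generation benchmarks compared with advanced unified models.

02 Reaches these results with limited training cost.

03 Introduces feature pre-scaling and multimodal AdaLN to address cross-modal compatibility.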

Abstract

With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separate components to unified single-model frameworks. This paper explores an efficient training paradigm for building a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy that uses prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating these techniques, we present HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks compared with advanced unified models. All code will be made public at…
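The abstract names multimodal AdaLN without describing its formulation. For readers unfamiliar with the underlying idea, the sketch below shows only the generic adaptive LayerNorm operation as it is commonly used in conditioned transformers: normalize the features, then modulate them with a scale and shift derived from a conditioning signal. It is not the paper's multimodal variant. The function names `layer_norm` and `adaln`, and passing `scale` and `shift` in directly rather than predicting them with a small network, are assumptions made for illustration.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance.

    No learned affine parameters are applied here; in adaptive LayerNorm
    the affine part comes from the conditioning signal instead.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift, eps=1e-6):
    """Generic adaptive LayerNorm: LN(x) * (1 + scale) + shift.

    `scale` and `shift` are hypothetical inputs; in practice they would be
    produced per feature from a conditioning embedding.
    """
    normed = layer_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


features = [1.0, 2.0, 3.0]
# Zero modulation leaves the normalized features unchanged.
plain = adaln(features, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
# Non-zero modulation rescales and offsets each normalized feature.
modulated = adaln(features, [1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

With zero scale and shift the operation reduces to plain LayerNorm, which is why conditioned transformers often initialize the modulation near zero. How HaploOmni adapts this to multiple modalities is specified in the paper itself, not in this sketch.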

Code & Models

Models

🤗 EasonXiao-888/HaploOmni (model · 118 downloads · 2 likes)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition · Video Analysis and Summarization