Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Jialun Liu; Tian Li; Xiao Cao; Yukuo Ma; Gonghu Shang; Haibin Huang; Chi Zhang; Xiangzhen Chang; Zhiyong Huang; Jiakui Hu; Zuoxin Li; Yuanzhi Liang; Cong Liu; Junqi Liu; Robby T. Tan; Haitong Tang; Qizhen Weng; Yifan Xu; Liying Yang; Xiaoyan Yang; Peng Yu; Shiwen Zhang; Xuelong Li

arXiv:2602.09609 · cs.CV · February 24, 2026

Open Access

TL;DR

Tele-Omni is a multimodal framework that unifies video generation and editing tasks in a single model, using structured instructions and diffusion models to provide flexible control and high-quality outputs.

Contribution

The paper presents Tele-Omni, a unified multimodal framework that handles diverse video generation and editing tasks with a single model, combining instruction parsing by pretrained multimodal large language models with diffusion-based synthesis.

Findings

01 Achieves competitive performance across multiple video tasks

02 Supports multimodal inputs including text, images, and reference videos

03 Maintains high temporal coherence and visual consistency

Abstract

Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation