CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Hongji Yang; Songlian Li; Yucheng Zhou; Xiaotong Zhao; Alan Zhao; Chengzhong Xu; Jianbing Shen

arXiv:2605.19995·cs.CV·May 20, 2026

TL;DR

CogOmniControl introduces a reasoning-driven framework for controllable video generation that interprets user creative intent from sparse conditions and outperforms existing open-source models on professional benchmarks.

Contribution

The paper develops a specialized vision-language model trained on anime production data and a unified control framework that improves alignment with creative intent and integrates multiple control signals.

Findings

01. Outperforms existing open-source models on professional benchmarks.

02. Generates more professional and clear videos from sparse or abstract conditions.

03. Successfully integrates reasoning and control for improved video generation quality.

Abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse, or complex conditions, which leads to poor performance in professional production workflows that rely on inputs such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and…

Code & Models

Repositories

https://um-lab.github.io/CogOmniControl
