TL;DR
CogOmniControl introduces a reasoning-driven framework for controllable video generation that accurately interprets user creative intent from sparse conditions and outperforms existing open-source models on professional benchmarks.
Contribution
It develops a specialized anime-trained vision-language model and a unified control framework that enhances alignment with creative intent and integrates multiple control signals.
Findings
Outperforms existing open-source models on professional benchmarks.
Generates more professional and clear videos from sparse or abstract conditions.
Successfully integrates reasoning and control for improved video generation quality.
Abstract
Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse, or complex conditions, leading to poor performance in professional production workflows that rely on conditions such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and…
