OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation
Sen Liang, Zhentao Yu, Zhengguang Zhou, Teng Hu, Hongmei Wang, Yi Chen, Qin Lin, Yuan Zhou, Xin Li, Qinglin Lu, Zhibo Chen

TL;DR
OmniV2V is a versatile video generation and editing model that supports multiple scenarios and operations through dynamic content manipulation and a visual-text instruction understanding module.
Contribution
The paper introduces OmniV2V, a unified model capable of diverse video editing and generation tasks with a new dynamic content manipulation module and a multi-task data system.
Findings
OmniV2V performs on par with or better than existing models across tasks.
The model effectively handles multiple video editing operations.
A comprehensive dataset and benchmark were created for evaluation.
Abstract
The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to…
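As a rough illustration only: one reviewer below summarizes the injection module as concatenating or adding condition tokens with random dropout. The sketch below shows that general pattern in plain Python. The function name `inject_conditions`, the `drop_prob` parameter, and the choice to zero out dropped conditions are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import Dict, List, Optional

Token = List[float]  # one token embedding


def inject_conditions(
    video_tokens: List[Token],
    conditions: Dict[str, List[Token]],
    drop_prob: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[Token]:
    """Append condition tokens to the video token sequence.

    Hypothetical sketch: each condition (e.g. "mask", "pose", "image") is
    independently dropped with probability drop_prob by zeroing its tokens,
    so a single model can be trained to run with any subset of conditions.
    """
    rng = rng or random.Random()
    sequence = [list(t) for t in video_tokens]
    # Sorted order keeps the token layout deterministic across calls.
    for name in sorted(conditions):
        dropped = rng.random() < drop_prob
        for token in conditions[name]:
            sequence.append([0.0] * len(token) if dropped else list(token))
    return sequence
```

With `drop_prob=0.0` every condition is kept, and with `drop_prob=1.0` every condition is zeroed, which is a convenient way to check the layout of the combined sequence.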
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is well-written and easy to follow. 2. The problems this paper is solving are important. It directly attacks the costly and inefficient "one model per task" paradigm, which is a significant bottleneck in the field. A successful unified model would be a major contribution. 3. The evaluation is comprehensive: The authors have done a massive amount of work to evaluate their model. Comparing against a long list of SOTA models (VACE) and closed-source commercial models (Kling, Pika) prov
1. Benefit of Unification is Unclear: The paper's central premise is that unification is better, but the results in Table 2 seem to contradict this. The OmniV2V-Unified model is almost always slightly worse than the authors' own single-task trained models (OmniV2V-Mask, OmniV2V-Animation, OmniV2V-Control, etc.). This suggests that the unification introduces negative interference between tasks, forcing a performance trade-off. The main benefit seems to be parameter efficiency (one model vs. many
1. Successfully unifying 8 different video editing tasks in a single framework is a notable engineering achievement with practical value 2. The paper provides thorough quantitative metrics (Face-sim, DINO-sim, CLIP, FVD, temporal consistency) and user studies across multiple tasks 3. The combination of text, image, pose, and mask conditions through a unified architecture is well-executed from an engineering perspective
1. Core contributions (token fusion, using LLaVA, dynamic training strategy) are straightforward extensions of existing techniques 2. The "unified dynamic content manipulation injection module" is essentially concatenating/adding tokens with random dropout 3. Did you retrain any open-source baselines (Mimicmotion, Champ, UniAnimate, etc.) on your OmniV2V dataset? If not, how can you claim your method is better when it's trained on different (potentially better) data? 4. Some claims are vague
1. The proposed unified video editing framework is important and can be beneficial to future works on video editing. 2. The visual results for the various video editing tasks are appealing.
1. The paper needs a major improvement in its writing quality. There are a lot of notation inconsistencies (e.g. FC and Fc used interchangeably for representing the fully connected layer); some figures are referred to as "figure" and others as "Figure"; the citation formats in the paper are incorrect; when citing other models, some model names are also incorrect (e.g. Stableviton should be StableVITON; Grounding Sam 2 should be Grounded SAM 2; ArcFace is referred to as both ArcFace and Arcface
S1) It has a good motivation: to support a wide range of video generation and editing tasks within a single unified system. S2) The content manipulation injection module and visual-text instruction module allow efficient fusion of multi-modal inputs for flexible and adaptable content manipulation across diverse video generation and editing tasks. This contributes to the system’s versatility and usability for a wide range of applications within one unified framework. S3) The design choice of
W1) "We propose OmniV2V, a unified video generation and editing framework" The statement is misleading. "Video-to-video (V2V)" usually implies editing one input video to produce another, not general video synthesis/generation. The short title only mentions video-to-video editing, while the paper claims broader capabilities in both video generation and editing. The claim should be revised to make sure the title accurately reflects the full scope of the contributions. W2) The method section is p
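The reviews above list temporal consistency among the reported metrics without defining it. A common way to compute such a score, assumed here and not confirmed as the paper's definition, is the mean cosine similarity between embeddings of consecutive frames. The sketch below takes precomputed per-frame embedding vectors from whatever image encoder is in use; the function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def temporal_consistency(frame_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between each frame and the next one.

    Assumed definition for illustration; higher values mean adjacent
    frames have more similar embeddings.
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    sims = [
        cosine_similarity(frame_embeddings[i], frame_embeddings[i + 1])
        for i in range(len(frame_embeddings) - 1)
    ]
    return sum(sims) / len(sims)
```

Identical consecutive embeddings give a score of 1, and orthogonal ones give 0, so the score drops as frame-to-frame appearance changes more abruptly.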
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Multimedia Communication and Technology
Methods: Diffusion
