UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Zhengyang Liang; Daoan Zhang; Huichi Zhou; Rui Huang; Bobo Li; Yuechen Zhang; Shengqiong Wu; Xiaohan Wang; Jiebo Luo; Lizi Liao; Hao Fei

arXiv:2511.08521 · cs.CV · November 12, 2025

TL;DR

UniVA is an open-source multi-agent framework that unifies video understanding, editing, and generation, enabling complex, iterative workflows for next-generation video AI applications.

Contribution

UniVA introduces a hierarchical multi-agent architecture with a planning and execution system, together with a comprehensive benchmark suite for multi-step video tasks.

Findings

01. Supports multi-round, conditioned video workflows
02. Achieves long-horizon reasoning with hierarchical memory
03. Provides a fully open-source platform for research
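
The hierarchical memory in finding 02 is described in the abstract as three levels: global knowledge, task context, and user-specific preferences. The following is a minimal sketch of how such a store might be organized, not UniVA's implementation. The class name `HierarchicalMemory`, its methods, and the lookup order (user preferences first, then task context, then global knowledge) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class HierarchicalMemory:
    """Hypothetical three-level memory; the level names follow the abstract."""

    global_knowledge: dict[str, Any] = field(default_factory=dict)
    task_context: dict[str, Any] = field(default_factory=dict)
    user_preferences: dict[str, Any] = field(default_factory=dict)

    def recall(self, key: str) -> Optional[Any]:
        # Assumed precedence: the most specific level that holds the key wins.
        for level in (self.user_preferences, self.task_context, self.global_knowledge):
            if key in level:
                return level[key]
        return None

    def end_task(self) -> None:
        # Task context is assumed to be short-lived; the other levels persist.
        self.task_context.clear()


memory = HierarchicalMemory()
memory.global_knowledge["fps"] = 24       # default shared by every session
memory.user_preferences["fps"] = 30       # this user's override
memory.task_context["source_clip"] = "intro.mp4"
```

Here `memory.recall("fps")` resolves to the user's override rather than the global default, and calling `memory.end_task()` drops `source_clip` while leaving both longer-lived levels intact.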

Abstract

While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning,…
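
The Plan-and-Act split described in the abstract can be illustrated with a toy sketch: a planner turns a request into ordered steps, and an executor runs each step against a named tool, passing the previous result forward. Everything below is an assumption for illustration. The keyword-based `plan` function stands in for an LLM planner, the local functions in `TOOLS` stand in for MCP-based tool servers, and none of the names come from the UniVA codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    tool: str                  # name of the tool to call
    arg: Optional[str] = None  # None means "use the previous step's output"


# Hypothetical local stand-ins for remote tool servers.
def generate(prompt: str) -> str:
    return f"video({prompt})"


def edit(video: str) -> str:
    return f"edited({video})"


def analyze(video: str) -> str:
    return f"caption({video})"


TOOLS: dict[str, Callable[[str], str]] = {
    "generate": generate,
    "edit": edit,
    "analyze": analyze,
}


def plan(request: str) -> list[Step]:
    """Toy planner: keyword rules in place of a language model."""
    text = request.lower()
    steps: list[Step] = []
    if "generate" in text:
        steps.append(Step("generate", request))
    if "edit" in text:
        steps.append(Step("edit"))
    if "describe" in text:
        steps.append(Step("analyze"))
    return steps


def execute(steps: list[Step]) -> list[str]:
    """Toy executor: runs steps in order, chaining each output into the next."""
    trace: list[str] = []
    previous: Optional[str] = None
    for step in steps:
        arg = step.arg if step.arg is not None else previous
        if arg is None:
            raise ValueError(f"step '{step.tool}' has no input to work on")
        previous = TOOLS[step.tool](arg)
        trace.append(previous)
    return trace
```

Calling `execute(plan("Generate a beach clip, then edit it"))` runs `generate` followed by `edit`. Keeping the planner unaware of how tools are implemented is what lets tools be added or swapped without touching the planning logic.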

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics