UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

Hewen Pan; Cong Wei; Dashuang Liang; Zepeng Huang; Pengfei Gao; Ziqi Zhou; Lulu Xue; Pengfei Yan; Xiaoming Wei; Minghui Li; Shengshan Hu

arXiv:2512.11336·cs.CV·March 25, 2026

TL;DR

UFVideo introduces a unified multi-grained Video LLM capable of global, pixel, and temporal understanding, bridging the gap in comprehensive video perception and outperforming existing specialized models.

Contribution

This work presents UFVideo, the first Video LLM with unified multi-grained understanding, and constructs UFVideo-Bench for comprehensive evaluation across diverse video tasks.

Findings

01. UFVideo outperforms GPT-4o on UFVideo-Bench tasks.

02. UFVideo effectively handles global, pixel, and temporal scales.

03. Validated on 9 public benchmarks.

Abstract

With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed for both holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks and fail to achieve comprehensive, multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel, and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates a textual response, a temporal localization, or a grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three…

Code & Models

Models

Hevven/UFVideo-7B
model · 8 dl · ♡ 1

Datasets

Hevven/UFVideo-Bench
dataset · 25 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis