VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Yuqian Yuan; Hang Zhang; Wentong Li; Zesen Cheng; Boqiang Zhang; Long Li; Xin Li; Deli Zhao; Wenqiao Zhang; Yueting Zhuang; Jianke Zhu; Lidong Bing

arXiv:2501.00599 · cs.CV · March 26, 2025

TL;DR

The paper introduces VideoRefer Suite, a comprehensive framework including a new dataset, model, and benchmark, to enhance spatial-temporal object understanding in Video LLMs, addressing previous limitations in fine-grained video comprehension.

Contribution

It presents a new high-quality object-level video instruction dataset, a versatile spatial-temporal object encoder model, and a benchmark for detailed video understanding evaluation.

Findings

01. The VideoRefer model achieves promising results on referring benchmarks.

02. The VideoRefer-700K dataset enables fine-grained spatial-temporal understanding.

03. The benchmark facilitates comprehensive assessment of Video LLM capabilities.

Abstract

Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle to capture fine-grained spatial and temporal details. Moreover, the lack of high-quality object-level video instruction data and of a comprehensive benchmark further hinders their advancement. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any object throughout the video. Specifically, we develop the VideoRefer Suite across three essential aspects: dataset, model, and benchmark. First, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the…

Code & Models

Repositories

damo-nlp-sg/videorefer (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Focus