VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Yiming Zhao; Yu Zeng; Wenxuan Huang; Zhen Fang; Qing Miao; Qisheng Su; Jiawei Zhao; Jiayin Cai; Lin Chen; Zehui Chen; Yukun Qi; Yao Hu; Xiaolong Jiang; Feng Zhao

arXiv:2605.16079 · cs.CV · May 18, 2026

TL;DR

VideoSeeker introduces a proactive, agentic approach to instance-level video understanding, integrating visual prompts and tool invocation to improve localization and perception accuracy.

Contribution

The paper presents a novel paradigm combining agentic reasoning with visual prompts and a large-scale data synthesis pipeline for enhanced instance-level video understanding.

Findings

01. Achieves +13.7% improvement over baselines on instance-level tasks

02. Surpasses GPT-4o and Gemini-2.5-Pro in performance

03. Demonstrates effective transferability to general video benchmarks

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to…
