VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

TL;DR
VideoSeeker introduces a proactive, agentic approach to instance-level video understanding, integrating visual prompts and tool invocation to improve localization and perception accuracy.
Contribution
The paper presents a novel paradigm combining agentic reasoning with visual prompts and a large-scale data synthesis pipeline for enhanced instance-level video understanding.
Findings
Achieves a 13.7% improvement over baselines on instance-level tasks
Outperforms GPT-4o and Gemini-2.5-Pro
Demonstrates effective transferability to general video benchmarks
Abstract
Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to…
