LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang; Sudong Wang; Kaichen Zhang; Keming Wu; Sicong Leng; Yifan Zhang; Bo Li; Chengwei Qin; Shijian Lu; Xingxuan Li; Lidong Bing

arXiv:2511.20785·cs.CV·May 22, 2026

1 Repo 3 Models 2 Datasets

TL;DR

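LongVT is an end-to-end agentic framework for long-video reasoning. The model first skims the whole video, then uses its own temporal grounding ability as a native cropping tool to zoom in on relevant clips, interleaving tool calls with reasoning (Multimodal Chain-of-Tool-Thought) until the answer is grounded in visual evidence. The paper reports improved accuracy over existing methods.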

Contribution

The paper contributes an agentic framework that uses native video cropping inside a global-to-local reasoning loop, along with curated datasets and benchmarks for long-video question answering.

Findings

01. LongVT outperforms existing baselines on four long-video reasoning benchmarks.

02. The curated VideoSIAH dataset supports both training and evaluation.

03. Global-to-local reasoning improves evidence grounding.

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task,…
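The global-to-local loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `sample_timestamps`, `Step`, `global_to_local`, `toy_policy`, and the `num_frames` and `max_turns` parameters are all hypothetical names chosen for illustration, and the "model" is a stub that only sees frame timestamps rather than decoded frames. The sketch shows only the control flow: sample frames sparsely over the current window, let the policy either answer or request a narrower window, then resample the same frame budget inside that window.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


def sample_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Uniformly sample timestamps (seconds) at the midpoints of equal bins in [start, end]."""
    if num_frames <= 0 or end <= start:
        raise ValueError("need num_frames > 0 and end > start")
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


@dataclass
class Step:
    """One model turn: a final answer, or a crop request (start, end) in seconds."""
    answer: Optional[str] = None
    window: Optional[Tuple[float, float]] = None


# Hypothetical policy signature: (question, timestamps currently in view) -> Step
Policy = Callable[[str, List[float]], Step]


def global_to_local(
    question: str,
    duration: float,
    policy: Policy,
    num_frames: int = 8,
    max_turns: int = 4,
) -> Tuple[Optional[str], List[Tuple[float, float]]]:
    """Run the skim-then-zoom loop; return (answer, windows inspected)."""
    window = (0.0, duration)  # first turn: skim the whole video
    trace: List[Tuple[float, float]] = []
    for _ in range(max_turns):
        trace.append(window)
        frames = sample_timestamps(window[0], window[1], num_frames)
        step = policy(question, frames)
        if step.answer is not None:
            return step.answer, trace
        if step.window is None:
            break  # policy neither answered nor asked for a crop
        # Clamp the requested crop to the video bounds before zooming in.
        start = max(0.0, step.window[0])
        end = min(duration, step.window[1])
        if end <= start:
            break
        window = (start, end)
    return None, trace


def toy_policy(question: str, frames: List[float]) -> Step:
    """Stand-in for an LMM: zooms toward a fixed target until frames are dense enough."""
    target = 1234.0  # assumed location of the evidence, for illustration only
    spacing = frames[1] - frames[0]
    if spacing <= 10.0:
        return Step(answer="found")
    nearest = min(frames, key=lambda t: abs(t - target))
    return Step(window=(nearest - spacing, nearest + spacing))


answer, trace = global_to_local("What happens at the key moment?", 3600.0, toy_policy)
```

With a fixed frame budget per turn, each crop raises the effective sampling rate inside the window, which is the point of zooming in: the same eight frames that cover an hour on the first turn cover under a minute by the last one in this toy run.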

Code & Models

Repositories

EvolvingLMMs-Lab/LongVT (GitHub)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)