Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Haoji Zhang; Xin Gu; Jiawen Li; Chixiang Ma; Sule Bai; Chubin Zhang; Bowen Zhang; Zhichao Zhou; Dongliang He; Yansong Tang

arXiv:2508.04416 · cs.CV · September 4, 2025

TL;DR

This paper introduces VITAL, a tool-augmented multimodal reasoning framework for long video understanding. The model densely samples new video frames on demand, generates multimodal chains of thought, and is trained with multi-task reinforcement learning; the authors report that it outperforms existing methods.

Contribution

The paper presents VITAL, an end-to-end agentic video reasoning model equipped with a visual toolbox, together with two multi-task video reasoning datasets and a difficulty-aware reinforcement learning algorithm for long video reasoning.

Findings

01. VITAL outperforms existing methods on 11 video understanding benchmarks.
02. The model effectively handles long videos and complex reasoning chains.
03. Multi-task training improves performance in question answering and temporal grounding.

Abstract

The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for…
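As an illustration only of what "densely sample new video frames on demand" could look like, here is a minimal sketch of a helper that picks evenly spaced timestamps inside a requested time window. The function name, its parameters, and the uniform-spacing strategy are hypothetical and are not taken from the paper; VITAL's actual visual toolbox may work differently.

```python
def dense_frame_timestamps(start_s: float, end_s: float, num_frames: int) -> list[float]:
    """Return `num_frames` timestamps (in seconds) evenly spaced over [start_s, end_s].

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if end_s < start_s:
        raise ValueError("end_s must not be earlier than start_s")
    if num_frames == 1:
        # A single frame is taken from the middle of the window.
        return [(start_s + end_s) / 2]
    # Both endpoints of the window are included.
    step = (end_s - start_s) / (num_frames - 1)
    return [start_s + i * step for i in range(num_frames)]


if __name__ == "__main__":
    # Request 5 frames from the 10 s to 20 s segment of a video.
    print(dense_frame_timestamps(10.0, 20.0, 5))
```

In a tool-augmented setup of this kind, a model would presumably emit the window boundaries as a tool call, and the returned timestamps would determine which frames are decoded and fed back into the reasoning chain.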

Code & Models

Datasets

zhang9302002/MultiTaskVideoReasoning
dataset · 188 downloads
