CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks

Yanan Wang; Julio Vizcarra; Zhi Li; Hao Niu; Mori Kurokawa

arXiv:2507.13609·cs.CV·July 21, 2025

Open Access

TL;DR

This paper introduces CoTasks, a framework that decomposes complex video questions into structured reasoning steps, significantly improving the reasoning capabilities of VideoLLMs through object-centric, step-by-step training.

Contribution

The paper decomposes video reasoning tasks into entity-level subtasks and embeds these subtasks into training to enhance model reasoning abilities.

Findings

01. LLaVA-video-7B improves by +3.3 points on GPT-4 evaluation.

02. Qwen2.5-VL-3B gains +17.4 points, with large boosts in causal, temporal, and descriptive reasoning.

03. Structured CoT supervision significantly enhances compositional video reasoning.

Abstract

Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained, object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs and often lack the structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions from existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial relation extraction, and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on…
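To make the idea of embedding intermediate reasoning steps into the input more concrete, here is a minimal Python sketch. It is not the authors' data format: the `CoTSteps` container, its field names, the `build_prompt` helper, and the prompt wording are all hypothetical, chosen only to illustrate how outputs of the four entity-level tasks named in the abstract might be serialized ahead of a question.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical types: (x1, y1, x2, y2) pixel box and (subject, relation, object) triple.
BBox = tuple[int, int, int, int]
Triple = tuple[str, str, str]


@dataclass
class CoTSteps:
    """Hypothetical container for the four entity-level subtask outputs."""
    frames: list[int]                           # frame localization: relevant frame indices
    tracks: dict[str, list[tuple[int, BBox]]]   # entity tracking: entity -> (frame, box) pairs
    spatial: list[Triple]                       # spatial relation extraction
    temporal: list[Triple]                      # temporal relation extraction


def build_prompt(question: str, steps: CoTSteps) -> str:
    """Serialize the intermediate steps, then append the question (assumed layout)."""
    lines = [
        "Step 1 (frame localization): frames "
        + ", ".join(str(f) for f in steps.frames)
    ]
    for name, boxes in steps.tracks.items():
        located = "; ".join(f"frame {f} at {list(b)}" for f, b in boxes)
        lines.append(f"Step 2 (entity tracking): {name}: {located}")
    for subj, rel, obj in steps.spatial:
        lines.append(f"Step 3 (spatial relation): {subj} {rel} {obj}")
    for first, rel, second in steps.temporal:
        lines.append(f"Step 4 (temporal relation): {first} {rel} {second}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Illustrative, made-up example values (not taken from NeXT-QA or STAR).
example = CoTSteps(
    frames=[12, 30],
    tracks={"boy": [(12, (40, 60, 120, 200))]},
    spatial=[("boy", "next to", "dog")],
    temporal=[("boy picks up ball", "before", "dog runs")],
)
prompt = build_prompt("Why does the dog run?", example)
print(prompt)
```

Keeping each subtask as its own field means a training pipeline could, under this assumed layout, include or drop individual steps to study which kind of supervision helps most.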

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning