ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models

Jiaxin Liu; Zhaolu Kang

arXiv:2508.01533 · cs.CV · November 27, 2025

TL;DR

ReasonAct is a three-stage progressive training method that improves small-model performance on fine-grained video reasoning by combining text-only reasoning, video fine-tuning, and temporal-aware reinforcement learning.

Contribution

The paper presents a three-stage training framework for small models that incorporates temporal consistency modeling into policy optimization and adds a sub-action decomposition mechanism with graduated rewards for constituent action phases.

Findings

01. The 3B-parameter model reaches 67.2% accuracy on HMDB51, 94.1% on UCF-101, and 78.9% on Kinetics-400.

02. Outperforms baseline models by 12 to 18 points across datasets.

03. Validates the effectiveness of progressive training for small models.

Abstract

While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating…
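
The two reward-related ideas named in the abstract, graduated rewards for sub-action phases and group-relative policy optimization, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `phase_weight` parameter, the phase labels, and the way partial credit is combined are hypothetical and are not taken from the paper. The advantage computation follows the generic GRPO-style normalization (reward minus group mean, divided by group standard deviation); the temporal consistency term that the paper adds on top of T-GRPO is not modeled.

```python
from statistics import mean, pstdev


def graduated_reward(predicted_phases, reference_phases, final_correct,
                     phase_weight=0.5):
    """Hypothetical graduated reward: partial credit for each reference
    sub-action phase found in the model's reasoning, plus credit for the
    final action label. The weighting scheme is an assumption."""
    if reference_phases:
        matched = sum(1 for p in reference_phases if p in predicted_phases)
        phase_score = matched / len(reference_phases)
    else:
        phase_score = 0.0
    label_score = 1.0 if final_correct else 0.0
    return phase_weight * phase_score + (1.0 - phase_weight) * label_score


def group_relative_advantages(rewards, eps=1e-8):
    """Generic GRPO-style advantages: normalize each sampled response's
    reward against the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical phases for a "jump" clip; labels are illustrative only.
    reference = ["crouch", "push_off", "land"]
    group = [
        graduated_reward(["crouch", "push_off", "land"], reference, True),
        graduated_reward(["crouch"], reference, True),
        graduated_reward([], reference, False),
    ]
    print(group_relative_advantages(group))
```

The point of the graduated reward in this sketch is that a response identifying some of the constituent phases scores higher than one identifying none, even when both reach the same final label, which gives the policy a denser learning signal than a single right-or-wrong reward.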

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis