ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models
Jiaxin Liu, Zhaolu Kang

TL;DR
ReasonAct is a three-stage progressive training method that improves small-model performance on fine-grained video reasoning by combining text-only reasoning, video fine-tuning, and temporal-aware reinforcement learning.
Contribution
The paper presents a three-stage training framework for small models that incorporates temporal consistency modeling into policy optimization and adds a sub-action decomposition mechanism with graduated rewards for constituent action phases.
Findings
The 3B-parameter model achieves 67.2% accuracy on HMDB51, 94.1% on UCF-101, and 78.9% on Kinetics-400.
Outperforms baseline models by 12-18 points across datasets.
Validates effectiveness of progressive training in small models.
Abstract
While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating…
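The graduated-reward idea and the group-relative step it feeds into can be illustrated with a minimal sketch. This is not the paper's implementation: the phase names, the `phase_weight` blend, and both function names are hypothetical, and the temporal-consistency term the abstract mentions is not reproduced here. The second function shows only the standard group-relative normalization (reward minus group mean, divided by group standard deviation) that GRPO-style methods use.

```python
from statistics import mean, pstdev


def graduated_reward(predicted_phases, reference_phases, final_correct,
                     phase_weight=0.5):
    """Blend partial credit for matched sub-action phases with credit for
    the final action label. The 50/50 weighting is an assumption."""
    if reference_phases:
        # Count positions where the predicted phase matches the reference.
        matched = sum(1 for p, r in zip(predicted_phases, reference_phases)
                      if p == r)
        phase_score = matched / len(reference_phases)
    else:
        phase_score = 0.0
    label_score = 1.0 if final_correct else 0.0
    return phase_weight * phase_score + (1.0 - phase_weight) * label_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical phases for a "tennis swing" clip.
reference = ["approach", "swing", "follow_through"]
rewards = [
    graduated_reward(["approach", "swing", "follow_through"], reference, True),
    graduated_reward(["approach", "jump", "land"], reference, False),
    graduated_reward(["run", "jump", "land"], reference, False),
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response that identifies some phases correctly but misses the final label still receives a higher advantage than one that gets nothing right, which is the point of graduating the reward rather than scoring only the final answer.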
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
