Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal

TL;DR
Video-RTS introduces a data-efficient reinforcement learning approach combined with adaptive test-time scaling for improved video reasoning, reducing reliance on extensive fine-tuning and annotations while achieving superior accuracy on benchmarks.
Contribution
Video-RTS combines pure RL training with a sparse-to-dense test-time scaling (TTS) strategy, improving both data efficiency and inference performance in video reasoning tasks.
Findings
Surpasses existing video reasoning models by 2.4% in accuracy while using only 3.6% of the training data
Achieves a 4.2% improvement on the Video-Holmes benchmark
Eliminates the need for large-scale supervised fine-tuning
Abstract
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy…
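To make "output-based rewards" concrete, here is a minimal sketch of a reward that scores only the model's final answer against the existing question label, so no CoT annotation is needed. It is an illustration, not the authors' implementation: the `<answer>` tag convention, the case-insensitive matching rule, and the function names `extract_answer` and `output_reward` are all assumptions, since the abstract does not specify them.

```python
import re
from typing import Optional

# Assumed (hypothetical) output convention: the final answer is wrapped
# in <answer>...</answer> tags somewhere in the model response.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*(.*?)\s*</answer>", re.DOTALL | re.IGNORECASE
)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> pair, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def output_reward(response: str, gold: str) -> float:
    """Score only the final output: 1.0 on a match with the gold label, else 0.0.

    The reasoning text before the answer is never inspected, so the only
    supervision required is the question's existing answer label.
    """
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0  # no parseable answer earns no reward
    return 1.0 if predicted.upper() == gold.strip().upper() else 0.0
```

A response such as `"The ball rolls left. <answer>b</answer>"` with gold label `"B"` would score 1.0 under this sketch, while a response with no answer tags scores 0.0. In an RL loop, a scalar like this would be computed per sampled response and passed to the policy-gradient update.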
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
