Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Ziyang Wang; Jaehong Yoon; Shoubin Yu; Md Mohaiminul Islam; Gedas Bertasius; Mohit Bansal

arXiv:2507.06485·cs.CV·October 27, 2025

TL;DR

Video-RTS introduces a data-efficient reinforcement learning approach combined with adaptive test-time scaling for improved video reasoning, reducing reliance on extensive fine-tuning and annotations while achieving superior accuracy on benchmarks.

Contribution

The paper combines pure RL training with a sparse-to-dense test-time scaling (TTS) strategy, significantly improving data efficiency and inference performance in video reasoning tasks.

Findings

01. Surpasses existing models by 2.4% accuracy with only 3.6% of the training data

02. Achieves a 4.2% improvement on the Video-Holmes benchmark

03. Eliminates the need for large-scale supervised fine-tuning

Abstract

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about the data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy…
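
The abstract mentions pure-RL training with output-based rewards but this excerpt does not specify how the reward is computed. As a minimal sketch only, assuming multiple-choice video QA where the model wraps its final choice in `<answer>` tags (the tag convention and the function names `extract_choice` and `outcome_reward` are hypothetical, not taken from the paper), an outcome reward could score just the final answer and ignore the reasoning text:

```python
import re
from typing import Optional

# Assumed output convention: the final choice appears as <answer>B</answer>.
# This pattern is illustrative and not confirmed by the source.
ANSWER_RE = re.compile(r"<answer>\s*\(?([A-Za-z])\)?\s*</answer>")


def extract_choice(response: str) -> Optional[str]:
    """Return the last tagged answer letter in upper case, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def outcome_reward(response: str, gold_choice: str) -> float:
    """Give 1.0 when the tagged answer matches the gold letter, else 0.0.

    Only the final output is checked, so no chain-of-thought annotation
    is needed to compute the reward.
    """
    predicted = extract_choice(response)
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0
```

A reward of this kind needs only question-answer pairs, which is consistent with the abstract's claim that no additional annotations are required; the actual reward design in the paper may differ.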

Code & Models

Models

🤗 Ted412/Video-RTS
model · 119 dl · ♡ 2

Videos

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning