VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Qi Wang; Yanrui Yu; Ye Yuan; Rui Mao; Tianfei Zhou

arXiv:2505.12434·cs.CV·October 15, 2025

TL;DR

VideoRFT introduces a novel reinforcement fine-tuning approach to enhance multi-modal large language models' video reasoning abilities, leveraging new datasets and a semantic consistency reward to achieve state-of-the-art results.

Contribution

The paper presents VideoRFT, a new method extending reinforcement fine-tuning to video reasoning in MLLMs, including a reasoning dataset and a semantic consistency reward.

Findings

01. Achieved state-of-the-art performance on six video reasoning benchmarks.

02. Developed two new datasets: VideoRFT-CoT-102K and VideoRFT-RL-310K.

03. Introduced a semantic consistency reward to improve reasoning grounded in visual evidence.
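
This page does not say how the semantic consistency reward is computed. The sketch below is only an illustration of the general idea of rewarding reasoning that stays close to the visual evidence. It assumes, hypothetically, that the reasoning text and the sampled video frames have already been embedded into a shared vector space by some encoder, and that the reward is the clipped cosine similarity between the text embedding and the mean frame embedding. The function names, the mean pooling, and the clipping at zero are all assumptions made for this example, not the authors' formulation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pool(frame_embs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame embeddings into one video-level vector (assumed pooling)."""
    n = len(frame_embs)
    dim = len(frame_embs[0])
    return [sum(frame[i] for frame in frame_embs) / n for i in range(dim)]


def semantic_consistency_reward(
    text_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
) -> float:
    """Hypothetical reward in [0, 1]: how well the reasoning text matches the video.

    Assumption: higher text-video similarity means better visual grounding.
    Negative similarities are clipped to 0 so the reward never penalizes.
    """
    if not frame_embs:
        return 0.0
    video_emb = mean_pool(frame_embs)
    return max(0.0, cosine_similarity(text_emb, video_emb))
```

In an RL stage, a term like this would typically be added to other reward signals, such as answer correctness, before computing the policy update. How VideoRFT actually combines its rewards is not stated on this page.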

Abstract

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven,…

Code & Models

Repositories

qiwang98/videorft
PyTorch · Official

Datasets

QiWang98/VideoRFT-Data
dataset · 1.0k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Shrink and Fine-Tune