3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Xiongkun Linghu; Jiangyong Huang; Baoxiong Jia; Siyuan Huang

arXiv:2603.04976·cs.CV·March 6, 2026

TL;DR

This paper introduces 3D-RFT, a reinforcement fine-tuning framework that optimizes multi-modal large language models for video-based 3D scene understanding directly against verifiable, metric-based rewards, achieving state-of-the-art results.

Contribution

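The paper presents the first framework to apply reinforcement fine-tuning with verifiable rewards to video-based 3D perception and reasoning. Optimizing directly towards evaluation metrics aligns training with task performance more closely than supervised fine-tuning, whose token-level cross-entropy loss is only an indirect proxy.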

Findings

01

3D-RFT-4B outperforms larger models on 3D detection and reasoning tasks.

02

Reinforcement fine-tuning with task-specific rewards enhances model performance.

03

The framework is robust, and the study yields practical training insights.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy…
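
To make the idea of a verifiable, metric-aligned reward concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before it describes the reward functions or the optimizer in full, so everything below is an assumption for illustration. The helper names `iou_3d` and `group_relative_advantages`, the axis-aligned box format, and the mean/standard-deviation normalization are all hypothetical choices. The sketch scores several sampled box predictions against a ground-truth box with 3D IoU and then compares each sample to the others in its group.

```python
from statistics import mean, pstdev


def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU (assumed reward; the paper's reward may differ).

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (box_a[3] - box_a[0]) * (box_a[4] - box_a[1]) * (box_a[5] - box_a[2])
    vol_b = (box_b[3] - box_b[0]) * (box_b[4] - box_b[1]) * (box_b[5] - box_b[2])
    return inter / (vol_a + vol_b - inter)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one ground-truth box and three sampled predictions.
ground_truth = (0, 0, 0, 2, 2, 2)
samples = [
    (0, 0, 0, 2, 2, 2),  # exact match
    (1, 0, 0, 3, 2, 2),  # shifted by half the box width
    (5, 5, 5, 6, 6, 6),  # disjoint
]
rewards = [iou_3d(s, ground_truth) for s in samples]
advantages = group_relative_advantages(rewards)
```

Unlike a token-level cross-entropy loss, the reward here is the evaluation metric itself, so a sample is favored only when its predicted box actually overlaps the target more than the other samples in its group.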

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications