3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

TL;DR
This paper introduces 3D-RFT, a reinforcement fine-tuning framework that uses verifiable reward functions to optimize multi-modal large language models directly for video-based 3D scene understanding tasks, and reports state-of-the-art results.
Contribution
The paper presents what the authors describe as the first framework to apply reinforcement fine-tuning with verifiable rewards to video-based 3D perception and reasoning. Compared with traditional supervised fine-tuning, it aligns the training objective more closely with the evaluation metrics.
Findings
3D-RFT-4B outperforms larger models on 3D detection and reasoning tasks.
Reinforcement fine-tuning with task-specific rewards enhances model performance.
The framework is reported to be robust, and the paper's analysis offers insights into training.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy…
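To make the idea of a verifiable, metric-aligned reward concrete, here is a minimal sketch of a detection reward based on axis-aligned 3D IoU. It is an illustration only: the excerpt above does not specify the paper's reward functions, and the box format and the names `iou_3d` and `detection_reward` are assumptions introduced for this example.

```python
from typing import Optional, Sequence

# Hypothetical box format (an assumption, not taken from the paper):
# (x_min, y_min, z_min, x_max, y_max, z_max), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(pred[axis], gt[axis])
        hi = min(pred[axis + 3], gt[axis + 3])
        inter *= max(0.0, hi - lo)  # no overlap on this axis gives 0
    union = box_volume(pred) + box_volume(gt) - inter
    return inter / union if union > 0.0 else 0.0


def detection_reward(pred: Optional[Box], gt: Box) -> float:
    """Reward in [0, 1]: the IoU itself, or 0 if no box could be parsed."""
    if pred is None:
        return 0.0
    return iou_3d(pred, gt)
```

Because the reward is computed from the same quantity an IoU-based evaluation uses, maximizing it targets the metric directly, whereas a token-level cross-entropy loss only does so indirectly.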
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
