Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

TL;DR
Visual-RFT extends reinforcement fine-tuning to visual tasks, using large vision-language models and verifiable reward functions to improve performance and generalization across image understanding benchmarks.
Contribution
This work introduces Visual-RFT, a novel reinforcement fine-tuning method for visual tasks using verifiable reward functions and policy optimization, expanding RFT applications beyond language models.
Findings
Improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification.
Outperforms the baseline by 21.9 points in the two-shot setting of few-shot object detection on COCO.
Demonstrates strong generalization across multiple visual reasoning benchmarks.
Abstract
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm…
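The loop the abstract describes (sample several responses per input, score each with a verifiable reward, then update the policy) can be illustrated with a minimal sketch. The visible part of the abstract does not name the specific rewards or the optimization algorithm, so everything below is an assumption for illustration: an IoU score stands in for a detection reward, an exact-match score stands in for a classification reward, and the advantage is computed by normalizing rewards within the group of sampled responses. The function names `iou`, `classification_reward`, and `group_relative_advantages` are hypothetical and do not come from the paper's code.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def classification_reward(predicted_label, true_label):
    """Assumed rule-based reward: 1.0 on a case-insensitive exact match, else 0.0."""
    same = predicted_label.strip().lower() == true_label.strip().lower()
    return 1.0 if same else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Assumed update signal: normalize each reward against its own group.

    Responses that score above the group mean get a positive advantage,
    those below get a negative one. eps avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical boxes parsed from four sampled responses for one image.
    ground_truth = (10, 10, 50, 50)
    sampled_boxes = [(12, 11, 49, 52), (0, 0, 20, 20), (10, 10, 50, 50), (60, 60, 90, 90)]
    rewards = [iou(box, ground_truth) for box in sampled_boxes]
    for box, adv in zip(sampled_boxes, group_relative_advantages(rewards)):
        print(box, round(adv, 3))
```

Because both rewards are computed by a fixed rule against ground truth, no learned reward model is needed, which is what makes them "verifiable" in the sense the abstract uses. A real training setup would also have to parse the box or label out of the generated text and feed the advantages into a policy-gradient update, neither of which is shown here.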
Taxonomy
Topics: Advanced Vision and Imaging · Interactive and Immersive Displays
