Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization
Carter Adams, Rafael Oliveira, Gabriel Almeida, Sofia Torres

TL;DR
This paper provides a theoretical analysis of reinforcement fine-tuning in large vision-language models, introducing a formal framework and deriving convergence, reward decomposition, and generalization results.
Contribution
It introduces the Tool-Augmented Markov Decision Process framework and proves key theorems on convergence, reward decomposition benefits, and out-of-distribution generalization.
Findings
GRPO converges at rate O(1/√T) with composite rewards
Reward decomposition bounds the sub-optimality gap
PAC-Bayes bound explains transferability in Visual-ARFT
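As a rough illustration of the first finding's setting, here is a minimal sketch of a composite verifiable reward feeding GRPO-style group-relative advantages. The function names, the equal weights, and the example rollouts are assumptions made for illustration and are not taken from the paper. The normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO formulation; the paper's exact variant may differ.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, answer_ok, tool_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three binary verifiable checks.

    The weights are hypothetical; the paper's weighting is not given here.
    """
    w_format, w_answer, w_tool = weights
    return (
        w_format * float(format_ok)
        + w_answer * float(answer_ok)
        + w_tool * float(tool_ok)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four hypothetical rollouts for the same prompt:
# (format compliant, answer correct, tool call executable)
rollouts = [
    (True, True, True),
    (True, False, True),
    (False, False, False),
    (True, False, False),
]
rewards = [composite_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered within the group, rollouts that satisfy more reward components are pushed up relative to their siblings, and no separate value network is needed for the baseline.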
Abstract
Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the Tool-Augmented Markov Decision Process (TA-MDP), a formal framework that models multimodal agentic…
