Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

Carter Adams; Rafael Oliveira; Gabriel Almeida; Sofia Torres

arXiv:2604.19857 · cs.LG · April 23, 2026

TL;DR

This paper provides a theoretical analysis of reinforcement fine-tuning in large vision-language models, introducing a formal framework and deriving convergence, reward decomposition, and generalization results.

Contribution

It introduces the Tool-Augmented Markov Decision Process framework and proves key theorems on convergence, reward decomposition benefits, and out-of-distribution generalization.

Findings

01. GRPO converges at rate O(1/√T) with composite rewards

02. Reward decomposition bounds the sub-optimality gap

03. PAC-Bayes bound explains transferability in Visual-ARFT

Abstract

Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the Tool-Augmented Markov Decision Process (TA-MDP), a formal framework that models multimodal agentic…
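
As a rough illustration of the two ingredients the abstract names, the sketch below combines the three verifiable checks (format compliance, answer accuracy, tool executability) into one scalar reward and then standardizes rewards within a sampled group, which is the group-relative advantage used by GRPO. The weighted-sum combination, the weight values, and the function names are assumptions made for this example; the listing does not state how the paper combines the components.

```python
from statistics import mean, pstdev

# Hypothetical weights: the source does not say how the components are combined.
WEIGHTS = {"format": 0.2, "accuracy": 0.6, "tool": 0.2}


def composite_reward(format_ok, answer_ok, tool_ok, weights=WEIGHTS):
    """Weighted sum of three binary verifiable checks (an assumed combination rule)."""
    return (
        weights["format"] * float(format_ok)
        + weights["accuracy"] * float(answer_ok)
        + weights["tool"] * float(tool_ok)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; eps guards against a zero
    denominator when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by the three checks.
group = [
    composite_reward(True, True, True),     # all checks pass
    composite_reward(True, False, True),    # wrong answer
    composite_reward(False, False, False),  # all checks fail
    composite_reward(True, True, False),    # tool call not executable
]
advantages = group_relative_advantages(group)
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the structure of the composite reward matters for the optimizer.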
