VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Yunxin Li; Xinyu Chen; Zitao Li; Zhenyu Liu; Longyue Wang; Wenhan Luo; Baotian Hu; Min Zhang

arXiv:2505.19000·cs.CL·May 27, 2025

TL;DR

VerIPO introduces a verifier-guided iterative policy optimization framework that enhances Video-LLMs' ability to generate long, coherent reasoning chains efficiently, surpassing existing methods in performance and stability.

Contribution

The paper presents a novel verifier-guided iterative training loop that improves long-term reasoning in Video-LLMs, addressing data quality and stability issues in reinforcement fine-tuning.

Findings

1. Faster and more effective optimization than standard GRPO.
2. Models outperform instruction-tuned Video-LLMs in reasoning tasks.
3. One iteration of VerIPO surpasses some state-of-the-art models.

Abstract

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chains of thought (CoTs) and in downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of…
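
The hand-off from GRPO rollouts to DPO training data can be pictured with a minimal sketch. It is an illustration only, not the authors' implementation. The first function shows one common form of GRPO's group-relative advantage (reward minus group mean, divided by group standard deviation); using the population standard deviation and the `eps` term are assumptions. The second function is a hypothetical stand-in for the verifier stage: the abstract is truncated before it describes how rollouts are judged, so the numeric `verifier_scores`, the `margin` parameter, and the best-versus-worst pairing rule are all invented for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one rollout group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_preference_pairs(rollouts, verifier_scores, margin=0.0):
    """Turn scored rollouts for one prompt into DPO-style preference pairs.

    Hypothetical rule: pair the highest-scored rollout (chosen) with the
    lowest-scored one (rejected), and skip the prompt when the score gap
    does not exceed `margin`.
    """
    ranked = sorted(zip(verifier_scores, rollouts), key=lambda p: p[0], reverse=True)
    best_score, best = ranked[0]
    worst_score, worst = ranked[-1]
    if best_score - worst_score <= margin:
        return []
    return [{"chosen": best, "rejected": worst}]


# Example: four rollouts for one prompt with binary outcome rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Example: three rollouts with made-up verifier scores.
pairs = build_preference_pairs(["cot_a", "cot_b", "cot_c"], [0.2, 0.9, 0.5])
```

In this sketch, correct rollouts receive positive advantages and incorrect ones negative, while the pairing step yields `cot_b` as chosen and `cot_a` as rejected. A real pipeline would feed such pairs to a DPO trainer before the next GRPO round.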

Code & Models

Repositories

hitsz-tmg/veripo (PyTorch, official)

Models

Uni-MoE/VerIPO-7B-v1.0 (Hugging Face model, 7 downloads)

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Direct Preference Optimization