PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

Yizhen Zhang; Yang Ding; Shuoshuo Zhang; Xinchen Zhang; Haoling Li; Zhong-zhi Li; Peijie Wang; Jie Wu; Lei Ji; Yelong Shen; Yujiu Yang; Yeyun Gong

arXiv:2506.14907 · cs.CV · June 19, 2025

TL;DR

PeRL is a permutation-enhanced reinforcement learning framework for reasoning across multiple images. It aims to improve exploration and positional understanding, and it reports state-of-the-art results on multi-image benchmarks.

Contribution

The paper proposes a permutation-based reinforcement learning approach, combined with a multi-stage strategy, for interleaved vision-language reasoning in complex multi-image scenarios.

Findings

01. PeRL outperforms existing baselines on multi-image benchmarks.

02. It achieves state-of-the-art performance on 5 multi-image tasks.

03. It maintains competitive results on single-image benchmarks.

Abstract

Inspired by the reasoning capabilities demonstrated by reinforcement learning approaches such as DeepSeek-R1, recent research has begun exploring reinforcement learning (RL) to enhance vision-language models (VLMs) on multimodal reasoning tasks. However, most existing multimodal RL approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, together with a multi-stage strategy designed to improve the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce…
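
The summary above describes permuting multiple images to strengthen positional understanding, but the abstract is truncated here and does not spell out the mechanism. As a rough illustration only, and not the authors' implementation, the following sketch shows one way an image-order permutation could be applied to a multi-image sample while keeping the text consistent. The function name `permute_images`, the `Image k` reference convention, and the file names are all hypothetical.

```python
import random
import re


def permute_images(images, text, rng):
    """Shuffle the image order and rewrite 'Image k' references to match.

    Hypothetical helper: assumes the text refers to images by 1-based
    position as 'Image k'. This is an illustrative convention, not one
    taken from the paper.
    """
    order = list(range(len(images)))
    rng.shuffle(order)  # order[new_pos] == old_pos

    new_images = [images[old] for old in order]
    old_to_new = {old: new for new, old in enumerate(order)}

    def remap(match):
        old_idx = int(match.group(1)) - 1
        if old_idx not in old_to_new:
            return match.group(0)  # leave out-of-range references untouched
        return f"Image {old_to_new[old_idx] + 1}"

    # re.sub with a callable rewrites every reference in a single pass,
    # so a remapped index is never remapped a second time.
    new_text = re.sub(r"Image (\d+)", remap, text)
    return new_images, new_text, order


images = ["cat.png", "dog.png", "bird.png"]
question = "Is the animal in Image 1 larger than the one in Image 3?"
new_images, new_question, order = permute_images(images, question, random.Random(0))
```

The point of the remapping step is that the question still refers to the same pictures after shuffling; only their positions in the sequence change. How such permuted samples would feed into RL training is not covered by this sketch.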

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Fuzzy Logic and Control Systems