VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism

Congzhi Zhang; Jiawei Peng; Zhenglin Wang; Yilong Lai; Haowen Sun; Heng Chang; Fei Ma; Weijiang Yu

arXiv:2506.08691·cs.CV·June 11, 2025

TL;DR

VReST introduces a training-free method combining tree search and self-reward to significantly improve complex reasoning in large vision-language models, achieving state-of-the-art results in multimodal mathematical reasoning benchmarks.

Contribution

The paper presents VReST, a novel approach using Monte Carlo Tree Search and self-reward mechanisms to enhance reasoning in LVLMs without additional training.

Findings

1. VReST outperforms existing prompting methods on three benchmarks.

2. It demonstrates the effectiveness of test-time scaling laws in multimodal tasks.

3. VReST achieves state-of-the-art performance in multimodal mathematical reasoning.

Abstract

Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art…
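
The search described in the abstract can be illustrated with a minimal sketch: a tree whose nodes each hold one reasoning step, explored by Monte Carlo Tree Search and scored by a reward that combines three signals (sub-question utility, answer correctness, vision-language clue relevance). Everything named here is an assumption made for illustration and is not taken from the paper or its repository: `Node`, `search`, `combine_rewards`, the equal-weight mean used to merge the three signals, the UCT selection rule with constant 1.4, and the toy `toy_propose` / `toy_score` stand-ins. In a real setting the two stand-ins would presumably be replaced by prompts to the LVLM itself, since the method uses no additional models.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One reasoning step; a root-to-node path is a reasoning sequence."""
    step: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def path(self) -> List[str]:
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.step)
            node = node.parent
        return steps[::-1]

    def mean(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def combine_rewards(usefulness: float, correctness: float, relevance: float) -> float:
    # ASSUMPTION: equal-weight mean; the paper's actual combination may differ.
    return (usefulness + correctness + relevance) / 3.0


def uct(child: Node, c: float = 1.4) -> float:
    # Standard UCT score; unvisited children are tried first.
    if child.visits == 0:
        return math.inf
    return child.mean() + c * math.sqrt(math.log(child.parent.visits) / child.visits)


def search(
    propose: Callable[[List[str]], List[str]],
    score: Callable[[List[str]], Tuple[float, float, float]],
    max_depth: int = 3,
    iterations: int = 50,
) -> List[str]:
    root = Node("<root>")
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a node with no children.
        node, depth = root, 0
        while node.children:
            node = max(node.children, key=uct)
            depth += 1
        # Expansion: add candidate next steps unless the depth limit is hit.
        if depth < max_depth:
            node.children = [Node(s, parent=node) for s in propose(node.path())]
            if node.children:
                node = node.children[0]
        # Evaluation: score the partial reasoning path with the combined reward.
        reward = combine_rewards(*score(node.path()))
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Read out a reasoning path by following the highest mean value.
    node = root
    while node.children:
        node = max(node.children, key=Node.mean)
    return node.path()


def toy_propose(path: List[str]) -> List[str]:
    # Hypothetical stand-in for asking the model for candidate next steps.
    return ["good", "bad"]


def toy_score(path: List[str]) -> Tuple[float, float, float]:
    # Hypothetical stand-in for model self-assessment of the three signals.
    ok = all(step == "good" for step in path)
    return (1.0, 1.0, 1.0) if ok else (0.0, 0.0, 0.0)


print(search(toy_propose, toy_score))  # ['good', 'good', 'good']
```

Raising `iterations` or `max_depth` spends more compute at inference time on the same question, which is the knob a test-time scaling study would vary.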

Code & Models

Repositories

garyjiajia/vrest (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks