Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang

TL;DR
This paper introduces RCTS, a multimodal RAG framework that enhances large vision-language models by constructing a reasoning-enriched knowledge base and employing tree search re-ranking, leading to state-of-the-art VQA performance.
Contribution
The paper proposes a multimodal RAG framework that pairs a reasoning-enriched knowledge base with Monte Carlo Tree Search re-ranking to improve reasoning and response consistency in LVLMs.
Findings
Achieves state-of-the-art results on multiple VQA datasets.
Outperforms In-Context Learning and Vanilla-RAG methods.
Enhances reasoning and response consistency in LVLMs.
Abstract
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and…
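To make the tree-search re-ranking idea more concrete, here is a minimal sketch of a generic Monte Carlo Tree Search (UCT) that picks an ordered subset of k retrieved examples. It is not the authors' MCTS-HR: the function names (`mcts_rerank`, `uct_child`), the exploration constant, and especially `reward_fn` are assumptions made for illustration. The abstract does not specify the heuristic reward, so `reward_fn` stands in for whatever scoring the caller supplies (for example, a score derived from model responses), and is assumed to return values roughly in [0, 1].

```python
import math
import random


class Node:
    """One partial ordering of retrieved examples (a tuple of candidate indices)."""

    def __init__(self, order, parent=None):
        self.order = order
        self.parent = parent
        self.children = {}   # candidate index -> child Node
        self.visits = 0
        self.value = 0.0     # sum of rewards backed up through this node


def uct_child(node, c):
    """Return the child with the highest UCT score (mean reward + exploration bonus)."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts_rerank(n_candidates, k, reward_fn, iterations=200, c=1.4, seed=0):
    """Search for an ordered selection of k candidates using plain UCT.

    reward_fn is a hypothetical placeholder: it takes a tuple of candidate
    indices and returns a float score for that ordering.
    """
    k = min(k, n_candidates)
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and incomplete.
        while (len(node.order) < k
               and len(node.children) == n_candidates - len(node.order)):
            node = uct_child(node, c)
        # Expansion: append one untried candidate if the ordering is incomplete.
        if len(node.order) < k:
            untried = [i for i in range(n_candidates)
                       if i not in node.order and i not in node.children]
            choice = rng.choice(untried)
            child = Node(node.order + (choice,), parent=node)
            node.children[choice] = child
            node = child
        # Simulation: complete the ordering at random and score it.
        rest = [i for i in range(n_candidates) if i not in node.order]
        rng.shuffle(rest)
        rollout = node.order + tuple(rest[: k - len(node.order)])
        reward = reward_fn(rollout)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the re-ranked selection.
    node = root
    while node.children and len(node.order) < k:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return list(node.order)


def toy_reward(order, scores=(0.1, 0.9, 0.3, 0.7)):
    """Toy stand-in reward: position-discounted relevance, scaled to stay below 1."""
    return sum(scores[i] / (pos + 1) for pos, i in enumerate(order)) / 1.5


selected = mcts_rerank(n_candidates=4, k=2, reward_fn=toy_reward, iterations=500)
print(selected)
```

In a real pipeline the toy reward would be replaced by a signal tied to answer quality; the search loop itself does not depend on how that signal is computed.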
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
