Progressive Multimodal Reasoning via Active Retrieval
Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen

TL;DR
This paper introduces AR-MCTS, a framework that enhances multimodal large language models' reasoning abilities through active retrieval and Monte Carlo Tree Search, improving accuracy and diversity in complex tasks.
Contribution
The paper presents a novel AR-MCTS framework combining active retrieval and MCTS for progressive multimodal reasoning, with a process reward model for verification.
Findings
AR-MCTS improves performance on complex multimodal reasoning benchmarks.
The framework enhances reasoning diversity and reliability.
Experimental results confirm effectiveness across multiple models.
Abstract
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and finding effective ways to enhance their performance in such scenarios remains an unresolved issue. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). Our approach begins with the development of a unified retrieval module that retrieves key supporting insights for solving complex reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in automated multimodal reasoning verification, we employ the MCTS algorithm combined with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional…
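The abstract combines two ingredients: retrieving supporting insights at each reasoning step, and searching over candidate steps with MCTS. The following is a minimal sketch of how those two pieces could fit together, not the authors' implementation. The word-overlap retriever, the Node class, and the use of the UCT selection rule (a common choice for MCTS) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One reasoning step in the search tree (hypothetical structure)."""
    step: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def retrieve(query, corpus, k=1):
    """Toy retriever (assumption): rank corpus entries by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]


def uct(child, parent_visits, c=1.4):
    """UCT score: average reward plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf  # always try unvisited steps first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select(node):
    """Walk down the tree, picking the child with the highest UCT score at each level."""
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: uct(ch, parent_visits))
    return node


def backpropagate(node, reward):
    """Propagate a step-level reward from a leaf back up to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```

In a full loop, each expansion would call retrieve with the current partial solution as the query, pass the retrieved insights to the model when proposing the next step, score the step, and call backpropagate with that score. The per-node visit counts and values accumulated this way are one plausible source for step-wise annotations of the kind the abstract mentions.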
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
