Efficient Multimodal Planning Agent for Visual Question-Answering
Zhuo Chen, Xinyu Geng, Xinyu Wang, Yong Jiang, Zhen Zhang, Pengjun Xie, Kewei Tu

TL;DR
This paper introduces a multimodal planning agent for VQA that decides which retrieval steps each query actually needs, cutting search time by over 60% while maintaining or improving accuracy across six datasets.
Contribution
It presents a novel training method for a multimodal planning agent that optimizes the efficiency-effectiveness trade-off in VQA tasks by dynamically decomposing the mRAG pipeline.
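To make the idea of dynamically decomposing the pipeline concrete, here is a minimal Python sketch of plan-then-retrieve routing. It is an illustration under assumptions, not the authors' implementation: the four plan labels, the `run_pipeline` function, and the stub search and answer functions are all hypothetical names, and the planner is passed in as an arbitrary callable rather than a trained multimodal model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Plan(Enum):
    """Hypothetical plan labels; the source does not name the agent's outputs."""
    NO_RETRIEVAL = "no_retrieval"
    IMAGE_ONLY = "image_only"
    TEXT_ONLY = "text_only"
    IMAGE_AND_TEXT = "image_and_text"


@dataclass
class Query:
    image_id: str
    question: str


@dataclass
class Result:
    answer: str
    evidence: List[str]
    tool_calls: int  # number of retrieval calls actually made


def run_pipeline(
    query: Query,
    planner: Callable[[Query], Plan],
    image_search: Callable[[str], List[str]],
    text_search: Callable[[str], List[str]],
    answer: Callable[[Query, List[str]], str],
) -> Result:
    """Ask the planner once, then run only the retrieval steps it selected."""
    plan = planner(query)
    evidence: List[str] = []
    tool_calls = 0

    if plan in (Plan.IMAGE_ONLY, Plan.IMAGE_AND_TEXT):
        evidence.extend(image_search(query.image_id))
        tool_calls += 1
    if plan in (Plan.TEXT_ONLY, Plan.IMAGE_AND_TEXT):
        evidence.extend(text_search(query.question))
        tool_calls += 1

    return Result(answer(query, evidence), evidence, tool_calls)


# Stand-in tools so the sketch runs without any external service (assumptions).
def stub_image_search(image_id: str) -> List[str]:
    return [f"image:{image_id}"]


def stub_text_search(question: str) -> List[str]:
    return [f"text:{question}"]


def stub_answer(query: Query, evidence: List[str]) -> str:
    return f"answer to {query.question!r} using {len(evidence)} evidence item(s)"


if __name__ == "__main__":
    q = Query(image_id="img-1", question="Who painted this?")
    for plan in Plan:
        # A constant planner stands in for the trained agent.
        result = run_pipeline(q, lambda _q, p=plan: p,
                              stub_image_search, stub_text_search, stub_answer)
        print(plan.value, result.tool_calls, result.answer)
```

In a design like this, any saving comes from how often the planner can safely choose a plan that skips one or both retrieval calls; the sketch says nothing about how such a planner is trained.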
Findings
Reduces search time by over 60% compared to existing methods.
Outperforms baseline methods on six datasets.
Decreases costly tool calls while maintaining accuracy.
Abstract
Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate the inefficiency limitations while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant…
Peer Reviews
Decision·Submitted to ICLR 2026
Practical Significance: Addressing the inefficiency of rigid RAG pipelines is a highly relevant problem for real-world deployment of MLLMs. The reported >60% reduction in search time is a substantial practical improvement.
Strong Empirical Results: The approach balances efficiency with strong performance. It matches or outperforms the computationally expensive "$+k_{i,t}$" full RAG baseline while using a fraction of the retrieval calls.
Transferability: A strength of the
Dependency on Oracle for Training Data: Automated data annotation depends strongly on establishing whether a model can answer a question with or without RAG. This may be inherently noisy if the base model used for annotation has no clearly demarcated performance boundaries. The authors generate the gold query with a strong model, Qwen-72B, but biases in that model's knowledge boundary may propagate to the agent.
Comparison to Baseline for "Agents" is Limited: The main comparison to dy
- The paper identifies a real problem: existing mRAG pipelines suffer from multi-stage dependencies and redundant computations. This is a well-motivated research direction.
- The method is essentially a classifier that predicts which retrieval path to take for each VQA input. This makes it easy to implement and integrate into existing systems. The automatic data generation process also avoids expensive manual annotation.
- The experimental results show significant reduction in retrieval operation
- My main concern is the significance of the contribution. The proposed method is essentially training an LLM-based classifier for a four-way classification task, rather than a true multi-step planning agent. While effective in practice, the methodological innovation is relatively limited.
- The evaluation scope is somewhat narrow: 1) The main results rely solely on LLM Eval scores (0-100) without standard VQA metrics like accuracy, BLEU, or ROUGE. 2) Only Qwen-Max is used for evaluation, and th
S1: Targets a concrete, high-impact pain point in multimodal RAG for VQA, unnecessary image/text retrieval and inflated context length, by turning the pipeline into an adaptive one.
S2: Automated LLM-based data construction (visual query decomposition + correctness checking) enables building large supervision without heavy manual labeling.
S3: Broad evaluation on six heterogeneous VQA(-like) datasets, demonstrating generality across dynamic, knowledge-intensive, and easier visual tasks.
S4: S
W1: Despite the automated LLM-based data pipeline being a key enabler, the approach still relies heavily on LLM-generated supervision for both decomposition and correctness checking, but the paper does not quantify annotation noise or its impact.
W2: Although the evaluation covers six datasets and looks broad, it leans on LLM-based scoring rather than standard VQA metrics or human judgment, which weakens comparability to prior VQA work.
W3: While the method reports a 60%+ search-time reduction
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
