Towards Robust Multi-Modal Reasoning via Model Selection
Xiangyan Liu, Rongxue Li, Wei Ji, Tao Lin

TL;DR
This paper introduces the M^3 framework to improve model selection in multi-modal reasoning agents, enhancing robustness and handling subtask dependencies, supported by a new dataset MS-GQA and experimental validation.
Contribution
The paper proposes a novel plug-in framework for dynamic model selection in multi-modal agents, addressing the fragility caused by fixed model invocation strategies.
Findings
M^3 framework improves robustness of multi-modal agents
Dynamic model selection considers user inputs and subtask dependencies
Experimental results demonstrate enhanced reasoning performance
Abstract
The reasoning capabilities of LLMs (Large Language Models) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents. The LLM serves as the "brain" of the agent, orchestrating multiple tools for collaborative multi-step task solving. Unlike methods that invoke tools such as calculators or weather APIs for straightforward tasks, multi-modal agents excel by integrating diverse AI models for complex challenges. However, current multi-modal agents neglect the significance of model selection: they primarily focus on the planning and execution phases, and invoke only predefined task-specific models for each subtask, making the execution fragile. Meanwhile, traditional model selection methods are either incompatible with or suboptimal for multi-modal agent scenarios, because they ignore the dependencies among subtasks that arise from multi-step reasoning.…
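The contrast the abstract draws, between a fixed task-to-model mapping and selection that depends on the user input and on upstream subtasks, can be illustrated with a toy sketch. This is not the M^3 method: the model names, the candidate pool, `toy_score`, and the greedy per-subtask choice are all hypothetical stand-ins, and a real selector would presumably be learned rather than hand-written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    name: str                 # subtask type, e.g. "detect" or "vqa"
    depends_on: tuple = ()    # names of upstream subtasks in the plan


# Hypothetical candidate pool: several models per subtask type.
CANDIDATES = {
    "detect": ["detector_a", "detector_b"],
    "vqa": ["vqa_small", "vqa_large"],
}

# Fixed strategy: one predefined model per subtask type.
FIXED = {"detect": "detector_a", "vqa": "vqa_small"}


def fixed_selection(plan):
    """Ignore the input entirely and always call the predefined model."""
    return {task.name: FIXED[task.name] for task in plan}


def toy_score(query, task, model, upstream):
    """Hand-written stand-in for a learned selector (assumption)."""
    score = 0.0
    # Input-dependent preferences (made up for illustration).
    if task == "detect" and model == "detector_b" and "small" in query:
        score += 1.0
    if task == "vqa" and model == "vqa_large" and len(query.split()) > 8:
        score += 1.0
    # Dependency-aware preference: pretend vqa_large works better
    # when the upstream detection came from detector_b.
    if model == "vqa_large" and upstream.get("detect") == "detector_b":
        score += 0.5
    return score


def dynamic_selection(plan, query, score=toy_score):
    """Greedily pick a model per subtask, conditioning on the query and
    on the models already chosen for upstream subtasks.
    Assumes `plan` is listed in dependency (topological) order."""
    chosen = {}
    for task in plan:
        upstream = {dep: chosen[dep] for dep in task.depends_on}
        chosen[task.name] = max(
            CANDIDATES[task.name],
            key=lambda m: score(query, task.name, m, upstream),
        )
    return chosen


plan = [Subtask("detect"), Subtask("vqa", depends_on=("detect",))]
print(fixed_selection(plan))
print(dynamic_selection(plan, "what color is the small cup"))
```

In this toy setup the second call switches both models: the query triggers a different detector, and that upstream choice in turn changes the preferred VQA model, which is the kind of subtask dependency a per-subtask selector working in isolation would miss.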
Peer Reviews
Decision: ICLR 2024 poster
1. This paper adeptly formulates the model selection problem within multi-modal reasoning contexts and constructs the MS-GQA dataset.
2. The paper is well-founded in its pursuit to address the overlooked subtask dependencies in previous works. The proposed M^3 framework innovatively and effectively models the relationship between samples, selected models, and subtask dependencies.
3. The experiments conducted on MS-GQA demonstrate the efficiency and efficacy of the M^3 framework.

1. The primary concern is that model selection is a small part of multi-modal reasoning. It remains to be seen whether it is important for the entire task and how it can benefit real-world applications. The selection method proposed in this paper involves complex proxy training and may need to be more universally applicable or scalable for different reasoning tasks.
2. Lack of reproducibility: The paper must include crucial details, such as the LLM used. The constructed MS-GQA dataset is not ye
Identification of Critical Challenge: The paper recognizes and addresses a significant challenge in multi-modal agents, which is the selection of appropriate models for subtasks, a crucial aspect often overlooked in prior research.
Introduction of the M3 Framework: The paper presents the M3 framework, which aims to improve model selection by considering user inputs and subtask dependencies. The framework is designed with negligible runtime overhead at test-time, making it practical for real-wor

Limited Baseline Comparison: The paper could benefit from a more comprehensive comparison of the M3 framework with existing methods. While it claims to outperform traditional model selection methods, a detailed comparison with state-of-the-art techniques would provide a more robust evaluation.
Insufficient Experimental Discussion: The discussion of experimental results could be more in-depth. The paper does not thoroughly analyze the scenarios where the M3 framework performs exceptionally well
The paper provides a clear analysis of the challenges. Besides the method, the paper also provides a dataset as one of the contributions. The experimental results show significant improvements.
The method uses a heuristic selection process whose capacity relies on the pre-trained models themselves. How well does it generalize to zero-shot tasks?
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Focus
