MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Junpeng Yue, Xinrun Xu, Börje F. Karlsson, and Zongqing Lu

arXiv:2410.03450·cs.LG·May 23, 2025

Open Access 3 Reviews

TL;DR

This paper introduces MART, a novel method that fine-tunes multimodal large language models as retrievers for embodied agents, improving trajectory selection and task success in unseen environments through interaction-based preference learning and trajectory abstraction.

Contribution

It proposes a new paradigm of using fine-tuned MLLMs as retrievers for embodied agents, incorporating interaction data and trajectory abstraction to enhance multimodal retrieval.

Findings

1. Significant improvement in task success rates in unseen scenes.
2. Effective trajectory summarization with Trajectory Abstraction.
3. Enhanced understanding of trajectory milestones for better decision-making.

Abstract

MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at hand. To address this issue, we propose a novel method, MLLM As ReTriever (MART), which enhances the performance of embodied agents by utilizing interaction data to fine-tune an MLLM retriever based on preference learning, such that the retriever fully considers the effectiveness of trajectories and prioritizes them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that leverages MLLMs' summarization capabilities to represent trajectories with fewer tokens while preserving key information, enabling agents to better comprehend milestones in the trajectory. Experimental…
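
The preference-learning idea described in the abstract can be illustrated with a minimal sketch. It assumes a pairwise Bradley-Terry style objective, which is one common choice; the text here does not state the exact loss, and every function name below is hypothetical. Trajectories that led to task success in interaction are treated as preferred over those that did not, and the retriever's scalar scores are penalized whenever a rejected trajectory outranks a preferred one.

```python
import math
from itertools import product


def build_preference_pairs(outcomes):
    """Turn interaction outcomes into (preferred, rejected) trajectory pairs.

    `outcomes` is a list of (trajectory_id, succeeded) tuples. Using task
    success as the preference signal is an assumption of this sketch.
    """
    good = [t for t, ok in outcomes if ok]
    bad = [t for t, ok in outcomes if not ok]
    return list(product(good, bad))


def pairwise_preference_loss(scores, pairs):
    """Mean Bradley-Terry negative log-likelihood, -log(sigmoid(s_w - s_l)).

    `scores` maps trajectory_id to the retriever's scalar relevance score.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for preferred, rejected in pairs:
        margin = scores[preferred] - scores[rejected]
        # Numerically stable softplus(-margin).
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(pairs)


def rank_trajectories(scores):
    """Return trajectory ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    outcomes = [("a", True), ("b", False), ("c", False)]
    pairs = build_preference_pairs(outcomes)
    scores = {"a": 2.0, "b": 0.1, "c": -1.0}
    print(pairwise_preference_loss(scores, pairs))
    print(rank_trajectories(scores))
```

In an actual fine-tuning setup the scores would come from the MLLM retriever and the loss would be minimized by gradient descent; here the scores are plain numbers so that the objective itself is easy to inspect.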

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper proposes a unique method for enhancing multimodal retrieval in embodied agents by leveraging preference learning and trajectory abstraction. This approach addresses a critical gap in current retrieval methods that often focus on surface-level similarities.
2. The introduction of Trajectory Abstraction is a valuable contribution. It effectively reduces the complexity of trajectories while maintaining essential information.
3. The experimental results are comprehensive and demonstrate

Weaknesses

1. While the paper demonstrates the effectiveness of MART in several environments, the scope of experiments could be expanded to include more diverse and challenging scenarios to further validate the robustness of the method.
2. The paper does not provide a detailed analysis of the computational complexity and resource requirements of the proposed method. This information is crucial for practical implementation and comparison with existing methods.
3. The paper mentions the use of interaction da

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper proposes a new retrieval-augmented MLLM agent, which finds the most matched expert reference trajectory to enhance current trajectory planning. To compress these multimodal trajectories with lots of images, actions, and feedback, Trajectory Abstraction (GPT-4o) is introduced to identify important trajectory milestones.
2. The experimental results demonstrate the effectiveness of MART in improving task success rates across different environments.

Weaknesses

I am not an expert in this field, but I have several problems:

1. This work aims to tackle the challenge of grounding in environments. However, the literature review focuses more on embodied agents, memory retrieval in agents, and MLLM. These areas encompass a wide range of topics and, as a result, **overlook numerous specific and significant studies concerning how past embodied grounding approaches have tackled this issue**. In summary, I am unable to identify key references within this m

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The problem addressed in this paper, decision-making in embodied tasks, is a crucial area.
2. The writing in this paper is clear and well-structured, with each paragraph presenting ideas in a logical, cohesive manner. The authors effectively communicate complex concepts, making the paper accessible and easy to follow.
3. This paper introduces an innovative multimodal retrieval tool and selects appropriate baselines to validate the effectiveness of the proposed method.

Weaknesses

1. The methodology in this paper shows some similarities with previous works, such as LLM-Planner (arXiv:2212.04088) and P-RAG (arXiv:2409.11279). While there are innovations in the retriever design, the overall novelty of the approach is somewhat reduced.
2. The paper lacks some detailed analysis and deeper insights. For instance, the ablation study does not include experiments explaining HOW Trajectory Abstraction contributes to performance improvement.

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies