Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation

Lingfeng Zhang; Yuecheng Liu; Zhanguang Zhang; Matin Aghaei; Yaochen Hu; Hongjian Gu; Mohammad Ali Alomrani; David Gamaliel Arcos Bravo; Raika Karimi; Atia Hamidizadeh; Haoping Xu; Guowei Huang; Zhanpeng Zhang; Tongtong Cao; Weichao Qiu; Xingyue Quan; Jianye Hao; Yuzheng Zhuang; Yingxue Zhang

arXiv:2502.14254 · cs.RO · June 12, 2025

Open Access

TL;DR

Mem2Ego introduces a novel framework that combines global memory retrieval with egocentric visual inputs, significantly improving spatial reasoning and decision-making in long-horizon embodied navigation tasks.

Contribution

The paper presents a new vision-language navigation approach that adaptively integrates global memory with local observations, addressing limitations of previous methods in complex environments.

Findings

01. Outperforms previous state-of-the-art in object navigation tasks

02. Enhances spatial reasoning through global-to-ego memory integration

03. Demonstrates scalability and effectiveness in complex environments

Abstract

Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools in embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration in unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process ego-centric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In…
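The abstract is truncated before it describes the method, so the following is only a generic illustration of the geometric idea behind "global-to-ego": re-expressing a landmark stored in a global (world-frame) memory relative to the agent's current pose. It is a standard 2D rigid transform, not the authors' pipeline. The `Landmark` class, the function names, and the flat 2D pose (x, y, yaw) are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Landmark:
    """Hypothetical global-memory entry: a labelled point in the world frame."""
    label: str
    x: float  # world-frame x, metres
    y: float  # world-frame y, metres


def world_to_ego(lm: Landmark, agent_x: float, agent_y: float, agent_yaw: float):
    """Return (forward, left) offsets of a landmark in the agent's frame.

    agent_yaw is the agent's heading in radians, measured counter-clockwise
    from the world +x axis. This is a plain 2D rotation plus translation.
    """
    dx = lm.x - agent_x
    dy = lm.y - agent_y
    cos_y, sin_y = math.cos(agent_yaw), math.sin(agent_yaw)
    forward = cos_y * dx + sin_y * dy   # component along the heading
    left = -sin_y * dx + cos_y * dy     # component to the agent's left
    return forward, left


def describe(lm: Landmark, agent_x: float, agent_y: float, agent_yaw: float) -> str:
    """Format the landmark as distance and signed bearing (positive = left)."""
    forward, left = world_to_ego(lm, agent_x, agent_y, agent_yaw)
    distance = math.hypot(forward, left)
    bearing_deg = math.degrees(math.atan2(left, forward))
    return f"{lm.label}: {distance:.1f} m, bearing {bearing_deg:+.0f} deg"


if __name__ == "__main__":
    sofa = Landmark("sofa", 3.0, 4.0)
    # Agent at the origin, facing along world +x.
    print(describe(sofa, 0.0, 0.0, 0.0))
```

Unlike a purely verbal description of the map ("the sofa is in the living room"), the ego-frame offsets keep distance and direction, which is the kind of geometric information the abstract says is lost in language-based representations.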

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques