3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Wenbo Hu; Yining Hong; Yanjun Wang; Leison Gao; Zibu Wei; Xingcheng Yao; Nanyun Peng; Yonatan Bitton; Idan Szpektor; Kai-Wei Chang

arXiv:2505.22657·cs.CV·December 18, 2025

Open Access

TL;DR

This paper introduces 3DLLM-Mem, a novel long-term spatial-temporal memory model for embodied 3D language tasks, along with a comprehensive benchmark, 3DMem-Bench, to evaluate long-term reasoning in dynamic environments.

Contribution

The paper presents a new memory model for LLMs that enhances long-term reasoning in 3D environments and introduces a large benchmark for evaluation.

Findings

01. 3DLLM-Mem outperforms baselines by 16.5% in success rate.

02. The benchmark includes over 26,000 trajectories and 2,892 tasks.

03. The model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse spatial-temporal information from memory.

Abstract

Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful…
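
The query-based fusion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `fuse_memory`, `working_tokens`, and `memory_tokens` are hypothetical, the memory bank is assumed to be a plain list of stored feature vectors, and the learned projections a real model would apply are omitted. Each working memory token acts as a query, scores every stored token by scaled dot product, and adds the attention-weighted summary back to itself.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_memory(working_tokens, memory_tokens):
    """Hypothetical sketch: each working token (query) attends over the
    stored memory tokens (used as both keys and values), and the attended
    summary is added back to the query as a residual."""
    dim = len(working_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for query in working_tokens:
        # Attention weights: how relevant each stored token is to this query.
        weights = softmax([dot(query, mem) * scale for mem in memory_tokens])
        # Weighted sum of stored tokens, one value per feature dimension.
        summary = [
            sum(w * mem[i] for w, mem in zip(weights, memory_tokens))
            for i in range(dim)
        ]
        fused.append([q + s for q, s in zip(query, summary)])
        all_weights.append(weights)
    return fused, all_weights

# Toy data (assumed for illustration): two current-observation tokens
# and three stored memory tokens, all 2-dimensional.
working = [[1.0, 0.0], [0.0, 1.0]]
memory = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
fused, weights = fuse_memory(working, memory)
```

In this toy setup the first working token places most of its attention on the first stored token, since they point in the same direction, which is the "selective" behavior the abstract describes. A trained model would learn separate query, key, and value projections rather than using raw features.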

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Topic Modeling

Methods: Focus