WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo; Kangsan Kim; Jaehong Yoon; Sung Ju Hwang

arXiv:2512.02425·cs.CV·March 30, 2026

TL;DR

WorldMM is a multimodal memory agent that improves long video reasoning by integrating visual and textual memories across multiple temporal scales, outperforming state-of-the-art baselines by an average of 8.4% across five long video question-answering benchmarks.

Contribution

It introduces a novel multimodal memory architecture with adaptive retrieval for long video question answering, addressing limitations of previous text-only or fixed-scale memory methods.

Findings

01. Achieves an average 8.4% performance improvement over state-of-the-art baselines.

02. Effectively utilizes multiple memory types and temporal scales for complex scene reasoning.

03. Demonstrates strong results across five long video question-answering benchmarks.

Abstract

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across…
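
As a rough illustration of the idea described in the abstract (retrieving from memories that hold both textual and visual entries at more than one temporal scale), the following is a minimal Python sketch. It is not the WorldMM implementation. The `MemoryEntry` type, the scale names "30s" and "10min", the token-overlap score, and the sample entries are all hypothetical choices made for this example; a real system would presumably use learned embeddings and actual frames rather than word overlap and frame descriptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """One stored memory item (hypothetical structure, not from the paper)."""
    start: float    # segment start time in seconds
    end: float      # segment end time in seconds
    modality: str   # "text" or "visual" (illustrative labels)
    content: str    # a caption, or a text stand-in for a stored frame


def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization."""
    return set(text.lower().split())


def score(query: str, entry: MemoryEntry) -> float:
    """Jaccard token overlap, a toy stand-in for embedding similarity."""
    q, e = tokenize(query), tokenize(entry.content)
    union = q | e
    return len(q & e) / len(union) if union else 0.0


def retrieve(query: str, memories: dict[str, list[MemoryEntry]], top_k: int = 1):
    """Search every temporal scale and return the best (scale, entry, score) tuples."""
    scored = [
        (scale, entry, score(query, entry))
        for scale, entries in memories.items()
        for entry in entries
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hypothetical sample memory: fine-grained entries and one coarse summary.
MEMORIES = {
    "30s": [
        MemoryEntry(0, 30, "text", "a person opens the fridge"),
        MemoryEntry(30, 60, "visual", "red mug on the kitchen counter"),
    ],
    "10min": [
        MemoryEntry(0, 600, "text", "the group cooks dinner together in the kitchen"),
    ],
}

if __name__ == "__main__":
    for question in ("where is the red mug", "group cooks dinner"):
        scale, entry, value = retrieve(question, MEMORIES)[0]
        print(f"{question!r}: scale={scale}, modality={entry.modality}, score={value:.3f}")
```

In this toy setup, a question about a small visual detail matches the fine-grained visual entry, while a question about a longer activity matches the coarse summary. Ranking all scales together means the scale does not have to be fixed in advance, which is the flexibility that the abstract says fixed-scale retrieval lacks.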

Code & Models

Repositories

wgcyeo/WorldMM (GitHub)
