Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Gengyuan Zhang; Mingcong Ding; Tong Liu; Yao Zhang; Volker Tresp

arXiv:2502.15457·cs.CV·February 24, 2025

TL;DR

This paper shows that using past events as memory helps multimodal large language models understand streaming video events, but that memory built from predictions of earlier events can contain misinformation and lead to confabulation. It proposes a confabulation-aware memory modification method to mitigate this.

Contribution

The paper introduces a confabulation-aware memory modification method that mitigates confabulated memory, improving memory-enhanced event understanding in MLLMs.

Findings

01

Memory improves video event understanding in MLLMs.

02

Memories built from predictions of preceding events may contain misinformation, leading to confabulation and degraded performance.

03

The proposed method mitigates confabulation effects.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which a video is treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as context helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.
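
The memory loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not specify how memory is modified, so the confidence score, the `min_confidence` threshold, and the names `MemoryEntry`, `understand_stream`, and `toy_predict` are all assumptions made for illustration. The sketch only shows how each event's prediction becomes memory for later events, and how one plausible filter could keep unreliable predictions out of the context.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MemoryEntry:
    event_index: int
    description: str   # predicted description of a past event
    confidence: float  # hypothetical reliability score in [0, 1]


# Hypothetical predictor interface: (current event, memory context)
# -> (predicted description, confidence). A real system would call an MLLM here.
Predictor = Callable[[str, List[str]], Tuple[str, float]]


def understand_stream(
    events: List[str],
    predict: Predictor,
    min_confidence: float = 0.5,
) -> Tuple[List[str], List[MemoryEntry]]:
    """Process events in order, using earlier predictions as memory.

    Assumption: memories below `min_confidence` are treated as possibly
    confabulated and are withheld from the context. This stands in for
    the paper's memory modification step, which the abstract does not detail.
    """
    memory: List[MemoryEntry] = []
    outputs: List[str] = []
    for index, event in enumerate(events):
        # Build the context only from memories that pass the filter.
        context = [m.description for m in memory if m.confidence >= min_confidence]
        description, confidence = predict(event, context)
        outputs.append(description)
        # The prediction itself, right or wrong, becomes memory for later events.
        memory.append(MemoryEntry(index, description, confidence))
    return outputs, memory


def toy_predict(event: str, context: List[str]) -> Tuple[str, float]:
    """Stand-in predictor: reports low confidence on 'blurry' events."""
    confidence = 0.2 if "blurry" in event else 0.9
    return f"{event} (given {len(context)} memories)", confidence
```

With the toy predictor, a low-confidence prediction for the second event is stored but not passed on as context for the third; setting `min_confidence=0.0` disables the filter, so every earlier prediction, reliable or not, reaches later events.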
