Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Xingjian Diao; Chunhui Zhang; Weiyi Wu; Zhongyu Ouyang; Peijun Qing; Ming Cheng; Soroush Vosoughi; Jiang Gui

arXiv:2502.06020·cs.CV·February 11, 2025

TL;DR

This paper introduces a temporal working memory (TWM) module that helps multimodal foundation models process long video and audio sequences by retaining only the segments relevant to the query, improving performance on video captioning, question answering, and retrieval tasks.

Contribution

The paper proposes a novel, plug-and-play temporal working memory module that enhances existing multimodal models' temporal understanding by focusing on task-relevant segments using query-guided attention.

Findings

01. Significant performance improvements on video captioning, question answering, and retrieval tasks.

02. Effective integration of TWM into nine state-of-the-art models.

03. Enhanced handling of complex, time-sensitive multimodal data.

Abstract

Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content,…
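The query-guided selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each temporal segment and the query are already embedded as fixed-length vectors, scores segments with a scaled dot product followed by a softmax, and keeps the top-k segments in temporal order. The names `softmax`, `select_segments`, and the parameter `k` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_segments(
    query: Sequence[float],
    segments: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[int], List[float]]:
    """Return indices of the k segments most relevant to the query.

    Assumption (not from the paper): relevance is a scaled dot product
    between the query embedding and each segment embedding.
    """
    if not segments:
        raise ValueError("segments must not be empty")
    dim = len(query)
    # Scaled dot-product score between the query and each segment.
    scores = [
        sum(q * x for q, x in zip(query, seg)) / math.sqrt(dim)
        for seg in segments
    ]
    weights = softmax(scores)
    # Clamp k to a valid range, then pick the highest-weight segments.
    k = max(1, min(k, len(segments)))
    ranked = sorted(range(len(segments)), key=lambda i: weights[i], reverse=True)
    # Restore temporal order so downstream models see segments in sequence.
    kept = sorted(ranked[:k])
    return kept, weights


if __name__ == "__main__":
    query = [1.0, 0.0]
    segments = [[0.1, 0.9], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
    kept, weights = select_segments(query, segments, k=2)
    print(kept)
```

Returning the kept indices in temporal order, rather than in score order, is a design choice of this sketch so that a downstream model still receives a chronologically coherent sequence.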

Code & Models

Repositories

xid32/naacl_2025_twm
PyTorch · Official

Videos

Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding · underline

Taxonomy

Topics: Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Focus