Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding
Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui

TL;DR
This paper introduces a temporal working memory module that improves multimodal foundation models' ability to process extended temporal sequences in videos and audio by selectively retaining relevant information, leading to significant performance gains.
Contribution
The paper proposes a novel, plug-and-play temporal working memory module that enhances existing multimodal models' temporal understanding by focusing on task-relevant segments using query-guided attention.
Findings
Significant performance improvements on video captioning, question answering, and retrieval tasks.
Effective integration of the temporal working memory (TWM) module into nine state-of-the-art models.
Enhanced handling of complex, time-sensitive multimodal data.
Abstract
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content,…
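The query-guided selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that TWM uses query-guided attention to keep the most informative segments, so the scaled dot-product scoring, the softmax weighting, the top-k cutoff, and the names `select_segments`, `query`, `segments`, and `k` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_segments(query, segments, k):
    """Hypothetical query-guided selection (an assumption, not the paper's code).

    Scores each segment embedding against the query with a scaled dot
    product, turns the scores into attention weights, and keeps the k
    highest-weighted segments. Returned indices are in temporal order.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * s for q, s in zip(query, seg)) / scale
        for seg in segments
    ]
    weights = softmax(scores)
    top = sorted(range(len(segments)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top), weights


# Toy example: four 2-D segment embeddings and one query embedding.
query = [1.0, 0.0]
segments = [[0.1, 0.9], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
kept, weights = select_segments(query, segments, k=2)
print(kept)  # segments 1 and 3 align best with the query
```

In this toy setup only the retained segments would be passed on to the downstream model, which is the sense in which a memory of this kind reduces the temporal input to its task-relevant part.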
Taxonomy
Topics: Speech and dialogue systems
Methods: Softmax · Attention Is All You Need · Focus
