Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

Sosuke Yamao; Natsuki Miyahara; Yuankai Qi; Shun Takeuchi

arXiv:2603.15167·cs.CV·March 17, 2026

Open Access

TL;DR

This paper introduces QViC-MF, a feedback-driven framework for long-term video understanding that compresses visual information under question guidance and reports improvements on MLVU test, LVBench, and VNBench Long.

Contribution

The paper proposes a question-guided visual compression mechanism with memory feedback that improves long-video understanding over existing methods.

Findings

01: Achieves a 6.1% improvement on MLVU test

02: Achieves an 8.3% improvement on LVBench

03: Achieves an 18.3% improvement on VNBench Long

Abstract

In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedback-driven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve…
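
The contrast the abstract draws, between compressing each frame independently and letting stored context feed back into perception, can be illustrated with a toy sketch. This is not the authors' QMSA, whose details are not given here. The dot-product relevance scoring, the top-k token selection, the fixed-size first-in-first-out memory, and the names `compress_frame` and `process_video` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_frame(frame_tokens, question, memory, keep):
    """Keep the `keep` frame tokens most relevant to question and memory.

    frame_tokens: (n, d) visual tokens of the current frame
    question:     (q, d) question token embeddings
    memory:       (m, d) tokens kept from earlier frames (m may be 0)
    Assumes keep <= n. Hypothetical helper, not the paper's module.
    """
    # Feedback step: memory tokens act as queries alongside the question,
    # so what was seen earlier influences what is kept now.
    queries = np.concatenate([question, memory], axis=0)
    d = frame_tokens.shape[1]
    attn = softmax(queries @ frame_tokens.T / np.sqrt(d), axis=-1)  # (q+m, n)
    relevance = attn.mean(axis=0)                                   # (n,)
    top = np.sort(np.argsort(relevance)[-keep:])  # preserve token order
    return frame_tokens[top]

def process_video(frames, question, keep=4, memory_size=16):
    """Compress frames one by one, feeding the memory back each step."""
    d = question.shape[1]
    memory = np.zeros((0, d))
    for frame_tokens in frames:
        kept = compress_frame(frame_tokens, question, memory, keep)
        # Assumed policy: drop the oldest tokens once the memory is full.
        memory = np.concatenate([memory, kept], axis=0)[-memory_size:]
    return memory
```

Passing an empty `memory` to `compress_frame` for every frame recovers the independent per-frame baseline, which makes the two schemes easy to compare in this simplified setting.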

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques