IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs
Sosuke Yamao, Natsuki Miyahara, Yuki Harazono, Shun Takeuchi

TL;DR
IQViC introduces a transformer-based visual compressor that enhances long-term video understanding by selectively compressing relevant information conditioned on questions, improving accuracy and memory efficiency.
Contribution
The paper proposes IQViC, a novel question-adaptive visual compressor that enables efficient long-term video understanding by reducing memory requirements and focusing on relevant content.
Findings
Outperforms existing methods in accuracy on long-term video QA tasks.
Reduces memory token usage significantly compared to full-feature methods.
Effective on a new InfiniBench-based dataset and standard benchmarks.
Abstract
With the increasing complexity of video data and the need for more efficient long-term temporal understanding, existing long-term video understanding methods often fail to accurately capture and analyze extended video sequences. These methods typically struggle to maintain performance over longer durations and to handle the intricate dependencies within the video content. To address these limitations, we propose a simple yet effective large multi-modal model framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC). The key idea, inspired by humans' selective attention and in-context memory mechanisms, is to introduce a novel visual compressor and incorporate efficient memory management techniques to enhance long-term video question answering. Our framework utilizes IQViC, a transformer-based…
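The idea of compressing visual information conditioned on the question can be illustrated with a toy example. The sketch below is not the IQViC architecture, which the abstract describes only as transformer-based. It is a minimal, assumed stand-in that pools one frame's visual tokens into a single vector using softmax-normalized, scaled dot-product similarity to a question embedding. The function names `softmax` and `compress_frame` and all toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_frame(question_vec, frame_tokens):
    """Pool a frame's visual tokens into one vector.

    Each token is weighted by its scaled dot-product similarity to the
    question embedding, so question-relevant tokens dominate the result.
    This is an illustrative assumption, not the paper's compressor.
    """
    d = len(question_vec)
    scores = [
        sum(q * t for q, t in zip(question_vec, tok)) / math.sqrt(d)
        for tok in frame_tokens
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * tok[i] for w, tok in zip(weights, frame_tokens))
        for i in range(d)
    ]
    return pooled, weights


# Toy data: three 2-dimensional visual tokens; the question points along axis 0.
question = [4.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pooled, weights = compress_frame(question, tokens)
```

In this toy setting the first token, which aligns with the question, receives the largest weight, and three tokens are reduced to one. A learned compressor would replace the fixed dot product with trained attention layers.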
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need
