IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs

Sosuke Yamao; Natsuki Miyahara; Yuki Harazono; Shun Takeuchi

arXiv:2412.09907·cs.CV·December 17, 2024

TL;DR

IQViC introduces a transformer-based visual compressor that enhances long-term video understanding by selectively compressing relevant information conditioned on questions, improving accuracy and memory efficiency.

Contribution

The paper proposes IQViC, a novel question-adaptive visual compressor that enables efficient long-term video understanding by reducing memory requirements and focusing on relevant content.
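
As a rough illustration of what question-conditioned compression can look like, the sketch below pools each frame's visual tokens with a single softmax attention step in which the question tokens act as queries. This is an assumption-laden stand-in, not the IQViC architecture: the function name `compress_with_question`, the single-head, projection-free attention, the token counts, and the random features are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def compress_with_question(visual_tokens, question_tokens):
    """Hypothetical compressor: question tokens attend over visual tokens.

    visual_tokens:   (n_vis, d) features for one frame
    question_tokens: (n_q, d) features for the question
    Returns (n_q, d) compressed tokens and the (n_q, n_vis) attention weights.
    """
    d = visual_tokens.shape[-1]
    scores = question_tokens @ visual_tokens.T / np.sqrt(d)  # (n_q, n_vis)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ visual_tokens, weights

# Toy run with random features (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
question = rng.normal(size=(4, 32))      # 4 question tokens
memory = []
for _ in range(10):                      # 10 frames
    frame = rng.normal(size=(256, 32))   # 256 visual tokens per frame
    compressed, _ = compress_with_question(frame, question)
    memory.append(compressed)            # store only the compressed tokens
memory = np.concatenate(memory, axis=0)
print(memory.shape)  # (40, 32)
```

In this toy setting the memory holds 40 tokens instead of the 2,560 that storing every visual token would require, which is the kind of saving a question-adaptive compressor aims for.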

Findings

1. Outperforms existing methods in accuracy on long-term video QA tasks.

2. Reduces memory token usage significantly compared to full feature methods.

3. Effective on a new InfiniBench-based dataset and standard benchmarks.

Abstract

With the increasing complexity of video data and the need for more efficient long-term temporal understanding, existing long-term video understanding methods often fail to accurately capture and analyze extended video sequences. These methods typically struggle to maintain performance over longer durations and to handle the intricate dependencies within the video content. To address these limitations, we propose a simple yet effective large multi-modal model framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC). The key idea, inspired by humans' selective attention and in-context memory mechanisms, is to introduce a novel visual compressor and incorporate efficient memory management techniques to enhance long-term video question answering. Our framework utilizes IQViC, a transformer-based…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need