Hierarchical Memory for Long Video QA

Yiqin Wang; Haoji Zhang; Yansong Tang; Yong Liu; Jiashi Feng; Jifeng Dai; Xiaojie Jin

arXiv:2407.00603·cs.CV·December 17, 2024

TL;DR

This paper presents a hierarchical memory approach called STAR Memory for efficient long video question-answering, enabling processing of lengthy videos with limited GPU memory while maintaining high accuracy.

Contribution

We adopt the STAR Memory mechanism from Flash-VStream for long video QA and fine-tune the released Flash-VStream weights on the MovieChat-1K training set, achieving 1st place in Track 1 (Long Video VQA) of the LOVEU Challenge @ CVPR'24.

Findings

01. Achieved 1st place in the LOVEU Challenge @ CVPR'24

02. Effectively reduces memory and latency in long video processing

03. Maintains high QA accuracy with limited GPU resources

Abstract

This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page
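To make the idea of compressing visual tokens under a fixed memory budget concrete, here is a minimal sketch in plain Python. It is not the STAR Memory implementation from Flash-VStream; it only illustrates one generic way to keep memory size constant as more frames arrive, by merging the two most similar neighbouring entries whenever a budget is exceeded. The class name `BoundedFrameMemory`, the `capacity` parameter, and the use of cosine similarity with count-weighted averaging are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BoundedFrameMemory:
    """Hypothetical fixed-size memory of per-frame features (illustration only)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots: List[Vector] = []   # one feature vector per memory slot
        self.counts: List[int] = []     # how many frames each slot summarizes

    def add(self, feature: Vector) -> None:
        """Append a new frame feature, then shrink back to the budget if needed."""
        self.slots.append(list(feature))
        self.counts.append(1)
        if len(self.slots) > self.capacity:
            self._merge_most_similar_neighbours()

    def _merge_most_similar_neighbours(self) -> None:
        # Pick the adjacent pair with the highest cosine similarity
        # (ties resolve to the earliest pair).
        best = max(
            range(len(self.slots) - 1),
            key=lambda i: cosine(self.slots[i], self.slots[i + 1]),
        )
        ca, cb = self.counts[best], self.counts[best + 1]
        # Count-weighted average, so a slot covering many frames is not
        # dominated by a single new frame.
        merged = [
            (ca * x + cb * y) / (ca + cb)
            for x, y in zip(self.slots[best], self.slots[best + 1])
        ]
        self.slots[best:best + 2] = [merged]
        self.counts[best:best + 2] = [ca + cb]


if __name__ == "__main__":
    memory = BoundedFrameMemory(capacity=3)
    for frame in [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]:
        memory.add(frame)
    print(len(memory.slots), memory.counts)
```

In this sketch the number of stored vectors never exceeds `capacity`, so the memory footprint is independent of video length, while redundant neighbouring frames are the first to be collapsed. A real system would operate on GPU tensors of visual tokens rather than Python lists.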

Taxonomy

Topics: Advanced Data Compression Techniques · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training