Hierarchical Memory for Long Video QA
Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin

TL;DR
This paper presents a hierarchical memory approach called STAR Memory for efficient long video question-answering, enabling processing of lengthy videos with limited GPU memory while maintaining high accuracy.
Contribution
We adopt the STAR Memory mechanism from Flash-VStream for long video QA and fine-tune the released pretrained weights on the MovieChat-1K training set, achieving 1st place in Track 1 (Long Video VQA) of the LOVEU Challenge @ CVPR'24.
Findings
Achieved 1st place in the LOVEU Challenge @ CVPR'24
Compresses visual tokens to reduce memory footprint and decoding latency in long video processing
Maintains high QA accuracy with limited GPU resources
Abstract
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency, while preserving the essential information for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further utilize the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available on the project homepage: https://invinciblewyq.github.io/vstream-page .
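To make the idea of compressing a long stream of visual tokens into a fixed budget more concrete, here is a minimal sketch in plain Python. It is not the STAR Memory implementation from Flash-VStream; it is a generic illustration that assumes each frame has already been reduced to a single feature vector. The class name BoundedFrameMemory, the capacity parameter, and the merge rule (averaging the two most similar adjacent slots) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


def _sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class BoundedFrameMemory:
    """Toy fixed-capacity memory over per-frame feature vectors.

    Illustrative assumption only: this is NOT the STAR Memory mechanism.
    When the memory overflows, the two most similar adjacent slots are
    merged into their count-weighted average, so storage never exceeds
    `capacity` slots no matter how long the video is.
    """

    capacity: int
    slots: List[List[float]] = field(default_factory=list)
    counts: List[int] = field(default_factory=list)  # frames merged per slot

    def __post_init__(self) -> None:
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")

    def add(self, feature: List[float]) -> None:
        """Append one frame feature, then compress if over capacity."""
        self.slots.append(list(feature))
        self.counts.append(1)
        if len(self.slots) > self.capacity:
            self._merge_closest_adjacent_pair()

    def _merge_closest_adjacent_pair(self) -> None:
        # Index of the adjacent pair (i, i + 1) with the smallest distance.
        i = min(
            range(len(self.slots) - 1),
            key=lambda k: _sq_dist(self.slots[k], self.slots[k + 1]),
        )
        na, nb = self.counts[i], self.counts[i + 1]
        merged = [
            (na * x + nb * y) / (na + nb)
            for x, y in zip(self.slots[i], self.slots[i + 1])
        ]
        # Replace the two slots with one merged slot, keeping temporal order.
        self.slots[i:i + 2] = [merged]
        self.counts[i:i + 2] = [na + nb]


if __name__ == "__main__":
    memory = BoundedFrameMemory(capacity=3)
    for t in range(10):  # ten synthetic "frames"
        memory.add([float(t), 0.0])
    print(len(memory.slots), memory.counts)
```

In this sketch the memory cost depends only on capacity, not on the number of frames seen, which is the property that lets a long video fit in limited VRAM. A hierarchical design would presumably keep several such memories at different granularities, but that is outside what this example shows.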
Taxonomy
Topics: Advanced Data Compression Techniques · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
