Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi

TL;DR
This paper introduces a training-free, attention-based token selection method for streaming Video-LLMs that discards up to ~95% of visual tokens with minimal performance loss, enabling timely question answering over long videos processed online.
Contribution
It presents a training-free approach, compatible with standard Video-LLMs, that combines LLM-informed visual token selection, recurrent processing of previously selected tokens, and caption-based question answering for efficient streaming video analysis.
Findings
Discards up to ~95% of unimportant visual tokens with minimal performance loss
Achieves state-of-the-art results on streaming video benchmarks
Balances efficiency and effectiveness in real-time video understanding
Abstract
Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and that contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate a temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves…
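The first two concepts in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `select_tokens`, `process_stream`, `score_fn`, and `memory_cap` are hypothetical, the per-token attention scores are assumed to be supplied by the Video-LLM, and the fixed-size memory is an assumption made only to keep the example bounded. The default `keep_ratio` of 0.05 mirrors the "up to ~95% discarded" figure. Caption-based question answering is left out.

```python
from typing import Callable, List, Sequence, Tuple

Token = str  # stand-in type; a real system would carry embedding vectors

# Hypothetical scorer: given the carried-over memory and the current clip's
# tokens, return one attention-derived importance score per clip token.
ScoreFn = Callable[[List[Token], List[Token]], List[float]]


def select_tokens(scores: Sequence[float], keep_ratio: float = 0.05) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(scores)
    if n == 0:
        return []
    k = max(1, int(n * keep_ratio))  # always keep at least one token
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore original token order


def process_stream(
    clips: Sequence[List[Token]],
    score_fn: ScoreFn,
    keep_ratio: float = 0.05,
    memory_cap: int = 1024,  # assumption: oldest tokens are dropped first
) -> Tuple[List[List[Token]], List[Token]]:
    """Process clips in order, feeding past selected tokens back in each step."""
    if memory_cap < 1:
        raise ValueError("memory_cap must be at least 1")
    memory: List[Token] = []
    kept_per_clip: List[List[Token]] = []
    for clip in clips:
        # The scorer sees previously selected tokens together with the new clip.
        scores = score_fn(memory, clip)
        if len(scores) != len(clip):
            raise ValueError("score_fn must return one score per clip token")
        kept = [clip[i] for i in select_tokens(scores, keep_ratio)]
        kept_per_clip.append(kept)
        memory = (memory + kept)[-memory_cap:]
    return kept_per_clip, memory
```

In practice `score_fn` would wrap a forward pass of the Video-LLM over the memory tokens plus the current clip and aggregate the attention each visual token receives; how that aggregation is done is not specified in the text above, so it is left as a pluggable callable here.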
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
