Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

Vaggelis Dorovatas; Soroush Seifi; Gunshi Gupta; Rahaf Aljundi

arXiv:2510.17364 · cs.CV · October 21, 2025

TL;DR

This paper introduces a training-free, attention-based token selection method for streaming Video-LLMs that discards up to ~95% of visual tokens with minimal performance loss in online video understanding.

Contribution

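It presents a training-free approach, compatible with standard Video-LLMs, that combines LLM-informed visual token selection, recurrent processing of previously selected tokens, and caption-based question answering to enable efficient streaming video analysis with minimal performance loss.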

Findings

01 Discards up to ~95% of visual tokens judged unimportant by the LLM's attention, with minimal performance loss

02 Achieves state-of-the-art results on streaming video benchmarks

03 Balances efficiency and effectiveness in real-time video understanding

Abstract

Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and that contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate a temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves…
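The first two concepts in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attention layout (`attn[q][v]`, assumed already averaged over heads and layers), the names `select_tokens` and `process_stream`, the `attend` callback standing in for a Video-LLM forward pass, and the `memory_size` cap are all hypothetical choices made for illustration. The sketch scores each visual token by the mean attention it receives, keeps the top fraction, and carries the kept tokens forward as context for the next clip.

```python
from typing import Any, Callable, List, Sequence


def select_tokens(attn: Sequence[Sequence[float]], keep_ratio: float = 0.05) -> List[int]:
    """Return indices of the visual tokens that receive the most attention.

    attn[q][v] is the weight from query position q to visual token v
    (assumed layout, not taken from the paper).
    """
    if not attn or not attn[0]:
        return []
    num_visual = len(attn[0])
    # Score each visual token by the mean attention it receives.
    scores = [sum(row[v] for row in attn) / len(attn) for v in range(num_visual)]
    # Keep at least one token; keep_ratio=0.05 discards roughly 95%.
    k = max(1, round(num_visual * keep_ratio))
    top = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)[:k]
    return sorted(top)  # restore the original token order


def process_stream(
    clips: Sequence[Sequence[Any]],
    attend: Callable[[List[Any], Sequence[Any]], Sequence[Sequence[float]]],
    keep_ratio: float = 0.05,
    memory_size: int = 4,
) -> List[List[Any]]:
    """Process clips in order, feeding past selected tokens back as context."""
    memory: List[Any] = []  # selected tokens carried over from earlier clips
    kept_per_clip: List[List[Any]] = []
    for clip in clips:
        # attend() is a placeholder for the model: it sees the memory and the
        # current clip, and returns attention over the clip's tokens only.
        attn = attend(memory, clip)
        kept = [clip[i] for i in select_tokens(attn, keep_ratio)]
        kept_per_clip.append(kept)
        # Bounded memory is an assumption made here to keep the sketch simple.
        memory = (memory + kept)[-memory_size:]
    return kept_per_clip
```

In a real pipeline the attention weights would come from the Video-LLM while it processes each short clip, and the tokens would be embeddings rather than toy values. Caption-based question answering, the third concept, is not covered by this sketch.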

Videos

Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis