EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models

Wenhao Xu; Xin Dong; Yue Li; Haoyuan Shi; Zhiwei Xiong

arXiv:2511.18920·cs.CV·November 25, 2025

Open Access

TL;DR

EventSTU introduces an event-guided, training-free framework for efficient video understanding that reduces computational costs by exploiting event-based cues for spatio-temporal sampling and pruning, supported by a new multimodal benchmark.

Contribution

The paper proposes an event-guided, training-free approach to efficient spatio-temporal video understanding, comprising a coarse-to-fine keyframe sampling algorithm and an adaptive token pruning algorithm, and introduces the EventBench benchmark.

Findings

01 Achieves 3.01x FLOPs reduction

02 Attains 3.10x speedup in prefilling

03 Improves performance over baseline

Abstract

Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive,…
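The spatio-temporal idea in the abstract (question relevance sets each keyframe's token budget, and event saliency decides which tokens survive) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `allocate_budgets` and `prune_tokens`, the proportional allocation rule, the `min_per_frame` floor, and the top-k selection are all assumptions made for illustration, and the relevance and saliency scores are taken as given inputs rather than computed from event data.

```python
def allocate_budgets(relevance, total_budget, min_per_frame=1):
    """Split a token budget across keyframes in proportion to relevance.

    Hypothetical rule: proportional allocation with a per-frame floor.
    Because of the floor and integer truncation, the budgets may not
    sum exactly to total_budget.
    """
    n = len(relevance)
    if n == 0:
        return []
    total_rel = sum(relevance)
    if total_rel <= 0:
        # No usable relevance signal: fall back to an even split.
        return [max(min_per_frame, total_budget // n)] * n
    return [
        max(min_per_frame, int(total_budget * r / total_rel))
        for r in relevance
    ]


def prune_tokens(saliency, budget):
    """Keep the indices of the `budget` most salient tokens of one frame.

    `saliency` holds one score per visual token (assumed here to come
    from event activity in the matching patch). Indices are returned in
    ascending order so the kept tokens preserve their spatial order.
    """
    k = max(0, min(budget, len(saliency)))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


def prune_video(relevance, saliency_per_frame, total_budget):
    """Apply the two steps above to every keyframe."""
    budgets = allocate_budgets(relevance, total_budget)
    return [prune_tokens(s, b) for s, b in zip(saliency_per_frame, budgets)]
```

In this sketch a frame judged more relevant to the question keeps more of its tokens, while a frame with little relevance is reduced to the floor. The actual scoring and allocation used by EventSTU are described in the paper itself.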

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis