EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
Wenhao Xu, Xin Dong, Yue Li, Haoyuan Shi, Zhiwei Xiong

TL;DR
EventSTU is an event-guided, training-free framework that lowers the inference cost of video large language models by using event-based cues for keyframe sampling and token pruning. It is accompanied by a new multimodal benchmark, EventBench.
Contribution
The paper proposes an event-guided, training-free approach to efficient spatio-temporal video understanding, comprising a coarse-to-fine keyframe sampling algorithm and an adaptive token pruning algorithm, and introduces the EventBench benchmark.
Findings
Achieves 3.01x FLOPs reduction
Attains 3.10x speedup in prefilling
Improves performance over baseline
Abstract
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive,…
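The budget allocation and saliency-guided pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `allocate_budgets` and `prune_tokens`, the proportional allocation rule, the `min_tokens` floor, and the top-k selection are all assumptions chosen for illustration. The sketch assumes each keyframe already has a question-relevance score and each visual token already has an event-derived saliency score.

```python
import math


def allocate_budgets(relevance, total_budget, min_tokens=1):
    """Split a token budget across keyframes in proportion to relevance.

    Hypothetical rule: frames more relevant to the question keep more tokens.
    Each frame keeps at least `min_tokens`, so the sum may slightly exceed
    `total_budget` when many frames have very low relevance.
    """
    n = len(relevance)
    total = sum(relevance)
    if total <= 0:
        # No usable relevance signal: fall back to a uniform split.
        weights = [1.0 / n] * n
    else:
        weights = [r / total for r in relevance]
    return [max(min_tokens, math.floor(w * total_budget)) for w in weights]


def prune_tokens(saliency, budget):
    """Return the indices of the `budget` most salient tokens, in original order.

    `saliency` stands in for a per-token event-density score (an assumption).
    """
    budget = min(budget, len(saliency))
    ranked = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Three keyframes with illustrative question-relevance scores.
    relevance = [2, 1, 1]
    # Illustrative per-token saliency for each keyframe (4 tokens per frame).
    saliency_per_frame = [
        [0.1, 0.9, 0.4, 0.7],
        [0.3, 0.2, 0.8, 0.1],
        [0.5, 0.6, 0.1, 0.2],
    ]
    budgets = allocate_budgets(relevance, total_budget=8)
    for frame_id, (scores, budget) in enumerate(zip(saliency_per_frame, budgets)):
        kept = prune_tokens(scores, budget)
        print(f"frame {frame_id}: budget={budget}, kept token indices={kept}")
```

Returning the kept indices in their original order preserves the spatial ordering of the surviving tokens, which matters if positional information is attached to them downstream.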
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
