EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning
Pengtao Ma, Ziliang Zhou, Ciyu Ruan, Haoyang Wang, Kaiyuan Li, Zihang Gong, Wenhua Ding, Chen Gao, Jingao Xu, Xinlei Chen

TL;DR
EventPrune is a training-free, event-guided token pruning framework that cuts the computational cost of first-person video analysis while improving spatial reasoning accuracy.
Contribution
To the authors' knowledge, it is the first training-free framework to use high-frequency motion cues from event cameras for token pruning, improving both efficiency and accuracy in first-person dynamic spatial reasoning.
Findings
Prunes 80% of visual tokens while achieving higher accuracy than the full-token baseline.
Provides a 1.89x inference speedup and a 52% reduction in GFLOPs.
Introduces ESR-Real, a new RGB-event benchmark for spatial reasoning.
Abstract
First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80%…
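The three stages named in the abstract can be pictured with a toy sketch. The abstract does not give the actual scoring rules, so everything below is an assumption made for illustration: the function names, the accumulate-and-trigger keyframe rule, the per-token event-count threshold, and the weighted rank fusion with weight `alpha` are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def sample_keyframes(frame_events: Sequence[int], trigger: int) -> List[int]:
    """Stage 1 (assumed rule): emit a keyframe once the event count
    accumulated since the previous keyframe reaches `trigger`.
    Only past and current frames are consulted, so the rule is causal."""
    keyframes: List[int] = []
    accumulated = 0
    for index, count in enumerate(frame_events):
        accumulated += count
        if accumulated >= trigger:
            keyframes.append(index)
            accumulated = 0
    return keyframes


def filter_inactive(token_events: Sequence[int], min_events: int) -> List[int]:
    """Stage 2 (assumed rule): keep the indices of tokens whose image
    patch received at least `min_events` events."""
    return [i for i, count in enumerate(token_events) if count >= min_events]


def ranks(scores: Sequence[float]) -> List[int]:
    """Return the rank of each score (0 = highest); ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    rank = [0] * len(scores)
    for position, i in enumerate(order):
        rank[i] = position
    return rank


def fuse_and_prune(attention: Sequence[float], token_events: Sequence[int],
                   candidates: Sequence[int], keep: int,
                   alpha: float = 0.5) -> List[int]:
    """Stage 3 (assumed rule): blend attention rank and event rank over the
    surviving candidates, then keep the `keep` tokens with the best fused rank."""
    att_rank = ranks([attention[i] for i in candidates])
    evt_rank = ranks([token_events[i] for i in candidates])
    fused = [alpha * a + (1 - alpha) * e for a, e in zip(att_rank, evt_rank)]
    best = sorted(range(len(candidates)), key=lambda j: (fused[j], j))[:keep]
    return sorted(candidates[j] for j in best)


# Made-up per-frame event counts for a six-frame clip.
keyframes = sample_keyframes([1, 2, 9, 1, 1, 15], trigger=10)

# Made-up per-token event counts and attention scores for one keyframe.
token_events = [0, 5, 0, 9, 1, 0, 7, 0, 2, 0]
attention = [0.9, 0.3, 0.8, 0.6, 0.1, 0.7, 0.5, 0.2, 0.4, 0.05]

candidates = filter_inactive(token_events, min_events=2)
keep = max(1, len(token_events) * 20 // 100)  # keep 20%, i.e. prune 80%
kept = fuse_and_prune(attention, token_events, candidates, keep)

print(keyframes, candidates, kept)
```

In this toy input, token 0 has the highest attention score but no events, so the assumed filter in stage 2 drops it before ranking. This is one way to read "suppress event-inactive visual tokens", not necessarily how the paper implements it.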
