EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
Jiameng Li, Minye Wu, Jiezhang Cao, Aleksei Tiulpin, Matthew B. Blaschko

TL;DR
EchoPrune is a lightweight, training-free token pruning method that interprets redundant video tokens as temporal echoes, enabling more efficient and accurate long-form video understanding with VideoLLMs.
Contribution
EchoPrune prunes video tokens by treating temporally redundant ones as echoes of the previous frame, which improves temporal resolution under a fixed LLM-side visual token budget.
Findings
Enables processing up to 20x more frames under the same token budget.
Improves performance by 8.6% on multiple video understanding benchmarks.
Achieves 5.6x inference speedup for prefilled decoding.
Abstract
Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs): dense frame sampling introduces a massive number of visual tokens, while sparse sampling risks missing critical temporal evidence and can lead to LLM hallucination. Existing training-free token reduction methods either treat videos the same as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence.…
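To make the temporal-echo idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated and does not say how reconstruction is measured, so the sketch assumes a simple proxy in which a token counts as well reconstructed when its best cosine match in the previous frame is close to 1. The function names `echo_scores` and `prune_to_budget`, the `(T, N, D)` feature layout, and the global top-k selection rule are all hypothetical choices made for illustration.

```python
import numpy as np


def echo_scores(frames: np.ndarray) -> np.ndarray:
    """Score each token by how poorly the previous frame explains it.

    frames: array of shape (T, N, D) with T frames, N tokens per frame,
    and D feature dimensions. Returns an array of shape (T, N).
    ASSUMPTION: "reconstruction" is approximated by the best cosine match
    in the previous frame; the paper's actual criterion may differ.
    """
    norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    unit = frames / np.maximum(norms, 1e-8)  # L2-normalize every token
    num_frames, num_tokens, _ = unit.shape
    scores = np.empty((num_frames, num_tokens), dtype=np.float64)
    # The first frame has no predecessor, so none of its tokens is an echo.
    scores[0] = np.inf
    for t in range(1, num_frames):
        sim = unit[t] @ unit[t - 1].T          # (N, N) cosine similarities
        scores[t] = 1.0 - sim.max(axis=1)      # low score = likely echo
    return scores


def prune_to_budget(frames: np.ndarray, budget: int) -> np.ndarray:
    """Return a boolean (T, N) mask keeping the `budget` least echo-like tokens.

    ASSUMPTION: tokens are ranked globally across all frames; a per-frame
    quota would be an equally plausible design.
    """
    scores = echo_scores(frames)
    flat = scores.ravel()
    budget = min(budget, flat.size)
    keep = np.argsort(-flat, kind="stable")[:budget]  # highest novelty first
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(scores.shape)
```

Under this proxy, a static background token that reappears unchanged scores near zero and is dropped first, while a token with no close match in the previous frame is retained. Because the budget applies to the total number of kept tokens rather than to the number of frames, more frames can be sampled for the same LLM-side token count.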
