Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
Yogesh Kumar

TL;DR
This paper introduces LGTTP, a method that uses temporal cues in the language query to selectively prune video tokens in vision-language models, cutting computation by 65% while retaining 97-99% of the original performance on video understanding tasks.
Contribution
LGTTP is a model-agnostic framework that adaptively prunes video tokens based on temporal cues in the query, improving efficiency with only a small loss in accuracy.
Findings
65% reduction in computational cost
Retains 97-99% of original performance
Improves HIT@1 by +9.5% on QVHighlights
Abstract
Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
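The core idea in the abstract, keeping more tokens in temporally relevant segments and fewer elsewhere, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-frame `relevance` scores (assumed to come from some query-conditioned scorer), the per-token `scores`, and the `keep_ratio` default are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def allocate_token_budget(relevance: Sequence[float], tokens_per_frame: int,
                          keep_ratio: float = 0.35, floor: int = 1) -> List[int]:
    """Split a global token budget across frames in proportion to relevance.

    `relevance` is a hypothetical per-frame score (higher = more relevant to
    the query). `keep_ratio` is an illustrative value, not taken from the paper.
    Every frame keeps at least `floor` tokens so temporal context is never
    dropped entirely.
    """
    n = len(relevance)
    if n == 0:
        return []
    budget = max(n * floor, round(keep_ratio * n * tokens_per_frame))
    total = sum(relevance)
    # Fall back to a uniform split when no frame is marked relevant.
    weights = [r / total for r in relevance] if total > 0 else [1.0 / n] * n
    spare = budget - n * floor
    # Rounding down and capping means the total may fall slightly under budget.
    return [min(tokens_per_frame, floor + int(spare * w)) for w in weights]


def prune_frame(tokens: Sequence, scores: Sequence[float], k: int) -> list:
    """Keep the k highest-scoring tokens of one frame, in their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def prune_video(frames: Sequence[Sequence], token_scores: Sequence[Sequence[float]],
                relevance: Sequence[float], keep_ratio: float = 0.35) -> List[list]:
    """Apply the per-frame budget to every frame (assumes equal-length frames)."""
    if not frames:
        return []
    alloc = allocate_token_budget(relevance, len(frames[0]), keep_ratio)
    return [prune_frame(f, s, k) for f, s, k in zip(frames, token_scores, alloc)]
```

In contrast to uniform pruning, which would give every frame the same count, frames with high relevance here retain most of the budget, while low-relevance frames still keep a few tokens to preserve contextual continuity.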
