Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing

Yogesh Kumar

arXiv:2508.17686·cs.CV·August 26, 2025

TL;DR

This paper introduces LGTTP (Language-Guided Temporal Token Pruning), a method that uses temporal cues in the language query to selectively prune video tokens in video LLMs, cutting computation by 65% while retaining 97-99% of the original performance on video understanding tasks.

Contribution

LGTTP is a model-agnostic framework that adaptively prunes video tokens based on temporal cues in the query, keeping higher token density in temporally relevant segments. It improves efficiency while retaining 97-99% of the original performance.

Findings

01. 65% reduction in computational cost

02. Retains 97-99% of original performance

03. Improves HIT@1 by +9.5% on QVHighlights

Abstract

Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
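
The abstract describes the idea only at a high level, so the following is a minimal sketch of query-guided, non-uniform token pruning, not the authors' implementation. Everything named here is an assumption made for illustration: the cue table `CUE_CENTERS`, the triangular weighting in `frame_weights`, the per-frame floor in `allocate_budget` (a stand-in for "preserving contextual continuity"), and the per-token saliency scores passed to `prune`.

```python
import re
from typing import List

# Hypothetical cue table (an assumption, not from the paper): maps a temporal
# marker in the query to a position on a normalized 0..1 video timeline.
CUE_CENTERS = {"beginning": 0.0, "start": 0.0, "middle": 0.5, "end": 1.0, "finally": 1.0}


def frame_weights(query: str, num_frames: int, width: float = 0.5) -> List[float]:
    """Return one relevance weight per frame; the weights sum to 1."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    centers = [pos for cue, pos in CUE_CENTERS.items() if cue in words]
    if not centers or num_frames == 1:
        # No explicit temporal marker: fall back to uniform weights.
        return [1.0 / num_frames] * num_frames
    weights = []
    for i in range(num_frames):
        pos = i / (num_frames - 1)
        # Triangular bump around each cue position; take the strongest one.
        weights.append(max(max(0.0, 1.0 - abs(pos - c) / width) for c in centers))
    total = sum(weights)
    return [w / total for w in weights]


def allocate_budget(weights: List[float], tokens_per_frame: int,
                    keep_ratio: float, floor_ratio: float = 0.1) -> List[int]:
    """Split a global token budget across frames, with a small floor per frame."""
    n = len(weights)
    total_budget = int(round(keep_ratio * tokens_per_frame * n))
    floor = max(1, int(floor_ratio * tokens_per_frame))
    budgets = [floor] * n  # every frame keeps at least `floor` tokens
    remaining = total_budget - floor * n
    if remaining > 0:
        for i, w in enumerate(weights):
            # Rounding down and capping at tokens_per_frame can leave part of
            # the budget unused; this sketch does not redistribute it.
            budgets[i] = min(tokens_per_frame, floor + int(remaining * w))
    return budgets


def prune(scores: List[List[float]], budgets: List[int]) -> List[List[int]]:
    """Keep the top-scoring tokens of each frame, returned in original order."""
    kept = []
    for frame_scores, k in zip(scores, budgets):
        order = sorted(range(len(frame_scores)),
                       key=lambda t: frame_scores[t], reverse=True)
        kept.append(sorted(order[:k]))
    return kept


# Usage: 8 frames of 16 tokens each, keeping roughly 35% of tokens overall.
query = "What happens at the end of the video?"
weights = frame_weights(query, num_frames=8)
budgets = allocate_budget(weights, tokens_per_frame=16, keep_ratio=0.35)
scores = [[(7 * f + 3 * t) % 11 for t in range(16)] for f in range(8)]  # dummy saliency
kept = prune(scores, budgets)
print(budgets)
```

With a query that mentions "end", later frames receive larger budgets while earlier frames keep only the floor, which mirrors the contrast the abstract draws with uniform pruning and keyframe selection. In a real VideoLLM the weights would presumably come from a learned query-frame relevance signal rather than a keyword table.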
