Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Jianrui Zhang; Yue Yang; Rohun Tripathi; Winson Han; Ranjay Krishna; Christopher Clark; Yong Jae Lee; Sangho Lee

arXiv:2603.18004 · cs.CV · March 19, 2026

TL;DR

This paper introduces STTS, a lightweight, unified token pruning method for video vision-language models. It prunes 50% of vision tokens for a 62% efficiency gain, with only a 0.7% performance drop across 13 video QA tasks.

Contribution

STTS is an architecture-wide token scoring and pruning module that operates across both the ViT and the LLM, without text conditioning or token merging, and remains compatible with end-to-end training.

Findings

01  Prunes 50% of vision tokens, yielding a 62% efficiency gain.

02  Incurs only a 0.7% performance drop across 13 video QA tasks.

03  Test-time scaling further improves long-video QA performance.

Abstract

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients,…
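The pruning step the abstract describes (score each vision token, keep the highest-scoring ones, discard the rest without merging) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `prune_tokens`, the `keep_ratio` parameter, and the hand-picked scores are all assumptions made for illustration, and the learned scorer and its auxiliary loss are not reproduced here.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return the indices of the tokens to keep, in their original order.

    Hypothetical helper: `scores` stands in for the output of a learned
    scoring module, which this sketch does not implement.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(scores)
    if n == 0:
        return []
    # Keep at least one token so downstream layers always receive input.
    k = max(1, int(n * keep_ratio))
    # Rank by score, highest first; ties go to the earlier token.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    # Restore original order so positional structure is preserved.
    return sorted(ranked[:k])


# Eight illustrative token scores (made up, not taken from the paper).
example_scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7, 0.3, 0.2]
kept = prune_tokens(example_scores, keep_ratio=0.5)
print(kept)  # [0, 2, 3, 5]
```

With `keep_ratio=0.5`, half of the tokens are dropped, matching the 50% pruning rate reported in the Findings. Returning indices in their original order is a design choice of this sketch, so that the kept tokens retain their spatial and temporal ordering.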

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition