Tango: Taming Visual Signals for Efficient Video Large Language Models

Shukang Yin; Sirui Zhao; Hanchao Wang; Baozhi Jia; Xianquan Wang; Chaoyou Fu; Enhong Chen

arXiv:2604.09547 · cs.CV · April 14, 2026

TL;DR

Tango is a token pruning framework for Video LLMs that improves efficiency while largely preserving performance. It addresses limitations of attention-based selection and similarity-based clustering with a diversity-driven selection strategy and Spatio-temporal Rotary Position Embedding (ST-RoPE).

Contribution

The paper proposes Tango, a token pruning method that improves the utilization of visual signals through diversity-driven token selection and ST-RoPE, making video understanding more efficient.

Findings

01. Retaining only 10% of tokens, Tango preserves 98.9% of original performance.

02. Tango achieves a 1.88× inference speedup.

03. Tango is effective across various Video LLMs and benchmarks.

Abstract

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE)…
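The first limitation named in the abstract can be illustrated with a minimal sketch. It is not Tango's algorithm, whose details the abstract does not give. It contrasts plain top-k attention-based selection with a hypothetical greedy selection that penalizes redundancy among the kept tokens. The function names, the `penalty` parameter, and the toy scores and features are all assumptions made for illustration.

```python
import math


def top_k_select(scores, k):
    """Baseline: keep the k tokens with the highest attention scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def diverse_select(scores, feats, k, penalty=0.5):
    """Hypothetical greedy selection: attention score minus a redundancy penalty.

    Illustrative only; this is NOT the selection rule proposed in the paper.
    """
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        def gain(i):
            # Redundancy = highest similarity to any token already kept.
            redundancy = max(
                (cosine(feats[i], feats[j]) for j in selected), default=0.0
            )
            return scores[i] - penalty * redundancy

        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Toy data: tokens 0-2 form one high-attention region with near-identical
# features; tokens 3-4 form a second, lower-attention region.
scores = [0.40, 0.38, 0.36, 0.10, 0.08]
feats = [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0], [0.0, 1.0], [0.0, 0.98]]

print(top_k_select(scores, 2))           # [0, 1]
print(diverse_select(scores, feats, 2))  # [0, 3]
```

In this toy case, top-k spends its whole budget on one region, while the redundancy-penalized variant also keeps a token from the second region. This mirrors the abstract's observation that attention is often spatially multi-modal.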
