Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Jinlong Li; Liyuan Jiang; Haonan Zhang; Nicu Sebe

arXiv:2603.01400 · cs.CV · April 13, 2026

TL;DR

This paper introduces a novel token reduction method for Video Large Language Models using local and global context optimization via optimal transport, significantly improving efficiency while maintaining performance.

Contribution

It proposes a training-free token anchoring and aggregation approach with local-global optimal transport for efficient video understanding in LLMs.

Findings

1. Achieves substantial efficiency gains across various video benchmarks.
2. Maintains high temporal and visual fidelity despite token reduction.
3. Outperforms existing pruning methods in spatiotemporal reduction.

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM, incurring shallow-layer overhead; both yield suboptimal spatiotemporal reduction and underutilize the compressibility of long contexts. Moreover, they often discard subtle yet informative context carried by merged or pruned tokens. In this paper, we propose a new perspective that constructs intra-frame and inter-frame token Anchors to comprehensively aggregate informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, and then use optimal transport to aggregate the informative contexts from pruned tokens into them, constructing…
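The abstract only outlines the pipeline (attention-guided anchors within each frame, then optimal transport that folds pruned-token context into those anchors) and is cut off before the details. The plain-Python sketch below illustrates that idea for a single frame under loudly stated assumptions: anchors are the top-k tokens by a supplied attention score, the transport cost is 1 minus cosine similarity, the transport problem is entropic OT solved with Sinkhorn iterations and uniform marginals, and each anchor is blended with its transported context by a weight `alpha`. All names (`select_anchors`, `sinkhorn`, `aggregate_frame`) and parameters (`k`, `eps`, `alpha`) are hypothetical. This is not the authors' AOT implementation, and it omits the inter-frame (global) stage entirely.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, guarded against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / max(nu * nv, 1e-12)


def select_anchors(attn: Sequence[float], k: int) -> List[int]:
    """Hypothetical anchor rule: keep the k highest-attention tokens,
    returned in their original (positional) order."""
    top = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]
    return sorted(top)


def sinkhorn(cost: Matrix, a: List[float], b: List[float],
             eps: float = 0.1, iters: int = 200) -> Matrix:
    """Entropic OT plan P with row sums ~ a and column sums ~ b."""
    n, m = len(a), len(b)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def aggregate_frame(tokens: Matrix, attn: Sequence[float], k: int,
                    eps: float = 0.1, alpha: float = 0.5) -> Matrix:
    """Reduce one frame's tokens to k anchors, folding pruned-token context
    into the anchors through the transport plan (hypothetical sketch)."""
    if not 0 < k <= len(tokens):
        raise ValueError("k must be in 1..len(tokens)")
    anchors = select_anchors(attn, k)
    anchor_set = set(anchors)
    pruned = [i for i in range(len(tokens)) if i not in anchor_set]
    if not pruned:
        return [list(tokens[i]) for i in anchors]
    dim = len(tokens[0])
    # Cost = 1 - cosine similarity between each pruned token and each anchor.
    cost = [[1.0 - cosine(tokens[p], tokens[q]) for q in anchors] for p in pruned]
    # Uniform marginals: each pruned token sends, and each anchor receives, equal mass.
    plan = sinkhorn(cost, [1.0 / len(pruned)] * len(pruned),
                    [1.0 / len(anchors)] * len(anchors), eps)
    out = []
    for j, q in enumerate(anchors):
        col = [plan[i][j] for i in range(len(pruned))]
        mass = sum(col)  # equals 1/k (up to rounding) after the final v update
        ctx = [sum(w * tokens[p][d] for w, p in zip(col, pruned)) / mass
               for d in range(dim)]
        # Blend the anchor with the context transported to it.
        out.append([(1 - alpha) * tokens[q][d] + alpha * ctx[d] for d in range(dim)])
    return out


if __name__ == "__main__":
    frame = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    scores = [0.9, 0.1, 0.8, 0.2]
    reduced = aggregate_frame(frame, scores, k=2)
    print([[round(x, 2) for x in row] for row in reduced])  # [[0.95, 0.05], [0.05, 0.95]]
```

With `alpha=0.0` the function reduces to plain top-k pruning, which makes the contrast with simply discarding pruned tokens easy to see. Uniform marginals force every anchor to absorb the same total mass; attention-weighted marginals or unbalanced OT are plausible alternatives, but which choice AOT actually makes is not stated in the abstract.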
