AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Haozhe Qi; Kevin Qu; Mahdi Rad; Rui Wang; Alexander Mathis; Marc Pollefeys

arXiv:2603.28696 · cs.CV · March 31, 2026

TL;DR

AdaptToken is a training-free framework that uses a multi-modal large language model's own response uncertainty (entropy) to decide which visual tokens to keep from a long video, improving both accuracy and inference efficiency.

Contribution

AdaptToken introduces an entropy-based token selection method that lets relevance be compared globally across distant video clips and supports early stopping once sufficient evidence has been gathered.

Findings

01  AdaptToken improves accuracy by an average of 6.7 points over baseline models.

02  AdaptToken-Lite halves inference time with comparable performance.

03  The method scales effectively to inputs up to 10,000 frames.
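
As a rough illustration of how the early stopping behind AdaptToken-Lite could save inference time, the sketch below visits groups in order and stops once a group's response entropy falls under a threshold. The stopping rule, the threshold value, and the names `process_with_early_stop` and `score_group` are assumptions made for illustration; the abstract's description of the actual criterion is cut off in this listing.

```python
def process_with_early_stop(groups, score_group, entropy_threshold=0.3):
    """Visit groups in order and stop once the model looks confident.

    groups            : sequence of video groups (any objects)
    score_group       : hypothetical callable returning the response entropy
                        the model shows when conditioned on one group
    entropy_threshold : assumed cut-off; not a value taken from the paper
    """
    visited = []
    for index, group in enumerate(groups):
        entropy = score_group(group)
        visited.append((index, entropy))
        # Assumed rule: low entropy means enough evidence has been seen,
        # so the remaining groups are skipped.
        if entropy < entropy_threshold:
            break
    return visited


# Toy usage with made-up entropies standing in for real model calls.
fake_entropy = {"g0": 1.2, "g1": 0.9, "g2": 0.1, "g3": 0.05}
visited = process_with_early_stop(list(fake_entropy), fake_entropy.get)
```

In this toy run the loop stops at the third group, so the fourth is never scored; the time saved in practice depends on where the confident group appears in the video.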

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model…
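
The pipeline described in the abstract (rank tokens within a group by cross-modal attention, score each group by response entropy, then split a global token budget across groups) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the softmax-over-negative-entropy weighting, the `temperature` parameter, the assumption that lower entropy indicates higher relevance, and all function names are hypothetical choices made for illustration.

```python
import math


def response_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over answers."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split total_budget tokens across groups.

    Assumption: a group with lower response entropy is treated as more
    relevant and receives a larger share (softmax over negative entropy).
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    norm = sum(weights)
    shares = [int(total_budget * w / norm) for w in weights]
    # Tokens lost to rounding go to the most confident group.
    most_confident = min(range(len(entropies)), key=lambda i: entropies[i])
    shares[most_confident] += total_budget - sum(shares)
    return shares


def select_tokens(attention_scores, k):
    """Indices of the k highest-attention tokens, kept in temporal order."""
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: two groups, the first yielding a more confident response.
entropies = [response_entropy([0.97, 0.01, 0.01, 0.01]),
             response_entropy([0.25, 0.25, 0.25, 0.25])]
budget = allocate_budget(entropies, total_budget=100)
kept = select_tokens([0.1, 0.9, 0.5, 0.7], k=2)
```

Because the entropy is computed per group on the same scale, the resulting shares are comparable across clips that are far apart in the video, which is the global comparison the abstract says per-clip scoring lacks.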
