TL;DR
AdaptToken is a training-free framework that uses a multi-modal large language model's own uncertainty to adaptively select tokens from long videos, improving both efficiency and accuracy.
Contribution
It introduces a principled, entropy-based token selection method that enables global relevance comparison and early stopping in long-video understanding tasks.
Findings
AdaptToken improves accuracy by an average of 6.7 points over baseline models.
AdaptToken-Lite halves inference time with comparable performance.
The method scales effectively to inputs up to 10,000 frames.
Abstract
Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model…
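The entropy-driven control loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It makes three assumptions: a group's response entropy is computed from the model's probabilities over candidate answers, lower entropy is read as higher prompt relevance, and early stopping fires once entropy falls below a threshold. The function names, the softmax weighting, the `temperature` parameter, and the `threshold` value are all hypothetical.

```python
import math


def response_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over answers."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split a global token budget across groups.

    Assumption: a lower entropy means the group is more relevant, so it
    receives a larger share (softmax over negative entropy).
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    norm = sum(weights)
    budgets = [int(total_budget * w / norm) for w in weights]
    # Tokens lost to integer truncation go to the most confident group.
    budgets[entropies.index(min(entropies))] += total_budget - sum(budgets)
    return budgets


def groups_until_confident(entropies, threshold):
    """Hypothetical early-stopping rule: process groups in order and stop
    at the first one whose entropy drops below the threshold."""
    for i, h in enumerate(entropies):
        if h < threshold:
            return i + 1
    return len(entropies)


if __name__ == "__main__":
    # Made-up per-group entropies, for illustration only.
    group_entropies = [0.1, 1.3, 0.7]
    print(allocate_budget(group_entropies, total_budget=1000))
    print(groups_until_confident([1.2, 0.9, 0.2, 1.0], threshold=0.3))
```

Because every group is scored on the same entropy scale, budgets for distant clips can be compared directly, which is the global comparison that per-clip scoring lacks.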
