Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Siyou Li; Huanan Wu; Juexi Shao; Yinghao Ma; Yujian Gan; Yihao Luo; Yuwei Wang; Dong Nie; Lu Wang; Wenqing Wu; Le Zhang; Massimo Poesio; Juntao Yu

arXiv:2511.11910·cs.CV·February 26, 2026

TL;DR

This paper introduces QTSplus, a query-aware token selection method that sharply reduces the number of visual tokens a long-video multimodal model must process, cutting latency while keeping accuracy close to that of the uncompressed model.

Contribution

The paper proposes QTSplus, a novel, dynamic visual token selector that improves long-video understanding by reducing computational costs while preserving task-relevant information.

Findings

01. Compresses the vision stream by up to 89%

02. Reduces end-to-end latency by 28%

03. Maintains near-parity accuracy on benchmarks

Abstract

Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-$n$ tokens with a differentiable straight-through estimator during…
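The three steps named in the abstract can be illustrated with a minimal, forward-pass-only sketch in plain Python. It is not the authors' implementation. The function names (`score_tokens`, `predict_budget`, `select_top_n`) are hypothetical, the entropy-based budget rule is a stand-in assumption for the learned budget predictor, and the straight-through estimator is only noted in a comment because it requires an autodiff framework.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score_tokens(query_vecs, visual_vecs):
    """Step (i): cross-attention-style relevance of each visual token.

    Each query vector attends over all visual tokens with scaled
    dot-product attention; the weights are averaged over query vectors,
    so the returned scores sum to 1.
    """
    scale = 1.0 / math.sqrt(len(visual_vecs[0]))
    scores = [0.0] * len(visual_vecs)
    for q in query_vecs:
        logits = [scale * sum(a * b for a, b in zip(q, v)) for v in visual_vecs]
        for i, w in enumerate(softmax(logits)):
            scores[i] += w / len(query_vecs)
    return scores


def predict_budget(scores, min_keep=1):
    """Step (ii): instance-specific retention budget.

    ASSUMPTION: a hand-written rule, not the paper's learned predictor.
    Peaked scores (low entropy) keep few tokens; flat scores keep many.
    """
    n = len(scores)
    if n <= 1:
        return n
    entropy = -sum(p * math.log(p) for p in scores if p > 0.0)
    frac = entropy / math.log(n)  # normalized entropy, roughly in [0, 1]
    return max(min_keep, min(n, math.ceil(frac * n)))


def select_top_n(scores, n):
    """Step (iii): hard Top-n selection, returned in temporal order.

    Forward pass only. In an autodiff framework, a straight-through
    estimator would use this hard mask in the forward pass and route
    gradients through a soft relaxation in the backward pass.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# Toy usage: one 2-d query vector and four 2-d visual tokens.
query = [[1.0, 0.0]]
visual = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0], [3.0, 0.0]]
scores = score_tokens(query, visual)
kept = select_top_n(scores, predict_budget(scores))
```

Returning the kept indices in their original order preserves the temporal ordering of the visual tokens passed on to the LLM, which is a design choice of this sketch rather than something the abstract specifies.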

Code & Models

Datasets

AlpachinoNLP/QTSplus-Dataset
dataset · 451 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning