TrajTok: Learning Trajectory Tokens enables better Video Understanding

Chenhao Zheng; Jieyu Zhang; Jianing Zhang; Weikai Huang; Ashutosh Kumar; Quan Kong; Oncel Tuzel; Chun-Liang Li; Ranjay Krishna

arXiv:2602.22779·cs.CV·May 12, 2026

TL;DR

TrajTok introduces an end-to-end, adaptive video tokenizer that improves efficiency and performance in video understanding tasks by directly producing object trajectories through implicit clustering.

Contribution

It presents a fully integrated, co-trained video tokenizer that adapts token granularity to semantic complexity, enhancing video understanding without external segmentation pipelines.

Findings

01

TrajViT2 achieves top accuracy in classification and retrieval benchmarks.

02

TrajTok maintains efficiency comparable to token-merging methods.

03

TrajTok enhances long-video reasoning in vision-language models.

Abstract

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is…
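To make the idea of clustering-based tokenization concrete, here is a minimal sketch of how pooling patch features into clusters can decouple token count from video duration. It is an illustration only and not the TrajTok segmenter: the function names (`soft_assign`, `trajectory_tokens`), the fixed `centers`, the `temperature`, and the `min_share` threshold are all assumptions made for this example. A real model would learn its grouping end to end from the downstream objective.

```python
import math

def soft_assign(feature, centers, temperature=0.1):
    """Softmax over negative squared distances from one patch to each center."""
    logits = [
        -sum((f - c) ** 2 for f, c in zip(feature, center)) / temperature
        for center in centers
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def trajectory_tokens(video, centers, min_share=0.01):
    """Pool patch features from all frames into one token per occupied cluster.

    video:   list of frames, each a list of patch feature vectors.
    centers: hypothetical cluster centers (fixed here, learned in a real model).
    Clusters whose share of the total assignment mass is below min_share
    are dropped, so the token count follows the content, not the frame count.
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    mass = [0.0] * len(centers)
    for frame in video:
        for feature in frame:
            for k, w in enumerate(soft_assign(feature, centers)):
                mass[k] += w
                for d in range(dim):
                    sums[k][d] += w * feature[d]
    total_mass = sum(mass)
    return [
        [s / mass[k] for s in sums[k]]
        for k in range(len(centers))
        if mass[k] / total_mass >= min_share
    ]

# Toy clip: two "objects" in feature space; the third center matches nothing.
centers = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frame = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
short_tokens = trajectory_tokens([frame] * 4, centers)
long_tokens = trajectory_tokens([frame] * 16, centers)
print(len(short_tokens), len(long_tokens))  # → 2 2
```

In this toy setup, patchification would yield 16 tokens for the 4-frame clip and 64 for the 16-frame clip, while the clustered representation yields 2 tokens in both cases, one per occupied cluster.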
