TrajTok: Learning Trajectory Tokens enables better Video Understanding
Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna

TL;DR
TrajTok introduces an end-to-end, adaptive video tokenizer that improves efficiency and performance in video understanding tasks by directly producing object trajectories through implicit clustering.
Contribution
It presents a fully integrated, co-trained video tokenizer that adapts token granularity to semantic complexity, enhancing video understanding without external segmentation pipelines.
Findings
TrajViT2 achieves top accuracy on classification and retrieval benchmarks.
TrajTok achieves efficiency comparable to token-merging methods.
It enhances long-video reasoning in vision-language models.
Abstract
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is…
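To make the abstract's contrast concrete, here is a minimal sketch in plain Python. It is not the paper's segmenter: `patch_token_count` is only the standard count for non-overlapping space-time patches, and `soft_cluster_pool` is a generic softmax-assignment pooling used as a stand-in for "implicit clustering". All function names, the patch size of 16, the tubelet length of 2, and the toy features are assumptions chosen for illustration.

```python
import math


def patch_token_count(frames, height, width, patch=16, tubelet=2):
    """Token count for uniform, non-overlapping space-time patchification.

    patch and tubelet are illustrative defaults, not values from the paper.
    """
    return (frames // tubelet) * (height // patch) * (width // patch)


def soft_cluster_pool(features, centers):
    """Pool N feature vectors into K tokens by softmax assignment.

    features: list of N vectors of length D (e.g. pixel/patch features
              gathered from every frame of the clip)
    centers:  list of K vectors of length D (hypothetical cluster queries)
    Returns K pooled vectors; K does not depend on N.
    """
    k = len(centers)
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    mass = [0.0] * k
    for f in features:
        # Similarity of this feature to every cluster center.
        logits = [sum(a * b for a, b in zip(f, c)) for c in centers]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for j in range(k):
            w = exps[j] / total  # soft assignment weight
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * f[d]
    # Weighted mean of the features assigned to each cluster.
    return [[s / max(mass[j], 1e-9) for s in sums[j]] for j in range(k)]


def toy_video(frames):
    """Two 'objects' per frame, each with a fixed 2-D feature (toy data)."""
    return [[1.0, 0.0], [0.0, 1.0]] * frames


# Patch tokens grow linearly with clip length.
print(patch_token_count(16, 224, 224))  # 1568
print(patch_token_count(64, 224, 224))  # 6272

# Cluster-pooled tokens stay at K = 2 however long the clip is.
centers = [[4.0, 0.0], [0.0, 4.0]]
short_tokens = soft_cluster_pool(toy_video(2), centers)
long_tokens = soft_cluster_pool(toy_video(8), centers)
print(len(short_tokens), len(long_tokens))
```

In this toy setting the pooled token count is fixed by the number of centers, which mirrors the abstract's point that token count is decoupled from video duration. How TrajTok actually chooses the number of trajectories and adapts it to semantic complexity is not specified in the text above.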
