One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, Ranjay Krishna

TL;DR
This paper introduces grounded video tokenization using object trajectories, significantly reducing tokens and computational costs while improving performance in video understanding tasks.
Contribution
The authors propose TrajViT, a novel video encoder that leverages panoptic sub-object trajectories for efficient and effective video tokenization, outperforming existing methods.
Findings
TrajViT outperforms ViT3D by 6% top-5 recall in video-text retrieval.
TrajViT achieves 5.2% better performance on VideoQA benchmarks.
TrajViT trains 4x faster and uses 18x less inference FLOPs than ViT3D.
Abstract
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
