VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

Feng Wang; Yichun Shi; Ceyuan Yang; Qiushan Guo; Jingxiang Sun; Alan Yuille; Peng Wang

arXiv:2602.04202·cs.CV·February 5, 2026

TL;DR

VTok tokenizes a video as the spatial features of a single key frame plus one residual token for each subsequent frame, which shortens token sequences while improving results on video understanding and generation.

Contribution

The paper proposes a unified framework that encodes the spatial features of a key frame separately from the residual temporal changes of the frames that follow it, replacing the naive frame-sampling strategy used by leading vision-language systems.
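To make the decoupled layout concrete, here is a minimal sketch of how such a token sequence might be arranged. It only produces symbolic labels, not real features; the function name `layout_token_sequence`, the label strings, and the choice of the first frame as the key frame are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List


def layout_token_sequence(num_frames: int, tokens_per_frame: int) -> List[str]:
    """Return symbolic labels for a decoupled video token sequence.

    Assumption (hypothetical): the first frame is the key frame and keeps
    all of its spatial tokens; every later frame contributes exactly one
    residual token describing its change relative to the key frame.
    """
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")

    # Spatial part: the full token grid of the key frame.
    spatial = [f"key_spatial_{i}" for i in range(tokens_per_frame)]

    # Temporal part: one residual token per subsequent frame.
    residual = [f"residual_frame_{t}" for t in range(1, num_frames)]

    return spatial + residual


if __name__ == "__main__":
    # Toy sizes chosen only for readability.
    for label in layout_token_sequence(num_frames=4, tokens_per_frame=3):
        print(label)
```

With 4 frames and 3 tokens per frame, the sketch yields 3 spatial labels followed by 3 residual labels, one for each frame after the key frame.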

Findings

01. Higher accuracy on video understanding benchmarks

02. More coherent motion in text-to-video generation

03. Shorter token sequences per video

Abstract

This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and…
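The product-versus-sum claim in the abstract can be illustrated with simple arithmetic. The sketch below compares the sequence length of per-frame tokenization with that of a key frame plus one residual token per subsequent frame. The function names and the example sizes (256 tokens per frame; 8, 16, and 32 frames) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def naive_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Frame-sampling baseline: every frame keeps its full token grid."""
    return num_frames * tokens_per_frame


def decoupled_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Key frame keeps its spatial tokens; each later frame adds one token.

    Assumption: exactly one key frame per video, as the abstract describes.
    """
    return tokens_per_frame + (num_frames - 1)


if __name__ == "__main__":
    tokens_per_frame = 256  # hypothetical per-frame token count
    for num_frames in (8, 16, 32):
        naive = naive_token_count(num_frames, tokens_per_frame)
        decoupled = decoupled_token_count(num_frames, tokens_per_frame)
        print(f"frames={num_frames:2d}  naive={naive:5d}  decoupled={decoupled:4d}")
```

Under these assumed sizes, the naive count grows in proportion to the number of frames, while the decoupled count grows by one token per added frame.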

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization