Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo

TL;DR
CoordTok is a coordinate-based video tokenizer that encodes long videos into fewer tokens, enabling memory-efficient training of long-video generative models.
Contribution
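The paper presents CoordTok, a coordinate-based tokenizer that encodes a video into factorized triplane representations and reconstructs only the patches at randomly sampled coordinates. Because it does not reconstruct all frames at once, large tokenizer models can be trained directly on long videos at a much lower training cost.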
Findings
Encodes a 128-frame video into 1280 tokens, where baseline tokenizers need over 6000.
Enables memory-efficient training of a 128-frame diffusion transformer.
Maintains high reconstruction quality with fewer tokens.
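To make the token counts above concrete, the following is a small arithmetic sketch of why a factorized triplane layout needs fewer tokens than a dense spatio-temporal grid. The latent sizes (16 x 16 spatial, 32 temporal) are assumptions chosen for illustration so that the total matches the 1280 figure; this page does not state the actual plane sizes.

```python
def triplane_token_count(h: int, w: int, t: int) -> int:
    """Tokens in three factorized planes: (h x w), (h x t), and (w x t)."""
    return h * w + h * t + w * t


def dense_grid_token_count(h: int, w: int, t: int) -> int:
    """Tokens in a full h x w x t latent grid, for comparison."""
    return h * w * t


# Assumed latent sizes (hypothetical, not taken from the page).
H, W, T = 16, 16, 32

triplane_tokens = triplane_token_count(H, W, T)
dense_tokens = dense_grid_token_count(H, W, T)

print(f"triplane: {triplane_tokens} tokens, dense grid: {dense_tokens} tokens")
```

Under these assumed sizes, the triplane count grows with the sum of pairwise products of the latent dimensions, while the dense grid grows with their full product, which is where the saving on long videos comes from.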
Abstract
Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled coordinates. This allows for training large tokenizer models directly on long videos…
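The coordinate-based reconstruction described in the abstract can be sketched roughly as follows, using NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the three planes are random stand-ins for encoder outputs, the plane sizes, channel width, and patch size are hypothetical, and the decoder is reduced to a single linear map. It shows only the core idea of sampling random (x, y, t) coordinates, interpolating features from each plane, and predicting the patch at each sampled coordinate.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a (A, B, C) feature plane at points (u, v) in [0, 1].

    u and v are arrays of shape (N,); the result has shape (N, C).
    """
    A, B, _ = plane.shape
    a = u * (A - 1)
    b = v * (B - 1)
    a0 = np.floor(a).astype(int)
    b0 = np.floor(b).astype(int)
    a1 = np.minimum(a0 + 1, A - 1)  # clamp at the upper border
    b1 = np.minimum(b0 + 1, B - 1)
    wa = (a - a0)[:, None]
    wb = (b - b0)[:, None]
    top = (1 - wb) * plane[a0, b0] + wb * plane[a0, b1]
    bottom = (1 - wb) * plane[a1, b0] + wb * plane[a1, b1]
    return (1 - wa) * top + wa * bottom


def query_triplane(plane_xy, plane_yt, plane_xt, coords):
    """Gather features for (x, y, t) coordinates from three factorized planes."""
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    f_xy = bilinear_sample(plane_xy, x, y)
    f_yt = bilinear_sample(plane_yt, y, t)
    f_xt = bilinear_sample(plane_xt, x, t)
    return np.concatenate([f_xy, f_yt, f_xt], axis=1)


rng = np.random.default_rng(0)
C = 8  # assumed channel width per plane

# Random stand-ins for the planes an encoder would produce (assumption).
plane_xy = rng.normal(size=(16, 16, C))
plane_yt = rng.normal(size=(16, 32, C))
plane_xt = rng.normal(size=(16, 32, C))

# Randomly sample a small set of normalized (x, y, t) coordinates,
# rather than covering every patch of every frame.
coords = rng.uniform(size=(64, 3))
features = query_triplane(plane_xy, plane_yt, plane_xt, coords)

# Placeholder decoder: one linear map to flattened 8 x 8 RGB patches (assumption).
patch_dim = 8 * 8 * 3
decoder_weights = rng.normal(size=(3 * C, patch_dim)) * 0.01
predicted_patches = features @ decoder_weights

print(features.shape, predicted_patches.shape)
```

In a training loop, the predicted patches would be compared against the ground-truth patches at the same coordinates, so the cost per step depends on the number of sampled coordinates rather than on the full video length.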
Taxonomy
Topics: Digital Media Forensic Detection · Advanced Steganography and Watermarking Techniques · Image and Video Stabilization
Methods: Diffusion
