TL;DR
VideoFlexTok is a variable-length, coarse-to-fine video tokenizer: the first tokens capture abstract information such as semantics and motion, and later tokens add fine-grained detail. This structure improves efficiency and scalability for generative video models.
Contribution
It proposes a flexible-length tokenization approach paired with a generative flow decoder, yielding compact, adaptable video representations that support long videos in downstream tasks.
Findings
Achieves comparable quality with 5x smaller models.
Enables long video generation with fewer tokens.
Improves training efficiency over traditional 3D grid tokens.
Abstract
Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details.…
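To make the contrast in the abstract concrete, here is a minimal sketch. It is not the paper's implementation. The patch sizes, the video dimensions, and the function names `grid_token_count` and `truncate_prefix` are all assumptions chosen for illustration. The sketch shows only the bookkeeping: a 3D grid produces a token count fixed by the video's size, whereas a coarse-to-fine ordered sequence can be cut to any prefix, keeping the most abstract tokens first.

```python
def grid_token_count(frames, height, width, pt=4, ph=16, pw=16):
    """Token count of a spatiotemporal 3D grid.

    pt, ph, pw are assumed temporal/spatial patch sizes, not values from the paper.
    """
    return (frames // pt) * (height // ph) * (width // pw)


def truncate_prefix(tokens, budget):
    """Keep the first `budget` tokens of a sequence ordered coarse to fine."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return tokens[:budget]


# Hypothetical clip: 16 frames at 256x256 pixels.
n_grid = grid_token_count(16, 256, 256)  # (16/4) * (256/16) * (256/16) = 1024

# Stand-in for token ids that a coarse-to-fine tokenizer would emit in order.
ordered_tokens = list(range(256))
coarse_only = truncate_prefix(ordered_tokens, 32)  # first 32 tokens only
```

With a grid, the downstream model must always predict all `n_grid` tokens. With an ordered sequence, the token budget becomes a choice, so a simple video can be represented by a short prefix.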
