VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Andrei Atanov; Jesse Allardice; Roman Bachmann; Oğuzhan Fatih Kar; R Devon Hjelm; David Griffiths; Peter Fu; Afshin Dehghan; Amir Zamir

arXiv:2604.12887·cs.CV·April 15, 2026

TL;DR

VideoFlexTok is a variable-length, coarse-to-fine video tokenizer: the first tokens capture abstract information such as semantics and motion, and later tokens add fine-grained detail. This structure makes downstream generative video models more efficient and scalable.

Contribution

The paper proposes a flexible-length tokenization approach paired with a generative flow decoder, giving downstream tasks efficient, adaptable video representations that extend to long videos.

Findings

01. Achieves comparable quality with 5x smaller models.
02. Enables long video generation with fewer tokens.
03. Improves training efficiency over traditional 3D grid tokens.

Abstract

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details.…
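The contrast the abstract draws between a fixed 3D grid and an ordered, variable-length sequence can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names `grid_token_count` and `truncate_coarse_to_fine`, the clip size, the strides, and the token budget are hypothetical and are not taken from the paper or from the apple/ml-videoflextok code. The real tokenizer is a learned model; this only shows the bookkeeping.

```python
from typing import List, Sequence


def grid_token_count(frames: int, height: int, width: int,
                     t_stride: int, s_stride: int) -> int:
    """Token count of a fixed spatiotemporal 3D grid.

    The strides are hypothetical downsampling factors. The count depends
    only on the clip's shape, not on how complex its content is.
    """
    return (frames // t_stride) * (height // s_stride) * (width // s_stride)


def truncate_coarse_to_fine(tokens: Sequence[int], budget: int) -> List[int]:
    """Keep the first `budget` tokens of a coarse-to-fine ordered sequence.

    With a coarse-to-fine ordering, a prefix is itself a usable (coarser)
    representation, so the length can be chosen per video or per task.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return list(tokens[:budget])


# Illustrative numbers only: a 16-frame 256x256 clip, 4x temporal and
# 16x spatial downsampling.
grid_tokens = grid_token_count(frames=16, height=256, width=256,
                               t_stride=4, s_stride=16)

# Placeholder token ids standing in for an ordered coarse-to-fine sequence.
ordered_tokens = list(range(grid_tokens))
coarse_only = truncate_coarse_to_fine(ordered_tokens, budget=32)

print(grid_tokens, len(coarse_only))
```

With a grid, a downstream model always has to predict every token. With an ordered sequence, it could in principle be trained on, or generate, only a short prefix when coarse content is enough.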

Code & Models

Repositories

apple/ml-videoflextok (GitHub)