A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, and Liang-Chieh Chen

arXiv:2604.04913 · cs.CV · April 7, 2026

TL;DR

DeltaTok is a tokenizer that encodes the feature difference between consecutive video frames into a single token, enabling efficient and diverse future prediction with a much smaller and cheaper world model.

Contribution

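The paper presents DeltaTok, a tokenizer that compresses the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model that operates on these tokens for more efficient, multi-hypothesis forecasting.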

Findings

01 Uses over 35x fewer parameters than existing generative world models.

02 Uses 2000x fewer FLOPs for dense forecasting tasks.

03 Produces futures that better match real-world outcomes.

Abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal…
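To make the delta-token idea concrete, here is a minimal sketch of the bookkeeping only. It assumes per-frame VFM features arrive as a `(T, H, W, C)` array and replaces the learned tokenizer with a toy mean-pool followed by a linear projection. The names `feature_deltas`, `encode_delta_tokens`, and `proj`, and all shapes, are hypothetical and are not taken from the paper.

```python
import numpy as np

def feature_deltas(features):
    """Differences between consecutive frame features: (T, H, W, C) -> (T-1, H, W, C)."""
    return features[1:] - features[:-1]

def encode_delta_tokens(deltas, proj):
    """Toy stand-in for a learned tokenizer (assumption, not the paper's encoder):
    mean-pool each delta map over space, then project to one D-dim token per frame pair."""
    pooled = deltas.mean(axis=(1, 2))   # (T-1, C)
    return pooled @ proj                # (T-1, D)

rng = np.random.default_rng(0)
T, H, W, C, D = 8, 16, 16, 32, 64       # hypothetical sizes
features = rng.standard_normal((T, H, W, C))    # placeholder for real VFM features
proj = rng.standard_normal((C, D)) / np.sqrt(C)  # random weights standing in for learned ones

deltas = feature_deltas(features)
tokens = encode_delta_tokens(deltas, proj)

patch_token_count = T * H * W           # one token per spatial patch per frame
delta_token_count = tokens.shape[0]     # one token per consecutive frame pair
print(tokens.shape)                         # (7, 64)
print(patch_token_count, delta_token_count)  # 2048 7
```

Under these assumed sizes the sequence shrinks from 2048 patch tokens to 7 delta tokens, which illustrates where savings of this kind could come from. The pooling step discards spatial detail that a learned tokenizer would presumably need to preserve.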
