RefTok: Reference-Based Tokenization for Video Generation

Xiang Fan; Xiaohang Sun; Kushan Thakkar; Zhu Liu; Vimal Bhat; Ranjay Krishna; Xiang Hao

arXiv:2507.02862 · cs.CV · July 4, 2025

TL;DR

RefTok introduces a reference-based tokenization approach that effectively captures temporal dependencies in videos, significantly improving compression and generation quality across multiple datasets compared to existing methods.

Contribution

RefTok is a novel reference-based tokenization method that encodes and decodes video frames conditioned on an unquantized reference, enhancing temporal modeling and compression.
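The idea of conditioning on an unquantized reference can be illustrated with a toy sketch. This is not the authors' architecture: plain scalar residual quantization stands in for a learned tokenizer, and every name here (`quantize`, `encode_with_reference`, `decode_with_reference`, the `step` parameter, the sample values) is a hypothetical chosen for illustration. The sketch only shows why a full-precision reference shared by encoder and decoder can reduce reconstruction error at the same quantization coarseness.

```python
# Toy illustration only: scalar residual quantization stands in for a
# learned tokenizer. All names and values are assumptions, not RefTok's API.
from typing import List

Frame = List[float]


def quantize(values: Frame, step: float) -> List[int]:
    """Map each value to the index of its nearest quantization level."""
    return [round(v / step) for v in values]


def dequantize(tokens: List[int], step: float) -> Frame:
    """Map token indices back to their quantization levels."""
    return [t * step for t in tokens]


def encode_with_reference(frames: List[Frame], reference: Frame, step: float) -> List[List[int]]:
    """Quantize each frame's residual against the unquantized reference."""
    return [quantize([x - r for x, r in zip(frame, reference)], step) for frame in frames]


def decode_with_reference(tokens: List[List[int]], reference: Frame, step: float) -> List[Frame]:
    """Add the dequantized residual back onto the same unquantized reference."""
    return [[r + d for r, d in zip(reference, dequantize(tok, step))] for tok in tokens]


def max_abs_error(original: List[Frame], decoded: List[Frame]) -> float:
    """Largest per-value reconstruction error over all frames."""
    return max(abs(a - b) for fa, fb in zip(original, decoded) for a, b in zip(fa, fb))


# The reference frame is kept at full precision (never quantized).
reference = [0.13, 0.52, 0.87, 0.31]
# Later frames differ only slightly from the reference (temporal redundancy).
frames = [[0.14, 0.52, 0.86, 0.33], [0.12, 0.55, 0.88, 0.30]]
step = 0.25  # deliberately coarse quantization

# Baseline: quantize every frame independently, ignoring the reference.
independent = [dequantize(quantize(f, step), step) for f in frames]
ind_err = max_abs_error(frames, independent)

# Reference-conditioned: quantize only what differs from the reference.
tokens = encode_with_reference(frames, reference, step)
ref_err = max_abs_error(frames, decode_with_reference(tokens, reference, step))

print(f"independent error: {ind_err:.3f}, reference-conditioned error: {ref_err:.3f}")
```

In this toy setting the residuals are small relative to the quantization step, so the reference carries most of the appearance information and the tokens only need to describe the change. A learned tokenizer would replace the subtraction and rounding with neural encoding and decoding.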

Findings

01. Outperforms state-of-the-art tokenizers by 36.7% on the evaluated metrics.

02. Improves video generation quality by 27.9% over larger models.

03. Maintains details and continuity across diverse video datasets.

Abstract

Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications