PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Onkar Susladkar; Tushar Prakash; Adheesh Juvekar; Kiet A. Nguyen; Dong-Hwan Jang; Inderjit S Dhillon; Ismini Lourentzou

arXiv:2601.16210 · cs.CV · February 24, 2026

TL;DR

PyraTok is a pyramidal video tokenizer that learns multi-scale, semantically structured discrete tokens aligned with language, improving zero-shot video understanding and generation across the ten benchmarks evaluated.

Contribution

It introduces Language aligned Pyramidal Quantization (LaPQ), a module that discretizes encoder features at several depths using a shared large binary codebook, and reports improved cross-modal alignment and state-of-the-art zero-shot performance on video tasks.

Findings

01. Achieves state-of-the-art video reconstruction results.

02. Improves text-to-video quality across benchmarks.

03. Sets new zero-shot performance records in video segmentation and action localization.

Abstract

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks,…
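
To make the phrase "discretizes encoder features at several depths using a shared large binary codebook" more concrete, the following is a minimal sketch and not the PyraTok or LaPQ implementation. It assumes that a binary codebook can be modeled as sign quantization (each d-dimensional vector maps to one of 2^d implicit codes) and that coarser pyramid levels come from average pooling; both are illustrative assumptions, and the helper names `binary_quantize`, `average_pool`, and `pyramid_tokens` are hypothetical. Text guidance and the autoregressive objective described in the abstract are not modeled here.

```python
# Illustrative sketch only: NOT the PyraTok / LaPQ implementation.
# Assumption: the "binary codebook" is modeled as sign quantization, so a
# d-dimensional feature maps to one of 2**d implicit codes shared by all levels.

def binary_quantize(vec):
    """Return the {-1, +1} code for `vec` and its integer token index."""
    bits = [1 if x > 0 else 0 for x in vec]
    code = [1.0 if b else -1.0 for b in bits]
    index = 0
    for b in bits:
        index = (index << 1) | b  # first dimension is the most significant bit
    return code, index

def average_pool(features, factor):
    """Hypothetical downsampling: average consecutive groups of `factor` vectors."""
    pooled = []
    for start in range(0, len(features) - factor + 1, factor):
        group = features[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / factor for i in range(dim)])
    return pooled

def pyramid_tokens(features, factors=(1, 2, 4)):
    """Quantize the same feature sequence at several resolutions with one shared rule."""
    return {
        factor: [binary_quantize(v)[1] for v in average_pool(features, factor)]
        for factor in factors
    }

# Toy "encoder features": four positions, three channels each (made-up values).
features = [
    [0.5, -1.0, 0.2],
    [0.1, 0.3, -0.7],
    [-0.4, 0.9, 0.6],
    [-0.1, -0.5, 0.8],
]
tokens = pyramid_tokens(features)
for factor, ids in tokens.items():
    print(f"downsample x{factor}: {ids}")
```

In this toy setup every level draws token ids from the same 2^3 = 8 implicit codes, which is the sense in which the codebook is shared across scales; coarser levels simply produce shorter token sequences.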

Code & Models

Models

🤗 onkarsus13/PyraTok (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition