SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization

Zhentao Tan; Ben Xue; Jian Jia; Junhao Wang; Wencai Ye; Shaoyun Shi; Mingjie Sun; Wenjin Wu; Quan Chen; Peng Jiang

arXiv:2412.10443·cs.CV·March 12, 2025

Open Access

TL;DR

SweetTok introduces a semantic-aware spatial-temporal video tokenizer that efficiently compresses video data, capturing essential information and enabling improved reconstruction, generation, and few-shot recognition.

Contribution

It proposes a decoupled query autoencoder and a motion-enhanced codebook for superior video discretization and semantic encoding.

Findings

01. Achieves 42.8% improvement in video reconstruction on UCF-101.
02. Boosts downstream video generation by 15.1%.
03. Enables few-shot recognition with semantic tokens.

Abstract

This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTok), a novel video tokenizer that overcomes the limitations of current video tokenization methods to achieve compact yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework that compresses visual inputs through distinct spatial and temporal queries via a Decoupled Query AutoEncoder (DQAE). This design allows SweetTok to reduce the video token count efficiently while achieving superior fidelity by capturing essential information across the spatial and temporal dimensions. Furthermore, we design a Motion-enhanced Language Codebook (MLC) tailored for spatial and temporal compression to…
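The abstract excerpt does not spell out how latents are mapped to discrete tokens, so the following is only a minimal sketch of the generic VQ-VAE-style nearest-neighbour lookup, applied separately to spatial and temporal latents to mirror the decoupling idea. The function names, the toy vectors, and the use of two independent codebooks are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


# Hypothetical toy latents, standing in for the outputs of spatial and
# temporal queries (assumed shapes and values, not from the paper).
spatial_latents = [(0.9, 0.1), (0.1, 0.8)]
temporal_latents = [(0.4, 0.6)]

# Hypothetical codebooks; keeping one per branch is an assumption.
spatial_codebook = [(1.0, 0.0), (0.0, 1.0)]
temporal_codebook = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]

spatial_tokens = quantize(spatial_latents, spatial_codebook)
temporal_tokens = quantize(temporal_latents, temporal_codebook)

print(spatial_tokens, temporal_tokens)
```

In this toy setup the video is represented by three discrete indices in total, which illustrates why the number of queries, rather than the number of input patches, determines the token count.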

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: VQ-VAE