MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

Dawit Mureja Argaw; Xian Liu; Joon Son Chung; Ming-Yu Liu; Fitsum Reda

arXiv:2507.04559 · cs.CV · July 8, 2025

TL;DR

This paper presents MambaVideo, a novel discrete video tokenizer with a Mamba-based encoder-decoder and channel-split quantization, achieving state-of-the-art performance in autoregressive video modeling.

Contribution

Introduces a Mamba-based architecture and channel-split quantization scheme that improve discrete video tokenization over previous methods.

Findings

1. Outperforms causal 3D convolution and Transformer-based models
2. Sets new state-of-the-art across multiple datasets
3. Demonstrates robustness in autoregressive video generation

Abstract

Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.
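The abstract names channel-split quantization but does not specify how it works. As a rough illustration only, the sketch below shows one generic way to quantize a latent vector by channel groups: the channels are split into equal groups and each group is snapped to its nearest entry in its own codebook (a product-quantization-style lookup). The function names, the equal-size grouping, and the nearest-neighbor codebooks are all assumptions made for this example; they are not taken from the paper and may differ from the authors' scheme, including how the per-group indices relate to the final token count.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def channel_split_quantize(latent, codebooks):
    """Quantize one latent vector group by group (hypothetical helper).

    latent:    list of C floats for a single spatio-temporal position.
    codebooks: one codebook per channel group; each codebook is a list of
               vectors of length C // len(codebooks).
    Returns (indices, quantized), with one index per channel group.
    """
    num_groups = len(codebooks)
    if num_groups == 0 or len(latent) % num_groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    group_size = len(latent) // num_groups

    indices, quantized = [], []
    for g, codebook in enumerate(codebooks):
        # Slice out this group's channels and look up its nearest code.
        chunk = latent[g * group_size:(g + 1) * group_size]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


if __name__ == "__main__":
    # Toy setup: 4 channels split into 2 groups, each with a 2-entry codebook.
    toy_codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 1.0]],
    ]
    indices, quantized = channel_split_quantize([0.9, 0.8, 0.1, -0.2], toy_codebooks)
    print(indices, quantized)
```

In this generic setup, G groups with K codes each can represent K^G distinct quantized vectors per position (here 2^2 = 4), which is one way grouped quantization can increase the expressiveness of a quantized latent without enlarging any single codebook.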
