SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers

Wonsuk Jang; Thierry Tambe

arXiv:2603.02883 · cs.CV · May 11, 2026

TL;DR

SemanticDialect introduces a semantic-aware mixed-format quantization method for Video Diffusion Transformers, reducing memory and computation costs while maintaining high video quality suitable for edge deployment.

Contribution

The paper proposes a block-wise mixed-format quantization framework with semantic-aware format assignment and attention-guided activation decomposition.

Findings

01. Outperforms prior quantization methods and block-wise formats in quality.

02. Approaches FP16 quality on Open-Sora 2.0.

03. Validated hardware deployability through RTL and GPU implementations.

Abstract

Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and computational footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality due to high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which advances block-wise mixed-format quantization. In this framework, each block selects an optimal format (dialect) from a candidate set (formatbook), which is augmented with lookup tables that store quantization errors and quantized indices, enabling efficient per-block format selection and quantization with minimal online overhead. We further introduce attention-guided activation decomposition, which reduces quantization error via residual quantization, and semantic-aware dialect assignment (SeDA), which reduces…
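To make the per-block format selection concrete, here is a minimal Python sketch of the general idea. It is not the authors' implementation. The level sets in `FORMATBOOK`, the block size, and every function name are hypothetical and chosen only for illustration. The sketch computes errors directly rather than reading them from precomputed lookup tables, and its residual step is plain two-pass residual quantization without any attention guidance.

```python
# Hypothetical formatbook: each "dialect" is a set of non-negative levels on a
# normalized [0, 1] grid. These level sets are illustrative assumptions, not
# the formats used in the paper.
FORMATBOOK = {
    "uniform": [0.0, 0.25, 0.5, 0.75, 1.0],
    "log_like": [0.0, 0.125, 0.25, 0.5, 1.0],
    "dense_high": [0.0, 0.5, 0.75, 0.875, 1.0],
}


def quantize_block(block, levels):
    """Scale a block by its max magnitude, snap each value to the nearest
    level (sign handled separately), then rescale."""
    scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    out = []
    for v in block:
        mag = abs(v) / scale
        q = min(levels, key=lambda lv: abs(lv - mag))
        out.append((q if v >= 0 else -q) * scale)
    return out


def sq_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_dialect(block, formatbook=FORMATBOOK):
    """Try every candidate format and keep the one with the lowest error."""
    best_name, best_q, best_err = None, None, float("inf")
    for name, levels in formatbook.items():
        q = quantize_block(block, levels)
        err = sq_error(block, q)
        if err < best_err:
            best_name, best_q, best_err = name, q, err
    return best_name, best_q, best_err


def quantize_tensor(values, block_size=4):
    """Quantize a flat list block by block; return chosen formats and values."""
    names, out = [], []
    for i in range(0, len(values), block_size):
        name, q, _ = select_dialect(values[i:i + block_size])
        names.append(name)
        out.extend(q)
    return names, out


def residual_quantize(values, block_size=4):
    """Two-pass residual quantization: quantize, then quantize the leftover."""
    _, first = quantize_tensor(values, block_size)
    residual = [x - q for x, q in zip(values, first)]
    _, second = quantize_tensor(residual, block_size)
    return [a + b for a, b in zip(first, second)]


if __name__ == "__main__":
    data = [1.0, 0.5, 0.25, 0.125, 1.0, 0.75, 0.5, 0.25]
    print(quantize_tensor(data))
```

In this toy setup, a block whose values fall off geometrically is matched by the log-like level set, while an evenly spaced block is matched by the uniform one. That is the intuition behind letting each block choose its own format. Because every level set here contains 0.0, the second pass in `residual_quantize` cannot increase the per-element error beyond floating-point rounding.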
