SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
Wonsuk Jang, Thierry Tambe

TL;DR
SemanticDialect introduces a semantic-aware mixed-format quantization method for Video Diffusion Transformers, reducing memory and computation costs while maintaining high video quality suitable for edge deployment.
Contribution
It proposes a novel block-wise mixed-format quantization framework with semantic-aware format assignment and attention-guided activation decomposition.
Findings
Achieves higher video quality than prior quantization methods and prior block-wise formats.
Approaches FP16 quality on Open-Sora 2.0.
Validates hardware deployability through RTL and GPU implementations.
Abstract
Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and computational footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality due to high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which advances block-wise mixed-format quantization. In this framework, each block selects an optimal format (dialect) from a candidate set (formatbook), which is augmented with lookup tables that store quantization errors and quantized indices, enabling efficient per-block format selection and quantization with minimal online overhead. We further introduce attention-guided activation decomposition, which reduces quantization error via residual quantization, and semantic-aware dialect assignment (SeDA), which reduces…
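To make the per-block format selection described in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a toy formatbook with two hypothetical dialects (`uniform` and `dense_near_zero`), per-block max-abs scaling, and mean squared error as the selection criterion; all names and level sets are illustrative. It computes errors directly rather than reading them from lookup tables, and the residual step is plain two-stage residual quantization without the attention guidance or the semantic-aware assignment (SeDA) the paper describes.

```python
# Hypothetical formatbook: each dialect is a set of representable levels on a
# normalized [-1, 1] scale. These level sets are illustrative assumptions.
FORMATBOOK = {
    "uniform": [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0],
    "dense_near_zero": [-1.0, -0.5, -0.25, -0.125, 0.0, 0.125, 0.25, 0.5, 1.0],
}


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantize_block(block, levels):
    """Scale the block by its max magnitude, snap each value to the nearest
    level, and return the dequantized values."""
    scale = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
    return [min(levels, key=lambda l: abs(x / scale - l)) * scale for x in block]


def select_dialect(block, formatbook):
    """Try every dialect and keep the one with the lowest error.
    Returns (dialect_name, error, dequantized_block)."""
    best = None
    for name, levels in formatbook.items():
        deq = quantize_block(block, levels)
        err = mse(block, deq)
        if best is None or err < best[1]:
            best = (name, err, deq)
    return best


def quantize_tensor(values, block_size, formatbook):
    """Split a flat list into blocks and pick a dialect for each block."""
    names, out = [], []
    for start in range(0, len(values), block_size):
        name, _, deq = select_dialect(values[start:start + block_size], formatbook)
        names.append(name)
        out.extend(deq)
    return names, out


def residual_quantize(block, formatbook):
    """Two-stage residual quantization: quantize the block, then quantize
    what is left over and add the two dequantized results."""
    _, _, first = select_dialect(block, formatbook)
    residual = [x - q for x, q in zip(block, first)]
    _, _, second = select_dialect(residual, formatbook)
    return [a + b for a, b in zip(first, second)]


if __name__ == "__main__":
    values = [0.1, -0.12, 0.13, 1.0, 0.25, 0.5, -0.75, 1.0]
    names, deq = quantize_tensor(values, 4, FORMATBOOK)
    print(names)
    print(mse(values, deq))
```

In this toy setup, a block with several small values and one outlier tends to prefer the dialect with more levels near zero, while an evenly spread block prefers the uniform one, which is the intuition behind letting each block choose its own format.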
