6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models
Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu

TL;DR
This paper introduces a dynamic mixed-precision quantization framework and a technique that exploits temporal redundancy, cutting the memory and computation costs of video diffusion model inference (1.92× end-to-end acceleration and 3.32× lower memory usage).
Contribution
It proposes a novel adaptive mixed-precision quantization method and a temporal delta cache mechanism for efficient video diffusion model inference.
Findings
Achieves 1.92× end-to-end acceleration
Reduces memory usage by 3.32×
Maintains high generation quality
Abstract
Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking how the quantization difficulty of activations varies across diffusion timesteps, which leads to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory…
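The allocation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the relative L2 difference metric, the toy calibration data, the error threshold, and the assumption that a larger input-output difference means higher sensitivity are all hypothetical choices made for illustration. The sketch fits a linear predictor from a block's input-output difference to a quantization error estimate, then picks NVFP4 when the predicted error is within budget and INT8 otherwise.

```python
import math

def relative_io_difference(block_input, block_output):
    """Relative L2 distance between a block's input and output (assumed metric)."""
    diff = math.sqrt(sum((o - i) ** 2 for i, o in zip(block_input, block_output)))
    norm = math.sqrt(sum(i ** 2 for i in block_input))
    return diff / (norm + 1e-12)

def fit_linear(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x

def choose_precision(io_diff, slope, intercept, max_error):
    """Pick the lower-precision format only when the predicted error is within budget."""
    predicted_error = slope * io_diff + intercept
    return "NVFP4" if predicted_error <= max_error else "INT8"

# Hypothetical calibration pairs: (input-output difference, measured quantization error).
calib_diffs = [0.1, 0.2, 0.3, 0.4]
calib_errors = [0.02, 0.04, 0.06, 0.08]
slope, intercept = fit_linear(calib_diffs, calib_errors)

# Two toy blocks: one whose output barely moves from its input, one that changes a lot.
stable_diff = relative_io_difference([1.0, 2.0, 2.0], [1.0, 2.0, 2.3])
volatile_diff = relative_io_difference([1.0, 2.0, 2.0], [1.0, 2.0, 3.5])

stable_choice = choose_precision(stable_diff, slope, intercept, max_error=0.05)
volatile_choice = choose_precision(volatile_diff, slope, intercept, max_error=0.05)
```

In this toy setup the decision is made per block and per timestep from a single scalar, which is what keeps such a predictor cheap at inference time; the real framework may use a different statistic or decision rule.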
Taxonomy
Topics: Image and Video Quality Assessment · Video Coding and Compression Technologies · Advanced Data Compression Techniques
