TerDiT: Ternary Diffusion Models with Transformers
Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Xue Yang, Junchi Yan, Peng Gao, Hongsheng Li

TL;DR
This paper introduces TerDiT, a quantization-aware training scheme for ternarizing large diffusion transformer models, enabling efficient deployment with minimal performance loss in high-fidelity image generation.
Contribution
It presents the first ternarization and low-bit deployment method for diffusion transformer models, significantly reducing model size and computational costs.
Findings
Low-bit TerDiT models maintain competitive image quality.
Effective quantization-aware training enables from-scratch low-bit diffusion models.
Demonstrates feasibility of deploying large-scale DiT models efficiently.
Abstract
Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion transformer models (DiTs). Among diffusion models, diffusion transformers have demonstrated superior image-generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their excessive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models, such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, we propose TerDiT, the first quantization-aware training (QAT) and efficient deployment scheme for extremely low-bit diffusion transformer models. We focus on the ternarization of DiT networks, with model sizes ranging…
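To make "ternarization" concrete, here is a minimal sketch of one common ternary weight quantizer, the absmean rule used by BitNet b1.58 (which a reviewer below compares TerDiT against). This summary does not state TerDiT's exact quantizer, so the rule itself, the function names ternarize and dequantize, and the sample weights are all assumptions for illustration. In quantization-aware training, the full-precision weights are typically kept and re-quantized in each forward pass, with gradients passed through the rounding step by a straight-through estimator; that part is not shown here.

```python
def ternarize(weights, eps=1e-8):
    """Map a flat list of floats to values in {-1, 0, +1} plus one scale.

    Assumed rule (absmean): divide by the mean absolute weight,
    round to the nearest integer, then clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Recover an approximation of the original weights."""
    return [t * scale for t in ternary]


# Hypothetical weights, chosen only to exercise all three ternary values.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
ternary, scale = ternarize(weights)
approx = dequantize(ternary, scale)
print(ternary)  # [1, 0, 1, -1, 0, -1]
```

Each weight then needs only about 1.58 bits (log2 of 3) plus one shared scale per tensor, which is where the reduction in model size comes from.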
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper is the first to utilize ternary blocks in diffusion transformers for quantization-aware training. 2. The paper identifies a major problem in post-training quantization of DiT and illustrates the drastic drop in performance that occurs when post-training quantization to about 1-2 bits is used. 3. The paper identifies a major problem that arises in low-bit quantization-aware training using ternary blocks, namely the high activation values, and proposes a simple fix based on L
1. The method lacks novelty; the main contribution of the paper is in utilizing the ideas in [1] for replacing the attention and feedforward blocks and [2] for introducing RMSNorm for a stable training process. 2. The results in Figure 6 are surprising since the training losses with and without RMSNorm converge to similar values but the FID performance shows a drastic difference. Could the authors comment further on this? 3. In Figure 6, could the authors also please provide the results wit
1. This is the first attempt to ternarize DiT, achieving generative results that can even rival those of full-precision models. 2. Compared to BitNet b1.58, the manuscript selectively applies RMS Norm based on the structural characteristics of DiT. 3. Comparative results across multiple models confirm the effectiveness of TerDiT.
1. The paper lacks a clear motivation for applying a ternary quantizer to DiT. Given effective precedents for binary DM and 2-bit DM, I believe evidence is needed to demonstrate that a ternary DM can significantly outperform binary DM in accuracy or achieve substantial efficiency gains over 2-bit DM in practical use. 2. The issue of increased activations does not necessarily seem to be an inherent problem of ternarization. For example, better initialization of α in Equation 2 might straightforw
- Competent QAT method for Diffusion Transformers. - QAT jumps to very low bit, ternary quantization. - Experimental results, when presented, are convincing. - The paper is well-written, easy to understand, and competently presented. - Visual results per Fig. 4 and otherwise are impressive.
- The evaluation on ImageNet is incomplete. Specifically, the authors consider two resolutions (256 and 512, corresponding to the original DiT) and two model sizes (600M param XL/2 and a sized-up 4.2B parameter version), but their application of the two models is inconsistent across experiments, e.g., the 600M model is only used for 256x256 for CFG in Table 1. - Table 1 results are decently convincing but not great, e.g., L337 TerDiT-600M loses substantially to the original DiT-XL/2 (almost dou
- The proposed method achieves the ternarization of DiT model through the QAT technique, enhancing the efficiency of the DiT model. - The additional RMSNorm after AdaLN to address the issue of large activation values is simple and efficient.
- The technical contribution of this paper appears to be merely the introduction of an extra RMSNorm. The CUDA kernel deployment is from previous work, and the model architecture follows the previous Large-DiT. Therefore, I raise concerns about the technical contribution of this paper. - The model architectures evaluated in this work are somewhat limited. For example, previous work on diffusion model quantization included various diffusion architectures, such as LDM, StableDiffusion, etc. Howev
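Several of the reviews above refer to an extra RMSNorm applied after the adaLN module to tame large activation values. The following is a rough sketch of that idea under stated assumptions: the adaLN output is treated as a flat vector that is split into a shift and a scale, the learned gain that RMSNorm layers usually carry is omitted, and the names rms_norm and modulate as well as all numeric values are hypothetical. It is not the authors' implementation.

```python
import math


def rms_norm(values, eps=1e-6):
    """Rescale a vector so its root mean square is about 1 (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v / rms for v in values]


def modulate(x, shift, scale):
    """adaLN-style modulation as used in DiT: x * (1 + scale) + shift."""
    return [xi * (1.0 + sc) + sh for xi, sh, sc in zip(x, shift, scale)]


# Hypothetical large outputs of a ternarized adaLN projection.
raw = [120.0, -80.0, 45.0, -200.0]
normed = rms_norm(raw)

# Assumed layout: first half is the shift, second half is the scale.
shift, scale = normed[:2], normed[2:]
x = [0.5, -1.0]  # hypothetical token features
out = modulate(x, shift, scale)
```

Without the normalization, the shift and scale would be on the order of hundreds and would dominate the features they modulate; after it, they are on the order of one.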
Taxonomy
Topics: Theoretical and Computational Physics
Methods: Focus · Diffusion
