ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers
Feice Huang, Zuliang Han, Xing Zhou, Yihuang Chen, Lifei Zhu, Haoqian Wang

TL;DR
ConvRot introduces a novel rotation-based quantization method that significantly reduces memory and computation for diffusion transformers, enabling efficient 4-bit inference without retraining while maintaining high image quality.
Contribution
The paper presents ConvRot, a group-wise rotation-based quantization technique built on the regular Hadamard transform (RHT), and ConvLinear4bit, a plug-and-play module for efficient W4A4 inference in diffusion transformers.
Findings
Achieves 2.26× speedup and 4.05× memory reduction on FLUX.1-dev.
Maintains high image fidelity with 4-bit quantization.
First application of rotation-based quantization for plug-and-play diffusion transformer inference.
Abstract
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without…
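The pipeline the abstract describes (rotate, quantize, multiply, dequantize) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration, not the authors' implementation: the particular 4×4 regular Hadamard matrix `H4`, the helper names `regular_hadamard`, `rotate_groups`, and `quantize_int4`, the group size of 4, and the symmetric per-vector scaling are all hypothetical choices.

```python
import math

# One standard 4x4 regular Hadamard matrix (every row and column sums to 2).
# Assumption: the paper's own H_4 may be a different but equivalent choice.
H4 = [
    [1, 1, 1, -1],
    [1, 1, -1, 1],
    [1, -1, 1, 1],
    [-1, 1, 1, 1],
]

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def regular_hadamard(n):
    """Orthonormal regular Hadamard matrix of size n, where n is a power of 4."""
    h = H4
    while len(h) < n:
        h = kron(h, H4)
    if len(h) != n:
        raise ValueError("n must be a power of 4")
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def rotate_groups(vec, rot):
    """Multiply each contiguous group of len(rot) entries of vec by rot."""
    g = len(rot)
    out = []
    for start in range(0, len(vec), g):
        chunk = vec[start:start + g]
        out.extend(sum(chunk[i] * rot[i][j] for i in range(g)) for j in range(g))
    return out

def quantize_int4(vec):
    """Symmetric 4-bit quantization (hypothetical scheme): ints in [-7, 7] and a scale."""
    scale = max(abs(v) for v in vec) / 7.0 or 1.0
    q = [max(-7, min(7, round(v / scale))) for v in vec]
    return q, scale

x = [10.0, 0.1, -0.2, 0.3, 0.5, -0.4, 0.2, 0.1]  # activation row with one outlier
w = [0.3, -0.1, 0.2, 0.4, -0.2, 0.1, 0.3, -0.3]  # weight row

rot = regular_hadamard(4)
xr, wr = rotate_groups(x, rot), rotate_groups(w, rot)

exact = sum(a * b for a, b in zip(x, w))
(qx, sx), (qw, sw) = quantize_int4(xr), quantize_int4(wr)
approx = sx * sw * sum(a * b for a, b in zip(qx, qw))  # integer dot product, then dequantize

print("peak before / after rotation:", max(map(abs, x)), max(map(abs, xr)))
print("exact vs. W4A4 estimate:", exact, approx)
```

Because each group rotation is orthogonal, rotating both operands leaves the full-precision dot product unchanged while spreading the outlier across its group. With a fixed group size g, rotating a length-d vector this way costs about d·g multiply-adds, compared with d² for a dense d×d rotation, which is one reading of the abstract's "quadratic to linear" claim.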
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- This work addresses a known failure mode of global Hadamards in DiTs.
- The quantization method is demonstrated on real GPUs and can provide end-to-end speedup.
- The quality and performance metrics still fall behind the state-of-the-art quantization method, SVDQuant.
- Although the authors claim their method smooths both row-wise and column-wise outliers, the evaluation results still show a significant quality drop when proj_out is quantized to 4 bits, limiting the usefulness of the proposed method.
- Only one workload (FLUX.1-dev) is used for evaluation.
- Code is not provided.
1. Practical design with group-wise rotations that reduce computational cost and enable a plug-and-play, training-free W4A4 quantization scheme.
2. The method does not require a calibration dataset, which simplifies implementation.
3. Systems-level contribution via the `ConvLinear4bit` kernel, bridging algorithmic design with practical engineering to foster adoption.
1. The evaluation scope is limited, focusing primarily on FLUX.1-dev. Broader testing across more diffusion transformer variants would be necessary to establish the generality of the method.
2. The paper lacks a detailed, operator-level breakdown of performance. A thorough analysis of time and memory overheads for rotation, quantization/dequantization, and GEMM would provide a clearer picture of the practical costs.
3. The paper does not sufficiently contrast the proposed technique with existing
* The paper proposes a simple yet elegant method to deal with issues introduced by Hadamard rotations for activation and weight quantization.
* The paper introduces a simple plug-and-play method that can effectively reduce the memory and compute requirements of LVMs without substantially changing the architecture or requiring a calibration or retraining procedure.
* The paper uses Outlier Amplitude as a metric to measure the effectiveness of the proposed method. However, the metric is not clearly introduced or motivated.
* The comparison with QuaRot is not exhaustive since the benefits of using (block) Regular Hadamard instead of (block) Sylvester Hadamard are not compared directly in terms of end-to-end performance.
* Some of the computational considerations behind the choice of the use of a convolutional matrix multiplication instead of the FHWT alg
I like the paper overall! Some strengths:
* The presentation of the paper is strong, with clear motivation, illustrations, and a clear method.
* The idea of deviating from the standard FHT by including the $H_4$ (Eq. 10) term is novel and useful. Zeroth-channel outliers caused by a naive Hadamard transform are a real problem, and addressing them directly makes sense to me.
* The results are promising for a method without calibration/QAT.
**1. Baselines.** My major concern is the limited results. (a) Arguably the closest baselines, QuaRot or FWHT applied to a smaller size (e.g. 256), are not compared against in the end-to-end results of Table 2. Recent transformation literature, e.g. FlatQuant, OstQuant, and HadaNorm, is also missing, and may well do better than ConvRot. (b) I'm also a bit doubtful about the speed comparison w.r.t. FWHT: why would RHT be any faster?

**2. Rollout.** The authors mention they use "rollout", i
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Advanced Neuroimaging Techniques and Applications
