MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization
JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan

TL;DR
MQuant is a novel post-training quantization framework that significantly reduces the inference latency of multimodal large language models while maintaining near-floating-point accuracy, enabling more practical deployment on resource-limited devices.
Contribution
MQuant introduces modality-specific static quantization, attention-invariant flexible switching, and rotation magnitude suppression to address unique challenges in quantizing MLLMs.
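To make the first of these ideas concrete, here is a minimal sketch of what modality-specific static quantization could look like. It is not taken from the paper: it assumes symmetric per-tensor int8 scales that are calibrated offline, once for visual tokens and once for text tokens, and the names `calibrate_scale` and `fake_quant_tokens` and all numeric values are hypothetical toy choices.

```python
from typing import Dict, List, Sequence


def calibrate_scale(samples: Sequence[float], n_bits: int = 8) -> float:
    """Symmetric per-tensor scale fixed offline from calibration activations."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(x) for x in samples)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(x: float, scale: float, n_bits: int = 8) -> int:
    """Round to the nearest integer level and clamp to the signed range."""
    qmax = 2 ** (n_bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(x / scale)))


def fake_quant_tokens(
    tokens: List[List[float]], is_visual: List[bool], scales: Dict[str, float]
) -> List[List[float]]:
    """Quantize then dequantize each token with the scale of its modality."""
    out = []
    for tok, vis in zip(tokens, is_visual):
        s = scales["visual" if vis else "text"]
        out.append([quantize(x, s) * s for x in tok])
    return out


# Toy calibration data (assumption): visual activations have a much wider range.
visual_calib = [-20.0, 12.5, 18.0, -7.0]
text_calib = [-0.5, 0.25, 0.4, -0.1]

# Modality-specific scales versus one scale shared by both modalities.
msq_scales = {"visual": calibrate_scale(visual_calib), "text": calibrate_scale(text_calib)}
shared = calibrate_scale(visual_calib + text_calib)
shared_scales = {"visual": shared, "text": shared}

tokens = [[0.3, -0.2], [15.0, -9.0]]  # one text token, one visual token
is_visual = [False, True]

out_msq = fake_quant_tokens(tokens, is_visual, msq_scales)
out_shared = fake_quant_tokens(tokens, is_visual, shared_scales)
```

In this toy setup the shared scale is dominated by the visual range, so the text token is rounded far more coarsely than it is with its own scale. No scale is computed at inference time in either case, which is the property that distinguishes static from per-token dynamic quantization.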
Findings
Achieves <1% accuracy degradation with W4A8 quantization.
Reduces inference latency by up to 30%.
Outperforms existing PTQ baselines on five mainstream MLLMs.
Abstract
Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of MLLMs. Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static…
Peer Reviews
Decision · Submitted to ICLR 2025
1. This paper focuses on a valuable question, i.e., quantization in MLLMs.
2. Well presented with figures and tables.
3. Overall performance is superior to some LLM quantization baselines.

1. MSQ and AIFS are simply trivial adaptations of per-token dynamic quantization to MLLMs. It would be better for this to serve as a baseline model.
2. MSQ and MSQ + AIFS exhibit marginal improvement over the per-tensor static baseline in Table 4.
3. Please discuss the overhead of MSQ; otherwise, why not use token-specific quantization?
4. Although MSQ + AIFS is proposed to address the token increase brought by larger image resolutions, the speedup fails to exhibit great advantages over per-token d

1. The paper is well-written and easy to follow.
2. The modality-specific quantization and LayerNorm-to-RMSNorm transformation are well-motivated by the distributional differences of various modality modules and architectural designs.
3. Comprehensive experimental results are provided on various MLLMs, with comparisons to several popular recent LLM quantization methods.

1. Attention-Invariant Flexible Switching (AIFS) Scheme: The authors claim that the proposed AIFS scheme is computationally equivalent to the original attention computation. However, it is unclear whether the corresponding positional embeddings are adjusted accordingly. If not, the equivalence may not be ensured.
2. Experiment Settings: There are concerns regarding the experimental settings. In Section 5.1, the authors conducted experiments under the "text-image-text" setting with 15 textual to

1. The paper follows an intuitive approach to study MLLM quantization. The authors identify the issues based on some observations in the experiments and resolve the problem in a step-by-step manner.
2. The efficacy of the method is supported by extensive experiments. The paper shows the quantization performance of 5 mainstream MLLM models on various multi-modal tasks. The ablation studies demonstrate the usefulness of different components in maintaining the performance near the floating-point base

1. The delivery of the paper needs significant improvement. The text is highly redundant.
   - Introduction: The content of the second-to-last paragraph largely overlaps with the main contribution part. It could be beneficial if these two parts were reorganized or condensed.
   - Methodology: In 4.1, there are abundant words to explain why MSQ and AIFS are needed and the benefits they bring. To me, these are intuitive and simple operations which only need concise words for explanation. For 4.2
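The equivalence question raised in the reviews above can be illustrated with a small sketch. It is not the paper's implementation: it assumes that AIFS amounts to reordering a mixed sequence so that visual tokens are contiguous, permuting the causal mask in the same way, and that positional information is already folded into the query and key vectors before reordering so it travels with each token. All vectors are toy values.

```python
import math


def attention(q, k, v, mask):
    """Single-head softmax attention; mask[i][j] is True if query i may see key j."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j in range(len(k))
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked entries become 0.0
        total = sum(weights)
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) / total
             for c in range(len(v[0]))]
        )
    return out


# Toy "text-image-text" sequence: text, visual, visual, text.
is_visual = [False, True, True, False]
q = [[0.1, 0.2], [0.5, -0.3], [-0.4, 0.9], [0.7, 0.1]]
k = [[0.3, -0.1], [0.2, 0.8], [-0.6, 0.4], [0.0, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 3.0]]
n = len(q)

# Reference: ordinary causal attention in the original token order.
causal = [[j <= i for j in range(n)] for i in range(n)]
reference = attention(q, k, v, causal)

# Reorder so visual tokens come first, and permute the mask rows and columns to match.
perm = [i for i in range(n) if is_visual[i]] + [i for i in range(n) if not is_visual[i]]
q_p, k_p, v_p = ([x[p] for p in perm] for x in (q, k, v))
mask_p = [[causal[pi][pj] for pj in perm] for pi in perm]
out_p = attention(q_p, k_p, v_p, mask_p)

# Undo the permutation to compare against the reference.
restored = [None] * n
for new_idx, old_idx in enumerate(perm):
    restored[old_idx] = out_p[new_idx]

# Counter-example: reusing the plain causal mask after reordering changes the result.
naive = attention(q_p, k_p, v_p, causal)
```

Under these assumptions `restored` matches `reference` up to floating-point rounding, while `naive` does not. If position information were instead recomputed from the new token order, the query and key vectors themselves would change and this argument would no longer apply, which is the case the reviewer asks about.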
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
