EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices

Shaibal Saha; Lanyu Xu

arXiv:2506.11093 · cs.CV · June 16, 2025

TL;DR

EfficientQuant introduces a structure-aware post-training quantization method that significantly reduces latency and resource usage of hybrid CNN-Transformer models on edge devices with minimal accuracy loss.

Contribution

It presents a novel PTQ approach tailored for hybrid models, applying different quantization schemes to convolutional and transformer blocks for improved efficiency.

Findings

1. Achieves 2.5x to 8.7x latency reduction on ImageNet-1K.

2. Keeps accuracy loss on ImageNet-1K minimal.

3. Demonstrates low latency and memory efficiency on edge devices.

Abstract

Hybrid models that combine convolutional and transformer blocks offer strong performance in computer vision (CV) tasks but are resource-intensive for edge deployment. Although post-training quantization (PTQ) can help reduce resource demand, its application to hybrid models remains limited. We propose EfficientQuant, a novel structure-aware PTQ approach that applies uniform quantization to convolutional blocks and $\log_{2}$ quantization to transformer blocks. EfficientQuant achieves $2.5\times$ to $8.7\times$ latency reduction with minimal accuracy loss on the ImageNet-1K dataset. It further demonstrates low latency and memory efficiency on edge devices, making it practical for real-world deployment.
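
To make the two schemes concrete, here is a minimal pure-Python sketch of a uniform quantizer and a $\log_{2}$ quantizer, plus a dispatcher that picks one by block type. This is not the authors' implementation: the function names, the block-type labels `"conv"` and `"transformer"`, the bit widths, and the assumption that the $\log_{2}$ scheme is applied to values in (0, 1] (such as post-softmax attention scores) are all illustrative choices that the abstract does not specify.

```python
import math


def uniform_quantize(values, num_bits=8):
    """Asymmetric uniform quantization: evenly spaced levels over [min, max].

    Returns (integer codes, dequantized floats).
    """
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequant = [c * scale + lo for c in codes]
    return codes, dequant


def log2_quantize(values, num_bits=4):
    """log2 quantization for values assumed to lie in (0, 1].

    Each value v is stored as the integer exponent q = round(-log2(v)),
    clamped to the code range, and reconstructed as 2**-q.
    """
    max_code = 2 ** num_bits - 1
    codes = []
    for v in values:
        if v <= 0.0:
            # Non-positive inputs have no logarithm; map them to the smallest level.
            codes.append(max_code)
        else:
            q = round(-math.log2(v))
            codes.append(min(max(q, 0), max_code))
    dequant = [2.0 ** -c for c in codes]
    return codes, dequant


def quantize_block(block_type, values, num_bits=8):
    """Hypothetical structure-aware dispatch: choose the scheme by block type."""
    if block_type == "conv":
        return uniform_quantize(values, num_bits)
    if block_type == "transformer":
        return log2_quantize(values, num_bits)
    raise ValueError(f"unknown block type: {block_type!r}")
```

For example, `log2_quantize([1.0, 0.5, 0.25, 0.1], num_bits=4)` yields the codes `[0, 1, 2, 3]` and the reconstructed values `[1.0, 0.5, 0.25, 0.125]`. The levels are dense near zero and sparse near one, which is the usual motivation for logarithmic schemes on skewed distributions. Uniform levels suit roughly evenly spread values.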

Taxonomy

Topics: Neural Networks and Applications