EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices
Shaibal Saha, Lanyu Xu

TL;DR
EfficientQuant introduces a structure-aware post-training quantization method that significantly reduces latency and resource usage of hybrid CNN-Transformer models on edge devices with minimal accuracy loss.
Contribution
It presents a novel PTQ approach tailored for hybrid models, applying different quantization schemes to convolutional and transformer blocks for improved efficiency.
Findings
Achieves 2.5x to 8.7x latency reduction on ImageNet-1K.
Incurs only minimal accuracy loss.
Demonstrates practicality on edge devices.
Abstract
Hybrid models that combine convolutional and transformer blocks offer strong performance in computer vision (CV) tasks but are resource-intensive for edge deployment. Although post-training quantization (PTQ) can help reduce resource demand, its application to hybrid models remains limited. We propose EfficientQuant, a novel structure-aware PTQ approach that applies uniform quantization to convolutional blocks and log2 quantization to transformer blocks. EfficientQuant achieves 2.5x to 8.7x latency reduction with minimal accuracy loss on the ImageNet-1K dataset. It further demonstrates low latency and memory efficiency on edge devices, making it practical for real-world deployment.
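To make the idea of block-specific quantization concrete, here is a minimal NumPy sketch of two quantizer families. It assumes a symmetric per-tensor uniform quantizer for convolutional weights and a power-of-two (log2) quantizer for non-negative transformer activations such as softmax outputs. The function names, bit widths, and choice of tensors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def uniform_quantize(x, num_bits=8):
    """Symmetric per-tensor uniform quantization; returns (integer codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def uniform_dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return q.astype(np.float32) * scale


def log2_quantize(x, num_bits=4):
    """Log2 quantization for values in (0, 1]: store exponent k so that x ~ 2**-k."""
    kmax = 2 ** num_bits - 1
    eps = 2.0 ** -kmax  # smallest representable magnitude
    k = np.round(-np.log2(np.clip(x, eps, 1.0)))
    return np.clip(k, 0, kmax).astype(np.int32)


def log2_dequantize(k):
    """Recover the power-of-two approximation from stored exponents."""
    return np.power(2.0, -k.astype(np.float32))


# Hypothetical tensors standing in for a conv weight and an attention row.
conv_weight = np.array([-1.0, -0.3, 0.0, 0.42, 0.9], dtype=np.float32)
attn_probs = np.array([0.5, 0.25, 0.125, 0.125], dtype=np.float32)

q_w, s_w = uniform_quantize(conv_weight)
k_a = log2_quantize(attn_probs)
```

Uniform steps suit roughly symmetric weight distributions, while power-of-two levels spend more resolution near zero, which is where most softmax probabilities fall; dequantizing a log2 code is also just a bit shift in integer arithmetic.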
Taxonomy
Topics: Neural Networks and Applications
