EdgeFlex-Transformer: Transformer Inference for Edge Devices
Shoaib Mohammad, Guanqun Song, Ting Zhu

TL;DR
This paper introduces EdgeFlex-Transformer, a multi-stage optimization pipeline that compresses and accelerates Vision Transformers (ViTs) for edge devices. It reports a 76% reduction in peak memory and over 6x lower latency on CIFAR-10 while maintaining or improving accuracy.
Contribution
It presents a novel combination of activation profiling, structured pruning, selective mixed-precision execution, and quantization to enable efficient transformer inference on resource-constrained edge hardware.
Findings
76% reduction in peak memory usage
Over 6x lower latency on CIFAR-10
Maintains or improves accuracy after optimization
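For a sense of scale, weight storage alone can be estimated from the parameter count and the bytes used per parameter. The sketch below uses the 632 million parameter figure quoted in the abstract. The list of precisions, including the INT8 and INT4 rows, is an illustrative assumption and not a configuration taken from the paper. It covers weights only, so it does not reproduce the peak-memory figure above, which also depends on activations and runtime buffers.

```python
# Rough weight-only memory estimate for a 632M-parameter model.
# The precision list is illustrative, not the paper's configuration.
PARAMS = 632_000_000
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params, bytes_per_param):
    """Return weight storage in GiB for a given parameter width."""
    return n_params * bytes_per_param / 2**30


footprint = {name: weight_memory_gib(PARAMS, width)
             for name, width in BYTES_PER_PARAM.items()}

for name, gib in footprint.items():
    print(f"{name}: {gib:.2f} GiB")
```

Halving the bytes per parameter halves the weight storage, which is why precision reduction is usually paired with pruning when a fixed memory budget has to be met.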
Abstract
Deploying large-scale transformer models on edge devices presents significant challenges due to strict constraints on memory, compute, and latency. In this work, we propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs) for deployment in resource-constrained environments. Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Starting from a ViT-Huge backbone with 632 million parameters, we first identify low-importance channels using activation statistics collected via forward hooks, followed by structured pruning to shrink the MLP layers under a target memory budget. We further apply FP16 conversion to selected…
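The profiling and pruning steps described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The paper collects activation statistics via forward hooks, whereas this toy version calls the hidden layer directly. The scoring rule (mean absolute activation), the fixed `keep` count standing in for a memory budget, and all function names are assumptions made for illustration.

```python
import random


def hidden(x, w1, b1):
    """Hidden activations of a two-layer MLP: ReLU(W1 x + b1)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)]


def forward(x, w1, b1, w2, b2):
    """Full MLP output: W2 * hidden(x) + b2."""
    h = hidden(x, w1, b1)
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]


def profile_channels(calib, w1, b1):
    """Mean absolute activation per hidden channel over a calibration set."""
    totals = [0.0] * len(w1)
    for x in calib:
        for j, a in enumerate(hidden(x, w1, b1)):
            totals[j] += abs(a)
    return [t / len(calib) for t in totals]


def prune_mlp(w1, b1, w2, scores, keep):
    """Structured pruning: keep the `keep` highest-scoring hidden channels."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    w1_p = [w1[j] for j in kept]                   # drop rows of layer 1
    b1_p = [b1[j] for j in kept]
    w2_p = [[row[j] for j in kept] for row in w2]  # drop matching columns of layer 2
    return w1_p, b1_p, w2_p, kept


# Toy setup: 8 hidden channels, two of which can never fire.
rng = random.Random(0)
n_in, n_hidden, n_out = 4, 8, 3
w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [5.0] * n_hidden
b1[2] = b1[5] = -100.0  # pre-activation stays negative, so ReLU outputs 0
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out
calib = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(32)]

scores = profile_channels(calib, w1, b1)
w1_p, b1_p, w2_p, kept = prune_mlp(w1, b1, w2, scores, keep=6)

x = calib[0]
full_out = forward(x, w1, b1, w2, b2)
pruned_out = forward(x, w1_p, b1_p, w2_p, b2)
max_diff = max(abs(a - b) for a, b in zip(full_out, pruned_out))
print("kept channels:", kept, "max output difference:", max_diff)
```

In this toy case channels 2 and 5 score zero and are removed, and the pruned MLP matches the full one because the removed channels contributed nothing. On a real model the low-scoring channels are rarely exactly inactive, so some output drift should be expected and checked against accuracy.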
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
