EdgeFlex-Transformer: Transformer Inference for Edge Devices

Shoaib Mohammad; Guanqun Song; Ting Zhu

arXiv:2512.19741·cs.LG·December 24, 2025

Open Access

TL;DR

This paper introduces EdgeFlex-Transformer, a multi-stage optimization pipeline that compresses and accelerates Vision Transformers for edge devices. It reports a 76% reduction in peak memory usage and more than 6x lower latency on CIFAR-10, with accuracy maintained or improved.

Contribution

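The paper combines activation profiling, memory-aware structured pruning, selective mixed-precision execution, and activation-aware quantization into a single pipeline for efficient transformer inference on resource-constrained edge hardware, without retraining or task-specific fine-tuning.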

Findings

1. 76% reduction in peak memory usage
2. Over 6x lower latency on CIFAR-10
3. Maintains or improves accuracy after optimization

Abstract

Deploying large-scale transformer models on edge devices presents significant challenges due to strict constraints on memory, compute, and latency. In this work, we propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs) for deployment in resource-constrained environments. Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Starting from a ViT-Huge backbone with 632 million parameters, we first identify low-importance channels using activation statistics collected via forward hooks, followed by structured pruning to shrink the MLP layers under a target memory budget. We further apply FP16 conversion to selected…
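
The profiling and pruning steps described in the abstract can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the importance score (mean absolute hidden activation), the byte-budget formula, the toy layer sizes, and every function name here are assumptions chosen for illustration. In PyTorch, the activation statistics would typically be gathered with `nn.Module.register_forward_hook` rather than by calling the layer directly as done below.

```python
import math
import random


def gelu(x):
    """Exact GELU, the activation commonly used in ViT MLP blocks."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def hidden_activations(x, w1, b1):
    """First linear layer plus GELU: one value per hidden channel."""
    return [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)]


def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer MLP: linear -> GELU -> linear."""
    h = hidden_activations(x, w1, b1)
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]


def profile_channels(calib, w1, b1):
    """Assumed importance score: mean |activation| per hidden channel."""
    totals = [0.0] * len(w1)
    for x in calib:
        for j, a in enumerate(hidden_activations(x, w1, b1)):
            totals[j] += abs(a)
    return [t / len(calib) for t in totals]


def channels_within_budget(d_in, d_out, budget_bytes, bytes_per_param=4):
    """Largest hidden width whose weights and biases fit in the budget."""
    fixed = d_out * bytes_per_param                     # output bias
    per_channel = (d_in + 1 + d_out) * bytes_per_param  # w1 row, b1 entry, w2 column
    return max(0, (budget_bytes - fixed) // per_channel)


def prune_mlp(w1, b1, w2, b2, importance, keep):
    """Structured pruning: drop whole hidden channels with the lowest scores."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    kept = sorted(ranked[:keep])
    return ([w1[j] for j in kept],
            [b1[j] for j in kept],
            [[row[j] for j in kept] for row in w2],
            list(b2),
            kept)


def demo():
    rng = random.Random(0)
    d_in, d_hidden, d_out = 4, 8, 4  # toy sizes, far smaller than ViT-Huge
    w1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
    b2 = [rng.gauss(0, 1) for _ in range(d_out)]
    for j in (2, 5):  # make two channels inactive so they score zero
        w1[j] = [0.0] * d_in
        b1[j] = 0.0
    calib = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(16)]

    importance = profile_channels(calib, w1, b1)
    keep = min(d_hidden, channels_within_budget(d_in, d_out, budget_bytes=240))
    pw1, pb1, pw2, pb2, kept = prune_mlp(w1, b1, w2, b2, importance, keep)

    # Largest output change on the calibration inputs after pruning.
    max_diff = max(abs(a - b)
                   for x in calib
                   for a, b in zip(mlp_forward(x, w1, b1, w2, b2),
                                   mlp_forward(x, pw1, pb1, pw2, pb2)))
    return kept, max_diff


if __name__ == "__main__":
    kept, max_diff = demo()
    print("kept channels:", kept)
    print("max output difference:", max_diff)
```

Because pruning removes a hidden channel's row in the first layer together with its column in the second, the pruned block stays dense and needs no sparse kernels. In this toy setup the two silenced channels contribute nothing to the output, so removing them is lossless; on a real model the low-scoring channels are only approximately inactive and some accuracy change should be expected.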

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors