Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications
Krisvarish V, Priyadarshini T, K P Abhishek Sri Saai, Vaidehi, Vijayakumar

TL;DR
This paper introduces a memory-efficient transformer model that cuts memory usage by 52% and execution time by 33%, making it suitable for real-time, resource-constrained applications with only a small loss in accuracy.
Contribution
The paper presents a novel transformer architecture that halves the embedding size and applies parameter pruning and quantization, improving resource efficiency over existing models such as MobileBERT and DistilBERT.
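To make the effect of halving the embedding size concrete, the sketch below counts the weights in a token-embedding table before and after the change. The vocabulary size and base dimension are hypothetical values chosen for illustration; they are not taken from the paper, and the helper `embedding_params` is not part of the authors' code.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """Number of weights in an embedding table of shape (vocab_size, embed_dim)."""
    return vocab_size * embed_dim


# Hypothetical sizes for illustration only (not reported in the paper).
VOCAB_SIZE = 30_000
BASE_DIM = 512

full = embedding_params(VOCAB_SIZE, BASE_DIM)
halved = embedding_params(VOCAB_SIZE, BASE_DIM // 2)
reduction = 1 - halved / full  # fraction of embedding weights removed

print(f"full={full:,} halved={halved:,} reduction={reduction:.0%}")
```

The embedding table scales linearly with the embedding dimension, so halving it removes half of those weights. Any square projection matrix whose width is tied to the same dimension would shrink by roughly a factor of four, but this summary does not say whether the authors tie the two.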
Findings
52% reduction in memory usage
33% decrease in execution time
Outperforms MobileBERT and DistilBERT in resource efficiency
Abstract
This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model. Recently proposed transformer architectures have focused on parameter efficiency and computational optimization; however, such models usually still require considerable hardware resources when deployed in real-world applications on edge devices. The proposed approach addresses this concern by halving the embedding size and applying targeted techniques such as parameter pruning and quantization to shrink the memory footprint with minimal loss of accuracy. Experimental results show a 52% reduction in memory usage and a 33% decrease in execution time, resulting in better efficiency than state-of-the-art models. This work compared our…
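The abstract names parameter pruning and quantization without giving details. As a minimal sketch of what these two steps can look like, the code below applies magnitude pruning followed by symmetric per-tensor int8 quantization to a small weight list. Both choices are assumptions made for illustration: the paper's actual pruning criterion, sparsity level, and quantization scheme are not stated here, and the function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Indices sorted by absolute value; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return [scale * v for v in q]


# Toy weights, assumed for illustration only.
weights = [0.8, -0.05, 0.3, 0.01, -1.27, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

print(pruned)
print(q, scale)
```

Pruned entries can be stored sparsely and each surviving weight needs one byte instead of four, which is where the memory saving in a scheme like this comes from. The rounding error per weight is bounded by half the quantization step `scale`.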
Taxonomy
Topics: Embedded Systems Design Techniques · Real-time simulation and control systems · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Attention Dropout · Linear Layer · Softmax · Dense Connections · Linear Warmup With Linear Decay · Dropout · WordPiece · Adam
