Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications

Krisvarish V; Priyadarshini T; K P Abhishek Sri Saai; Vaidehi; Vijayakumar

arXiv:2501.00042·cs.LG·January 3, 2025

TL;DR

This paper introduces a memory-efficient transformer model that reduces memory usage by 52% and execution time by 33%, making it suitable for real-time, resource-constrained applications with only a small loss in accuracy.

Contribution

The paper presents a novel transformer architecture that halves embedding size and employs pruning and quantization, achieving substantial efficiency improvements over existing models.
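To make the effect of halving the embedding size concrete, the following is a minimal back-of-the-envelope sketch. The vocabulary size, the full embedding width of 768, and the 4-byte parameters are illustrative assumptions, not values taken from the paper, and `embedding_bytes` is a hypothetical helper.

```python
def embedding_bytes(vocab_size, embed_dim, bytes_per_param=4):
    """Memory taken by a dense embedding table, in bytes."""
    return vocab_size * embed_dim * bytes_per_param


# Assumed values for illustration only (not reported in the paper).
VOCAB_SIZE = 30_000
FULL_DIM = 768
HALF_DIM = FULL_DIM // 2  # the "halved" embedding size

full = embedding_bytes(VOCAB_SIZE, FULL_DIM)
half = embedding_bytes(VOCAB_SIZE, HALF_DIM)
saving = 1 - half / full

print(f"full table: {full / 1e6:.2f} MB")
print(f"half table: {half / 1e6:.2f} MB")
print(f"saving:     {saving:.0%}")
```

This covers only the embedding table. The memory reduction for a whole model also depends on the other layers and on the pruning and quantization steps, so it need not equal the saving computed here.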

Findings

01 52% reduction in memory usage

02 33% decrease in execution time

03 Outperforms MobileBERT and DistilBERT in resource efficiency

Abstract

This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model. Recent transformer architectures have focused on parameter efficiency and computational optimization; however, such models usually still require considerable hardware resources when deployed in real-world applications on edge devices. This approach addresses that concern by halving the embedding size and applying targeted techniques such as parameter pruning and quantization to shrink the memory footprint with minimal loss of accuracy. Experimental results include a 52% reduction in memory usage and a 33% decrease in execution time, resulting in better efficiency than state-of-the-art models. This work compared our…
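The abstract names parameter pruning and quantization without giving details here. As a hedged illustration of what those two steps can look like, the sketch below uses magnitude pruning and symmetric per-tensor int8 quantization in plain Python. Both choices, the function names, and the sample weights are assumptions made for illustration; the paper's actual pruning criterion and quantization scheme may differ.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices sorted by ascending |w|; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]


# Made-up weights, for illustration only.
weights = [0.5, -0.02, 0.9, 0.01, -0.7, 0.03]
pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale = quantize_int8(pruned)
restored = dequantize(codes, scale)
```

Storing `codes` as 8-bit integers plus a single `scale` takes roughly a quarter of the space of 32-bit floats, and the zeros introduced by pruning can be stored sparsely. The rounding error per weight is at most half of `scale`.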

Taxonomy

Topics: Embedded Systems Design Techniques · Real-time simulation and control systems · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Attention Dropout · Linear Layer · Softmax · Dense Connections · Linear Warmup With Linear Decay · Dropout · WordPiece · Adam