TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference
Jan Klhufek, Alberto Marchisio, Vojtech Mrazek, Lukas Sekanina, Muhammad Shafique

TL;DR
TRAPTI is a two-stage framework that combines cycle-level inference simulation with time-resolved memory analysis to optimize SRAM banking and power gating in embedded transformer inference.
Contribution
It introduces a novel methodology for analyzing on-chip memory dynamics over time to improve energy efficiency in transformer hardware.
Findings
DeepSeek-R1-Distill-Qwen-1.5B has a 2.72x lower peak memory requirement than GPT-2 XL.
The analysis enables targeted power-gating and memory organization optimizations.
TRAPTI facilitates direct comparison of memory profiles across models.
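As a rough illustration of how a time-resolved occupancy trace could expose power-gating opportunities, the sketch below estimates the fraction of cycles each SRAM bank sits empty. It is not TRAPTI's actual method: the trace format, the function name `gated_fraction`, the bank size, and the assumption that live data is always packed into the lowest-numbered banks are all hypothetical simplifications.

```python
def gated_fraction(trace, bank_bytes, n_banks):
    """Return, per bank, the fraction of cycles in which the bank holds no data.

    trace: list of (duration_cycles, occupancy_bytes) phases (hypothetical format).
    Assumes data is packed contiguously into the lowest-numbered banks,
    so a phase with occupancy X keeps ceil(X / bank_bytes) banks active.
    """
    total_cycles = sum(duration for duration, _ in trace)
    idle_cycles = [0] * n_banks
    for duration, occupancy in trace:
        active = -(-occupancy // bank_bytes)  # integer ceiling division
        if active > n_banks:
            raise ValueError("occupancy exceeds total SRAM capacity")
        # Every bank above the active ones is a gating candidate in this phase.
        for bank in range(active, n_banks):
            idle_cycles[bank] += duration
    return [idle / total_cycles for idle in idle_cycles]


# Illustrative values only: three phases, four 32 KiB banks.
example_trace = [(100, 10_000), (50, 70_000), (150, 30_000)]
fractions = gated_fraction(example_trace, bank_bytes=32 * 1024, n_banks=4)
```

In this toy trace the last bank is never needed and the two middle banks are needed only during the short high-occupancy phase, which is the kind of pattern a banking and gating decision would look for. A realistic analysis would also have to account for wake-up latency and the energy cost of gating transitions, which this sketch ignores.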
Abstract
Transformer neural networks achieve state-of-the-art accuracy across language and vision tasks, but their deployment on embedded hardware is hindered by stringent area, latency, and energy constraints. During inference, performance and efficiency are increasingly dominated by the Key-Value (KV) cache, whose memory footprint grows with sequence length, straining on-chip memory utilization. Although existing mechanisms such as Grouped-Query Attention (GQA) reduce KV cache requirements compared to Multi-Head Attention (MHA), effectively exploiting this reduction requires understanding how on-chip memory demand evolves over time. This work presents TRAPTI, a two-stage methodology that combines cycle-level inference simulation with time-resolved analysis of on-chip memory occupancy to guide design decisions. In the first stage, the framework obtains memory occupancy traces and memory access…
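The relationship between KV cache size, sequence length, and the number of key/value heads can be made concrete with the standard sizing formula: two tensors (keys and values) per layer, each of shape sequence length by KV heads by head dimension. The sketch below uses made-up model dimensions, not the configurations of the models evaluated in the paper, and the function name `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The factor 2 covers the separate key and value tensors per layer.
    bytes_per_elem=2 assumes 16-bit storage (an assumption for this sketch).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical dimensions: 24 layers, 16 query heads, head_dim 64, 1024 tokens.
# MHA keeps one KV head per query head; GQA here shares each KV head among 4 query heads.
mha_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=64, seq_len=1024)
gqa_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=64, seq_len=1024)
reduction = mha_bytes / gqa_bytes
```

Under these assumed dimensions the cache shrinks in direct proportion to the number of KV heads, and both variants grow linearly with `seq_len`. This only bounds the final cache size; it says nothing about how occupancy evolves cycle by cycle, which is the question the time-resolved analysis addresses.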
