TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design
Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, and Heonjae Ha

TL;DR
TriGen is a novel NPU architecture designed for resource-limited devices, enabling efficient end-to-end large language model inference through software-hardware co-design, low-precision computation, and optimized scheduling.
Contribution
TriGen introduces a co-designed NPU architecture that combines microscaling (MX) low-precision computation, LUT-based acceleration of nonlinear operations, and scheduling techniques for LLM inference on constrained hardware.
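To make the idea of LUT-based nonlinear acceleration concrete, here is a minimal sketch of a lookup table with linear interpolation approximating the sigmoid function. This summary does not describe the paper's actual LUT scheme, so the table size, input range, interpolation method, and the helper names `build_lut` and `lut_eval` are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Reference nonlinear function that the table approximates."""
    return 1.0 / (1.0 + math.exp(-x))

def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points on [lo, hi] (hypothetical helper)."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)], step

def lut_eval(table, step, lo, x):
    """Evaluate the table at x with linear interpolation (hypothetical helper).

    Inputs outside the table range are clamped to its ends, which is
    reasonable for a saturating function such as sigmoid.
    """
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)
    pos = (x - lo) / step
    idx = min(int(pos), len(table) - 2)  # keep idx + 1 inside the table
    frac = pos - idx
    return table[idx] + frac * (table[idx + 1] - table[idx])

# Assumed configuration: 257 entries covering [-8, 8].
table, step = build_lut(sigmoid, -8.0, 8.0, 257)
approx = lut_eval(table, step, -8.0, 0.3)
```

The appeal of this style of approximation is that one table lookup, one multiply, and one add replace the exponential and division, so the same datapath can serve different nonlinear functions by swapping the table contents.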
Findings
Achieves 2.73x performance speedup over baseline
Reduces memory transfer by 52%
Maintains negligible accuracy loss
Abstract
Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant; their model sizes are growing rapidly while their degree of parameter reuse is low compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. First, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise from employing such precision. Second, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized…
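The abstract's reference to microscaling (MX) can be illustrated with a simplified sketch of block-wise quantization, in which a small block of values shares one power-of-two scale and each element is stored as a narrow integer. This is not TriGen's implementation: the block size of 32, the 8-bit integer element type, the rounding rule, and the function names `mx_quantize_block` and `mx_dequantize_block` are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE = 32  # assumed number of elements sharing one scale

def mx_quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers with a shared power-of-two scale.

    Returns (shared_exp, q) where each value is approximated by q[i] * 2**shared_exp.
    Simplified sketch; real MX formats also define low-precision float elements.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent such that max_abs / 2**shared_exp fits within qmax.
    shared_exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** shared_exp
    # Round to nearest integer and clamp to guard against floating-point edge cases.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, q

def mx_dequantize_block(shared_exp, q):
    """Reconstruct approximate values from the shared exponent and integers."""
    scale = 2.0 ** shared_exp
    return [x * scale for x in q]

# Example block of BLOCK_SIZE synthetic values.
block = [3.0 * math.sin(i) for i in range(BLOCK_SIZE)]
shared_exp, q = mx_quantize_block(block)
recon = mx_dequantize_block(shared_exp, q)
```

Because the scale is a power of two, applying it amounts to an exponent adjustment instead of a full multiply, and amortizing one scale over the block keeps the per-element storage cost close to the element bit width.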
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning in Materials Science
