TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

Jonghun Lee; Junghoon Lee; Hyeonjin Kim; Seoho Jeon; Jisup Yoon; Hyunbin Park; Meejeong Park; and Heonjae Ha

arXiv:2602.12962·cs.AR·February 16, 2026

Open Access

TL;DR

TriGen is a novel NPU architecture designed for resource-limited devices, enabling efficient end-to-end large language model inference through software-hardware co-design, low-precision computation, and optimized scheduling.

Contribution

It introduces a co-designed NPU architecture with microscaling, LUT-based nonlinear operation acceleration, and scheduling techniques for LLM inference on constrained hardware.
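As a rough illustration of what LUT-based evaluation of a nonlinear operation can look like, the sketch below approximates the sigmoid function with a precomputed table and linear interpolation. This is a generic technique, not TriGen's actual design: the table size, input range, choice of sigmoid, and the helper names `build_lut` and `lut_eval` are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Reference implementation used only to fill the table."""
    return 1.0 / (1.0 + math.exp(-x))

def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]
    return table, lo, step

def lut_eval(table, lo, step, x):
    """Approximate fn(x) by linear interpolation between table entries.

    Inputs outside the table range are clamped to the nearest end,
    which is only reasonable for saturating functions such as sigmoid.
    """
    pos = (x - lo) / step
    pos = max(0.0, min(float(len(table) - 1), pos))
    i = min(int(pos), len(table) - 2)   # index of the left neighbour
    frac = pos - i                      # fractional position in [0, 1]
    return table[i] + frac * (table[i + 1] - table[i])

# Hypothetical configuration: 257 entries over [-8, 8].
TABLE, LO, STEP = build_lut(sigmoid, -8.0, 8.0, 257)

def lut_sigmoid(x):
    return lut_eval(TABLE, LO, STEP, x)
```

With this configuration the table replaces the exponential and division with one lookup, one multiply, and two adds per element. A hardware design would typically hold the table in fixed point and pick the range and entry count per function.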

Findings

01. Achieves 2.73x performance speedup over baseline

02. Reduces memory transfer by 52%

03. Maintains negligible accuracy loss

Abstract

Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant, with rapidly growing model sizes but a low degree of parameter reuse compared with conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. First, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise from employing such precision. Second, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized…
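To make the microscaling idea concrete, the following is a minimal sketch of block-wise quantization in which a group of values shares one power-of-two scale and each value is stored as a small signed integer. It is a simplified illustration and is not bit-exact to any MX format or to TriGen's implementation: the block size of 32, the 8-bit element width, the scale-selection rule, and the function names are assumptions.

```python
import math

BLOCK_SIZE = 32  # assumed block size for this sketch

def mx_quantize_block(values, elem_bits=8):
    """Quantize one block to (shared exponent, list of signed integers).

    The shared scale is 2**exp, chosen as the smallest power of two
    that keeps the largest magnitude within the integer range.
    """
    qmax = 2 ** (elem_bits - 1) - 1          # 127 for 8-bit elements
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Clamp after rounding to guard against floating-point edge cases.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exp, q

def mx_dequantize_block(exp, q):
    """Reconstruct approximate real values from a quantized block."""
    scale = 2.0 ** exp
    return [x * scale for x in q]

def mx_quantize(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each independently."""
    return [mx_quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]
```

Because each block carries its own scale, an outlier only reduces the resolution of the values that share its block, not of the whole tensor. The per-element reconstruction error in this sketch is at most half the block's scale.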

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning in Materials Science