LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

TL;DR
LLMForge is a hardware-aware NAS framework for edge language models. It expands the space of attention configurations and combines surrogate ranking with multi-backend cost modeling to find efficient architectures tailored to different hardware targets.
Contribution
The paper introduces Infinite-Head Attention, Forge-Former, and Forge-DSE, which together enable hardware-conditioned NAS for edge LLMs across multiple hardware substrates.
Findings
The search yields three deployment-aware variants, each with a different optimization goal.
The energy-optimized variant reduces energy per token by 40%.
The latency-optimized variant reduces time to first token (TTFT) and time per output token (TPOT) by 43%.
Abstract
Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based…
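To make the decoupling described in the abstract concrete, the following is a minimal NumPy sketch of causal attention in which the number of query heads, the number of KV groups, the per-head query/key dimension, and the per-head value dimension are all independent arguments. It is an illustration under stated assumptions, not the authors' implementation: the function names `init_weights` and `decoupled_attention`, the requirement that query heads divide evenly into KV groups, and the weight layout are all hypothetical choices made for this example.

```python
import numpy as np


def init_weights(rng, d_model, n_q_heads, n_kv_groups, d_qk, d_v):
    """Random projection matrices for the sketch (hypothetical layout)."""
    scale = d_model ** -0.5
    wq = rng.standard_normal((d_model, n_q_heads * d_qk)) * scale
    wk = rng.standard_normal((d_model, n_kv_groups * d_qk)) * scale
    wv = rng.standard_normal((d_model, n_kv_groups * d_v)) * scale
    wo = rng.standard_normal((n_q_heads * d_v, d_model)) * scale
    return wq, wk, wv, wo


def decoupled_attention(x, wq, wk, wv, wo, n_q_heads, n_kv_groups, d_qk, d_v):
    """Causal attention with independent head count, KV groups, and head dims.

    Assumption for this sketch: n_q_heads is a multiple of n_kv_groups, so
    each KV group is shared by an equal number of query heads.
    """
    seq_len, _ = x.shape
    assert n_q_heads % n_kv_groups == 0
    heads_per_group = n_q_heads // n_kv_groups

    # Project and split into heads: (heads, seq_len, per-head dim).
    q = (x @ wq).reshape(seq_len, n_q_heads, d_qk).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq_len, n_kv_groups, d_qk).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq_len, n_kv_groups, d_v).transpose(1, 0, 2)

    # Share each KV group across its query heads.
    k = np.repeat(k, heads_per_group, axis=0)
    v = np.repeat(v, heads_per_group, axis=0)

    # Scaled dot-product scores with a causal mask.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_qk)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values, then merge heads and project back to d_model.
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, n_q_heads * d_v)
    return out @ wo


def kv_cache_floats_per_token(n_kv_groups, d_qk, d_v):
    """Per-layer KV-cache entries stored for one token."""
    return n_kv_groups * (d_qk + d_v)
```

In this sketch neither `n_q_heads * d_qk` nor `n_q_heads * d_v` has to equal the model width, and the per-token KV-cache footprint depends only on `n_kv_groups`, `d_qk`, and `d_v`. That independence is why a search over these four knobs can trade memory traffic against compute differently for each hardware target.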
