LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

Xinting Jiang; Junyi Luo; Ruichen Qi; Kauna Lei; Ben Laurie; Gregory Kielian; Mehdi Saligane

arXiv:2605.17653·cs.LG·May 19, 2026

TL;DR

LLMForge is a hardware-aware NAS framework for edge language models. It expands the space of attention configurations, ranks candidates with a surrogate model, and models cost across multiple hardware backends, producing architectures tuned to different deployment goals.

Contribution

The paper introduces three components, Infinite-Head Attention (IHA), Forge-Former, and Forge-DSE, which together enable hardware-conditioned NAS for edge LLMs across multiple hardware substrates.

Findings

1. The search yields three deployment-aware variants, each targeting a different optimization goal.

2. The energy-optimized variant reduces energy per token by 40%.

3. The latency-optimized variant reduces time to first token (TTFT) and time per output token (TPOT) by 43%.

Abstract

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based…
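
To make the decoupling described for IHA concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible reading of the abstract: query heads, KV groups, the per-head query/key dimension, and the per-head value dimension are independent knobs, whereas grouped-query attention ties both head dimensions to d_model / n_q_heads. The names `AttnConfig`, `gqa_config`, and `count_configs`, the projection layout, and all numeric ranges are hypothetical illustrations; the paper's actual search-space ranges (and hence its approximately 400x figure) are not reproduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical per-layer attention configuration with decoupled dimensions."""
    d_model: int
    n_q_heads: int
    n_kv_groups: int
    d_qk: int  # per-head query/key dimension
    d_v: int   # per-head value dimension

    def is_valid(self) -> bool:
        # Assumption: each KV group serves an equal number of query heads.
        return self.n_q_heads % self.n_kv_groups == 0

    def projection_params(self) -> int:
        # Assumed layout: Q, K, V, and output projections, no biases.
        q = self.d_model * self.n_q_heads * self.d_qk
        k = self.d_model * self.n_kv_groups * self.d_qk
        v = self.d_model * self.n_kv_groups * self.d_v
        o = self.n_q_heads * self.d_v * self.d_model
        return q + k + v + o

    def kv_cache_elems_per_token(self) -> int:
        # One key vector and one value vector are cached per KV group.
        return self.n_kv_groups * (self.d_qk + self.d_v)


def gqa_config(d_model: int, n_q_heads: int, n_kv_groups: int) -> AttnConfig:
    """Standard GQA as a special case: both head dims equal d_model // n_q_heads."""
    head_dim = d_model // n_q_heads
    return AttnConfig(d_model, n_q_heads, n_kv_groups, head_dim, head_dim)


def count_configs(d_model, heads, groups, qk_dims, v_dims):
    """Count valid GQA configs versus decoupled configs over hypothetical ranges."""
    gqa = sum(
        1 for h, g in product(heads, groups)
        if h % g == 0 and d_model % h == 0
    )
    decoupled = sum(
        1 for h, g, _qk, _v in product(heads, groups, qk_dims, v_dims)
        if h % g == 0
    )
    return gqa, decoupled


if __name__ == "__main__":
    cfg = gqa_config(d_model=512, n_q_heads=8, n_kv_groups=2)
    print(cfg.projection_params(), cfg.kv_cache_elems_per_token())
    print(count_configs(512, [4, 8], [1, 2, 4], [32, 64], [32, 64, 128]))
```

Under these assumptions, each extra choice for the query/key or value dimension multiplies the number of per-layer configurations, and the KV-cache footprint per token depends only on the number of KV groups and the two head dimensions, which is one reason such knobs matter under tight memory-bandwidth budgets.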
