T-MAN: Enabling End-to-End Low-Bit LLM Inference on NPUs via Unified Table Lookup

Jianyu Wei; Qingtao Li; Shijie Cao; Lingxiao Ma; Zixu Hao; Yanyong Zhang; Xiaoyan Hu; Ting Cao

arXiv:2511.11248 · cs.AR · November 17, 2025

Open Access

TL;DR

T-MAN uses a unified table lookup approach to run low-bit LLM inference end to end on NPUs, reporting a 1.4x prefill speedup, a 3.1x decoding speedup, and 84% energy savings compared to baseline methods.

Contribution

The paper proposes a novel unified table layout and tiling strategy that allows low-bit LLM inference to be performed entirely on NPUs, eliminating the need for CPU offloading.

Findings

01. 1.4x speedup in the prefill phase

02. 3.1x speedup in the decoding phase

03. 84% energy savings compared to baseline methods

Abstract

Large language models (LLMs) are increasingly deployed on customer devices. To support them, current devices are adopting SoCs (Systems on Chip) with NPUs (Neural Processing Units) installed. Although high performance is expected, LLM inference on NPUs is slower than its CPU counterpart. The reason is that NPUs perform poorly on computations other than GEMM, such as dequantization. Current works either disaggregate the two phases, running prefill on the NPU and decoding on the CPU, or run both on the NPU at the cost of accuracy. To solve this issue, based on the insight that low-bit quantization allows the target computation to be encoded in an acceptably sized table, we propose table lookup to subsume hardware operations that are otherwise unsupported. To realize this, we overcome the conflicting hardware behavior of prefill and decoding to design a unified table layout and tiling through (1) fused two-level table-based…
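To make the abstract's core insight concrete, here is a minimal Python sketch of why low-bit weights suit table lookup: a 4-bit code has only 16 possible values, so dequantization can be precomputed once and then read back by index. This is an illustration only, not the paper's fused two-level table or its tiling scheme. The function names, the nibble packing order, and the scale and zero-point values are all assumptions chosen for the example.

```python
# Hypothetical illustration: dequantize packed 4-bit weight codes with a
# 16-entry table. Names, packing order, and parameters are assumptions.

def build_dequant_table(scale, zero_point, bits=4):
    """Precompute the real value for every possible low-bit code."""
    return [scale * (code - zero_point) for code in range(1 << bits)]

def unpack_int4(packed):
    """Split each byte into two 4-bit codes, low nibble first (assumed order)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble
        codes.append(byte >> 4)    # high nibble
    return codes

def dequantize_by_lookup(packed, table):
    """Replace per-element arithmetic with a table read per code."""
    return [table[code] for code in unpack_int4(packed)]

table = build_dequant_table(scale=0.5, zero_point=8)
weights = dequantize_by_lookup(bytes([0x80, 0xF7]), table)
print(len(table), weights)
```

The table size grows as 2^bits, which is why the approach stays practical at low bit widths: 16 entries at 4 bits and 4 entries at 2 bits, versus 65,536 at 16 bits.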

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis