EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

Jiyu Chen; Poh Seng Lim; Shuang Peng; Daxiong Luo; JungHau Foo; Yap Deep; Timothy Lee Jun Jie; Kelvin Teh Kae Wen; Fan Yang; Danyu Feng; Hao-Yun Chen; Peng-Wen Chen; Fangyuan Li; Xiaoxin Chen; Wong Wai Mun

arXiv:2508.00370 · cs.CL · August 7, 2025

TL;DR

EdgeInfinite-Instruct enhances large language model deployment on edge devices by combining efficient fine-tuning, instruction-following capabilities, and NPU-specific optimizations for long-sequence tasks.

Contribution

It introduces a Segmented Supervised Fine-Tuning strategy and NPU-focused deployment techniques to improve performance and efficiency on resource-constrained edge devices.

Findings

01 Improves long-sequence task performance on edge devices

02 Reduces computational costs with quantization and fixed-shape graphs

03 Maintains accuracy while enhancing efficiency on NPUs

Abstract

Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose…
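The two costs named at the start of the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the model shape (24 layers, 16 query heads, 8 KV heads, head dimension 64, 2-byte values) are hypothetical assumptions, not figures from the paper. It estimates how a standard KV cache grows linearly with sequence length while the attention-score work during prefill grows quadratically, ignoring causal masking and every non-attention cost.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate size of a standard KV cache for one sequence.

    Each layer stores two tensors (K and V) of shape
    seq_len x num_kv_heads x head_dim.
    """
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


def attention_prefill_flops(seq_len, num_layers, num_heads, head_dim):
    """Approximate FLOPs for the attention-score part of prefill.

    Per head and layer, Q @ K^T and the weighted sum over V each cost
    about 2 * seq_len^2 * head_dim FLOPs (causal masking ignored).
    """
    per_head = 2 * (2 * seq_len * seq_len * head_dim)
    return num_layers * num_heads * per_head


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    layers, heads, kv_heads, dim = 24, 16, 8, 64
    for n in (4_096, 8_192, 16_384, 32_768):
        cache_mib = kv_cache_bytes(n, layers, kv_heads, dim) / 2**20
        gflops = attention_prefill_flops(n, layers, heads, dim) / 1e9
        print(f"seq_len={n:>6}  kv_cache={cache_mib:8.1f} MiB  attn_prefill={gflops:10.1f} GFLOPs")
```

Doubling the sequence length doubles the cache estimate and quadruples the attention estimate. This is the scaling behaviour the abstract cites as the obstacle to long-sequence inference on edge devices.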
