CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge

Chunlin Tian; Xinpeng Qin; Kahou Tam; Li Li; Zijian Wang; Yuanzhe Zhao; Minglei Zhang; Chengzhong Xu

arXiv:2506.02847 · cs.AR · June 4, 2025

TL;DR

CLONE is a co-designed algorithm-hardware approach that enables efficient, low-latency, and energy-aware deployment of large language models on edge devices, balancing performance, power, and accuracy.

Contribution

This work introduces CLONE, an algorithm-hardware co-design framework tailored for scalable edge hardware that optimizes LLM inference for both real-time latency and energy efficiency.

Findings

01. Accelerates LLM inference by up to 11.92x

02. Reduces energy consumption by up to 7.36x

03. Maintains high-quality language generation

Abstract

Deploying large language models (LLMs) on edge devices is crucial for delivering fast responses and ensuring data privacy. However, the limited storage, weight, and power of edge devices make it difficult to deploy LLM-powered applications. These devices must balance latency requirements against energy consumption and model accuracy. In this paper, we first quantify the challenges of deploying LLMs on off-the-shelf edge devices and then present CLONE, an in-depth algorithm-hardware co-design at both the model and system levels that integrates real-time and energy optimization while maintaining robust generality. To maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we design a specialized 28nm scalable hardware accelerator system. We implement and extensively evaluate CLONE on two off-the-shelf edge platforms.…

Taxonomy

Topics: Artificial Intelligence in Healthcare