LLMs on a Budget? Say HOLA

Zohaib Hasan Siddiqui; Jiechao Gao; Ebad Shabbir; Mohammad Anas Azeez; Rafiq Ali; Gautam Siddharth Kashyap; Usman Naseem

arXiv:2506.18952 · cs.LG · October 10, 2025

TL;DR

HOLA is an end-to-end framework that combines hierarchical speculative decoding, adaptive retrieval, and structured pruning with quantization to run large language models efficiently on edge devices, reducing inference latency and memory use without sacrificing accuracy.
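Speculative decoding, the family that HOLA's Hierarchical Speculative Decoding (HSD) builds on, follows a draft-then-verify pattern. A cheap model proposes several tokens, and the full model keeps only those it would have produced itself. The sketch below is a minimal greedy version of that generic idea, not HOLA's hierarchical method. It assumes hypothetical toy "models" that are plain callables returning the next token for a prefix. For readability it verifies token by token; a real system scores all drafted positions in a single batched forward pass, which is where the speedup comes from.

```python
from typing import Callable, List

# Hypothetical toy "model": maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding (generic, not HOLA's hierarchical HSD)."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks the proposal left to right (a real system
        #    scores all k positions in one batched forward pass).
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token was accepted: the target adds one bonus token.
            out.append(target(out))
    return out[:len(prompt) + max_new]
```

Every emitted token is the one the target would have chosen greedily, so the output matches plain greedy decoding with the target. The draft model only affects how many tokens each verification round can accept. The draft length `k` trades wasted drafting after a mismatch against fewer verification rounds.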

Contribution

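HOLA introduces an integrated, end-to-end optimization approach for deploying LLMs on edge devices. It combines a new decoding scheme (Hierarchical Speculative Decoding), adaptive retrieval (AdaComp-RAG), and compression through pruning and quantization (LoBi) to improve efficiency.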
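The abstract describes LoBi only at a high level, so its exact recipe is not reproduced here. As a generic illustration of the two ingredients it names, the sketch below applies magnitude-based structured pruning to a weight matrix stored as nested lists. It drops whole output rows with the smallest L2 norm, then applies symmetric per-row int8 quantization. The function names, the keep ratio, and the row-norm criterion are hypothetical choices, not HOLA's.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def prune_rows(w: Matrix, keep_ratio: float) -> Tuple[Matrix, List[int]]:
    """Structured pruning: keep the rows (output neurons) with the largest L2 norm."""
    n_keep = max(1, round(len(w) * keep_ratio))
    norms = [math.sqrt(sum(x * x for x in row)) for row in w]
    ranked = sorted(range(len(w)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve original row order
    return [w[i] for i in kept], kept


def quantize_rows_int8(w: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row int8: q = round(x / scale), with scale = max|x| / 127."""
    qs, scales = [], []
    for row in w:
        scale = max(abs(x) for x in row) / 127 or 1.0  # all-zero row: any scale works
        qs.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return qs, scales


def dequantize(qs: List[List[int]], scales: List[float]) -> Matrix:
    """Recover approximate float weights: x ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(qs, scales)]
```

For example, `prune_rows([[0.1, -0.2], [3.0, 4.0], [0.0, 0.0], [1.0, -1.0]], 0.5)` keeps rows 1 and 3. Pruning whole rows shrinks the matrix itself, which dense hardware can exploit directly. Quantization stores each remaining weight in 8 bits plus one scale per row, with a reconstruction error of at most half a quantization step per weight.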

Findings

1. 17.6% exact-match accuracy (EMA) improvement on GSM8K
2. 10.5% multiple-choice accuracy (MCA) improvement on ARC
3. Reduced latency and memory use on the NVIDIA Jetson Nano

Abstract

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, posing a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano, proving both scalable and…
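This summary does not describe AdaComp-RAG's actual complexity estimator. The sketch below only illustrates the general idea of adaptive retrieval depth, and every choice in it is hypothetical. A crude heuristic based on query length and cue words scores the query. Easy queries skip retrieval entirely, while harder ones pull in more passages from a toy word-overlap retriever. A real system would more likely use a learned classifier and a dense or BM25 retriever.

```python
from typing import List

CUE_WORDS = ("and", "because", "why", "how", "compare")  # hypothetical reasoning cues


def complexity_score(query: str) -> float:
    """Hypothetical heuristic in [0, 1]: longer queries and reasoning cue words score higher."""
    words = query.lower().split()
    cues = sum(words.count(w) for w in CUE_WORDS)
    return min(1.0, len(words) / 30 + 0.2 * cues)


def choose_k(score: float, k_max: int = 8, threshold: float = 0.2) -> int:
    """Map complexity to retrieval depth; easy queries skip retrieval (k = 0)."""
    if score < threshold:
        return 0
    return max(1, round(score * k_max))


def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """Toy lexical retriever: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]


def build_prompt(query: str, corpus: List[str]) -> str:
    """Prepend only as much retrieved context as the query appears to need."""
    k = choose_k(complexity_score(query))
    return "\n".join(retrieve(query, corpus, k) + [query])
```

For example, a short question such as "What is 2+2?" scores below the threshold. It is sent to the model with no retrieved context, which saves both the retrieval call and the longer prompt.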
