Compiling Code LLMs into Lightweight Executables
Jieke Shi, Junda He, Zhou Yang, Chengran Yang, Mykhailo Klymenko, Thong Hoang (James), Xiwei Xu (Sherry), Zhenchang Xing, and David Lo

TL;DR
Ditto is a framework that compresses and compiles large language models into lightweight executables, enabling efficient local deployment on commodity hardware with minimal accuracy loss.
Contribution
Ditto combines quantization with LLVM-based compilation to optimize Code LLMs for local execution on resource-constrained devices.
Findings
Achieves up to 10.5× faster inference
Reduces memory usage by 6.4×
Lowers energy consumption by 10.5×
Abstract
The demand for better prediction accuracy and higher execution performance in neural networks continues to grow. The emergence and success of Large Language Models (LLMs) have produced many cloud-based tools for software engineering tasks such as code suggestion. Although effective, cloud deployment raises concerns over privacy, latency, and reliance on network connectivity. Running LLMs locally on personal devices such as laptops would address these issues, because it enables offline use and reduces response time. However, local deployment is challenging, since commodity devices lack high-performance accelerators such as GPUs and are constrained by limited memory and compute capacity, which makes it hard to execute large models efficiently. We present Ditto, a framework that optimizes both the model size of Code LLMs and the inference programs that execute them. Our approach…
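The abstract is cut off before it describes how Ditto shrinks the model, so the following is only a generic illustration of what weight quantization means, not Ditto's actual scheme. It is a minimal sketch of symmetric per-tensor 8-bit quantization; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integer codes in [-127, 127] with one shared scale.

    Assumption: symmetric per-tensor quantization, chosen for illustration only.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # The scale maps the largest magnitude onto 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, used only to exercise the two functions.
weights = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each code fits in one byte, whereas a 32-bit float weight takes four, which is where the memory saving of this kind of scheme comes from. The cost is a per-weight rounding error of at most half the scale.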
