ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

Maryam Dialameh; Rezaul Karim; Hossein Rajabzadeh; Omar Mohamed Awad; Hyock Ju Kwon; Boxing Chen; Walid Ahmed; Yang Liu

arXiv:2505.17331 · cs.LG · June 24, 2025

TL;DR

ECHO-LLaMA is a LLaMA variant that shares KV caches across certain layers to make training and inference more efficient, reporting higher throughput and better hardware utilization while maintaining model performance.

Contribution

The paper introduces shared KV caching across certain layers of LLaMA models, reducing KV computational complexity and improving training speed and throughput without sacrificing accuracy.

Findings

01. Up to 77% higher token-per-second throughput during training
02. Up to 16% higher Model FLOPs Utilization (MFU)
03. Approximately 7% higher test-time throughput on the 1.1B model
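
For readers unfamiliar with the metric, MFU is commonly computed as the model FLOPs executed per second divided by the hardware's peak FLOPs per second. The following is a minimal sketch, assuming the widely used approximation of 6 FLOPs per parameter per training token (attention FLOPs ignored). The function name and all numbers are illustrative and do not reflect the paper's measurement setup.

```python
def model_flops_utilization(tokens_per_second: float,
                            num_params: float,
                            peak_flops_per_second: float) -> float:
    """Approximate training MFU as achieved model FLOPs/s over peak FLOPs/s."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward),
    # ignoring the attention term.
    flops_per_token = 6.0 * num_params
    return tokens_per_second * flops_per_token / peak_flops_per_second


# Made-up example: 1e9 parameters, 10,000 tokens/s, 1e14 peak FLOPs/s.
mfu = model_flops_utilization(10_000, 1e9, 1e14)
print(f"{mfu:.2f}")
```

Under this approximation, a throughput gain on fixed hardware raises MFU only if the FLOPs counted per token stay the same, so the two findings above need not move in lockstep for a model that skips some KV computation.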

Abstract

This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA models while maintaining their learning capacity. ECHO-LLaMA converts LLaMA models to share KV caches across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning…
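
The abstract does not spell out which layers share KV caches or how the sharing is wired. As a rough back-of-the-envelope sketch, assume (hypothetically) that every layer from index `share_from` onward reuses the K/V tensors of the layer just before it, and that K and V each have width `d_model` (plain multi-head attention, no grouped-query attention). The function name, parameters, and layer counts below are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCost:
    kv_layers: int    # layers that compute and store their own K/V
    proj_flops: int   # FLOPs for K and V projections over one sequence
    cache_elems: int  # K/V elements cached for one sequence


def kv_cost(num_layers: int, d_model: int, seq_len: int,
            share_from: Optional[int] = None) -> KVCost:
    """Estimate K/V projection FLOPs and cache size for one sequence.

    Hypothetical sharing scheme: layers with index >= share_from reuse the
    K/V of layer share_from - 1 instead of computing their own.
    """
    if share_from is None:
        kv_layers = num_layers
    else:
        if not 1 <= share_from <= num_layers:
            raise ValueError("share_from must be in [1, num_layers]")
        kv_layers = share_from
    # One projection is a (seq_len x d_model) @ (d_model x d_model) matmul,
    # counted as 2 FLOPs per multiply-add; each layer does two (K and V).
    flops_per_layer = 2 * (2 * seq_len * d_model * d_model)
    # Each layer that owns a cache stores K and V, seq_len x d_model each.
    elems_per_layer = 2 * seq_len * d_model
    return KVCost(kv_layers,
                  kv_layers * flops_per_layer,
                  kv_layers * elems_per_layer)


baseline = kv_cost(num_layers=8, d_model=64, seq_len=16)
shared = kv_cost(num_layers=8, d_model=64, seq_len=16, share_from=4)
print(shared.proj_flops / baseline.proj_flops)    # fraction of K/V projection work left
print(shared.cache_elems / baseline.cache_elems)  # fraction of K/V cache left
```

This counts only the K/V projections and cache entries. Attention scores, the MLP, and the cost of any adaptation step are left out, so the ratios say nothing about the end-to-end speedups quoted in the abstract.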
