TL;DR
ECHO-LLaMA is a LLaMA variant that shares KV caches across certain layers, improving training and inference efficiency (up to 77% higher training throughput and up to 16% higher MFU) while maintaining model performance.
Contribution
It introduces shared KV caching across certain layers of LLaMA models, reducing the cost of KV computation and improving training speed and throughput without sacrificing accuracy.
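To make the idea of sharing KV caches across layers concrete, here is a minimal NumPy sketch. It is a toy illustration under stated assumptions, not the paper's implementation: attention is single-head, there are no norms or MLP blocks, and the `kv_source` mapping (which layers reuse which layer's keys and values) is a hypothetical choice, since this summary does not say which layers ECHO-LLaMA shares.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    # q, k, v have shape (seq_len, d); each position attends to itself and earlier positions.
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def make_layers(n_layers, d, seed=0):
    # Random toy weights; a real model would load trained parameters.
    rng = np.random.default_rng(seed)
    names = ("wq", "wk", "wv", "wo")
    return [{n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}
            for _ in range(n_layers)]

def forward(x, layers, kv_source):
    """Run the toy stack. kv_source[i] is the layer whose K/V layer i uses;
    kv_source[i] == i means layer i computes (and caches) its own K/V."""
    kv_cache = {}
    for i, layer in enumerate(layers):
        src = kv_source[i]
        assert src <= i, "a layer can only reuse K/V from itself or an earlier layer"
        if src == i:
            # Only "owner" layers pay for the K and V projections.
            kv_cache[i] = (x @ layer["wk"], x @ layer["wv"])
        k, v = kv_cache[src]
        q = x @ layer["wq"]  # every layer still computes its own queries
        x = x + causal_attention(q, k, v) @ layer["wo"]
    return x, kv_cache

# Hypothetical sharing pattern: layers 2 and 3 reuse the K/V of layer 1.
layers = make_layers(n_layers=4, d=8)
x = np.random.default_rng(1).normal(size=(5, 8))
out, cache = forward(x, layers, kv_source=[0, 1, 1, 1])
print(out.shape, sorted(cache))
```

In this sketch only two of the four layers compute and store keys and values, so both the K/V projection work and the cache size are halved for this particular (assumed) pattern.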
Findings
Up to 77% higher tokens-per-second throughput during training
Up to 16% higher Model FLOPs Utilization (MFU)
Up to 14% lower loss when trained on an equal number of tokens
Approximately 7% higher test-time throughput on the 1.1B model
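For readers unfamiliar with MFU, the sketch below shows one common way to estimate it from training throughput. It assumes the widely used rule of thumb of roughly 6 FLOPs per parameter per token for a forward plus backward pass (attention FLOPs ignored); the paper's exact FLOP accounting is not given in this summary. The function name and all input numbers are hypothetical and are not results from the paper.

```python
def model_flops_utilization(tokens_per_second, n_params, peak_flops_per_second):
    """Estimate MFU as achieved model FLOPs/s divided by hardware peak FLOPs/s.

    Uses the ~6 * n_params FLOPs-per-token approximation for training
    (forward + backward), which ignores attention FLOPs.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second

# Hypothetical inputs: a 1.1B-parameter model at 20,000 tokens/s on one
# accelerator with a 312 TFLOP/s peak (illustrative values only).
mfu = model_flops_utilization(
    tokens_per_second=20_000,
    n_params=1.1e9,
    peak_flops_per_second=312e12,
)
print(f"MFU = {mfu:.1%}")
```

Because MFU scales with tokens per second for a fixed model and device, throughput gains and MFU gains are closely linked, although an architecture that changes the FLOPs per token (as KV sharing does) shifts the numerator as well.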
Abstract
This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining their learning capacity. ECHO-LLaMA converts LLaMA models to share KV caches across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher tokens-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning…