EfficientLLM: Efficiency in Large Language Models

Zhengqing Yuan; Weixiang Sun; Yixin Liu; Huichi Zhou; Rong Zhou; Yiyang Li; Zheyuan Zhang; Wei Song; Yue Huang; Haolong Jia; Keerthiram Murugesan; Yu Wang; Lifang He; Jianfeng Gao; Lichao Sun; Yanfang Ye

arXiv:2505.13840 · cs.CL · May 21, 2025

Open Access

TL;DR

EfficientLLM provides a comprehensive empirical evaluation of various efficiency techniques for large language models, revealing trade-offs, task-dependent optima, and cross-modal generalization, to guide future model development and deployment.

Contribution

This study introduces the first extensive benchmark and empirical analysis of efficiency methods for large models across training, fine-tuning, and inference stages.

Findings

01 No single efficiency method is universally optimal.

02 Efficiency trade-offs depend on task and model scale.

03 Techniques generalize effectively across modalities.

Abstract

Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100…
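To make the attention variants named in the abstract concrete, the following is a minimal sketch of how the key-value cache shrinks when key/value heads are shared, as in GQA and MQA. It is illustrative arithmetic only, not the paper's metric definitions or measurement code. The function name `kv_cache_bytes` and every configuration value (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per element for float16) are assumptions chosen for the example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for a decoder-only transformer.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer. bytes_per_elem=2 corresponds to float16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, assumed for illustration only.
CONFIG = dict(n_layers=32, head_dim=128, seq_len=4096)

# Number of key/value heads per variant, for a model with 32 query heads:
# MHA keeps one KV head per query head, GQA shares each KV head across a
# group of query heads (here 8 KV heads), and MQA uses a single KV head.
KV_HEADS = {"MHA": 32, "GQA": 8, "MQA": 1}

sizes = {name: kv_cache_bytes(n_kv_heads=kv, **CONFIG) for name, kv in KV_HEADS.items()}

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Under these assumed settings the cache scales linearly with the number of key/value heads, which is one reason the memory savings of such variants depend on model shape and context length.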

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Mixture of Experts · Diffusion