TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
Lanxiang Hu, Tajana Rosing, Hao Zhang

TL;DR
TrimLLM reduces the depth of large language models through progressive layer dropping, speeding up inference while retaining domain-specific performance, without relying on hardware-dependent compression techniques.
Contribution
TrimLLM introduces an LLM compression approach based on layer-wise specialization that delivers hardware-agnostic speedup while maintaining accuracy on domain-specific tasks.
Findings
Achieves 2.1-5.7x inference speedup on GPUs.
Maintains accuracy with 50-60% model compression.
Effective across various LLM sizes and domains.
Abstract
Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance while meeting latency and privacy constraints. However, conventional task-specific adaptation approaches do not show simultaneous memory saving and inference speedup at deployment time. Practical compression techniques like quantization and pruning require dedicated hardware or kernel support to achieve measured inference speedup. We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs. TrimLLM reduces the depth of LLMs via progressive layer dropping. We show it retains LLMs' capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks. We evaluated TrimLLM on LLMs of various sizes for inference; models adapted on medical, legal, and…
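The idea of reducing depth by progressive layer dropping can be illustrated with a minimal sketch. The abstract does not say how layers are chosen for removal, so the greedy rule below (drop whichever layer costs the least on a domain score, stop once the cost exceeds a tolerance) is an assumption made for illustration, not the authors' procedure. The names `progressive_layer_drop`, `score`, `target_depth`, and `max_score_drop` are hypothetical, and the "layers" are toy functions on floats standing in for transformer blocks.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def forward(layers: Sequence[Layer], x: float) -> float:
    """Apply the remaining layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def progressive_layer_drop(
    layers: Sequence[Layer],
    score: Callable[[Sequence[Layer]], float],
    target_depth: int,
    max_score_drop: float,
) -> List[Layer]:
    """Remove one layer per round using a greedy rule (an assumption)."""
    kept = list(layers)
    baseline = score(kept)
    while len(kept) > target_depth:
        # Try removing each layer in turn and keep the best-scoring candidate.
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        best = max(candidates, key=score)
        if baseline - score(best) > max_score_drop:
            break  # every further drop would cost too much domain score
        kept = best
        # A real pipeline would likely fine-tune on domain data at this point.
    return kept


# Toy model: two layers do real work, two are identities.
def add_one(x: float) -> float:
    return x + 1.0


def identity(x: float) -> float:
    return x


def double(x: float) -> float:
    return x * 2.0


full_model = [add_one, identity, double, identity]
reference = forward(full_model, 1.0)


def domain_score(layers: Sequence[Layer]) -> float:
    # Higher is better; 0.0 means the output matches the full model exactly.
    return -abs(forward(layers, 1.0) - reference)


trimmed = progressive_layer_drop(
    full_model, domain_score, target_depth=2, max_score_drop=0.0
)
```

In this toy setting the two identity layers are removed and the trimmed model reproduces the full model's output at half the depth. Because whole layers are removed rather than individual weights, the shorter model runs fewer sequential steps on any backend, which is the sense in which the speedup does not depend on special kernels.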
Taxonomy
Topics: Advanced Data Storage Technologies
Methods: Pruning
