An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2

Pepijn de Reus; Ana Oprescu; Jelle Zuidema

arXiv:2411.12758·cs.CL·November 21, 2024

TL;DR

This paper investigates how quantization and pruning affect energy consumption and inference time in StarCoder2, revealing trade-offs between efficiency and accuracy in model compression strategies.

Contribution

It provides an empirical analysis of quantization and pruning impacts on energy and performance in StarCoder2, highlighting challenges and future directions.

Findings

01 Quantization increases energy demand because throughput drops, and it causes some accuracy loss.

02 Pruning reduces energy consumption but impairs model performance.

03 Both compression strategies trade efficiency against accuracy.

Abstract

This study examines quantisation and pruning strategies to reduce energy consumption in code Large Language Models (LLMs) inference. Using StarCoder2, we observe increased energy demands with quantization due to lower throughput and some accuracy losses. Conversely, pruning reduces energy usage but impairs performance. The results highlight challenges and trade-offs in LLM model compression. We suggest future work on hardware-optimized quantization to enhance efficiency with minimal loss in accuracy.
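
The link the abstract draws between lower throughput and higher energy demand can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that energy per generated token is roughly the average power draw divided by throughput. The function name and all numbers are hypothetical and chosen for illustration only; they are not measurements or code from the paper.

```python
def energy_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Joules per generated token = average power (W) / throughput (tokens/s).

    Simplifying assumption: power draw is constant during generation.
    """
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return avg_power_watts / tokens_per_second


# Hypothetical values, NOT results from the paper.
baseline = energy_per_token(avg_power_watts=200.0, tokens_per_second=40.0)
quantised = energy_per_token(avg_power_watts=150.0, tokens_per_second=20.0)

print(f"baseline:  {baseline:.1f} J/token")   # 5.0 J/token
print(f"quantised: {quantised:.1f} J/token")  # 7.5 J/token
```

In this made-up scenario the quantised model draws less power, yet each token costs more energy because throughput halves. Under the stated assumption, a drop in throughput can outweigh a drop in power draw.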

Code & Models

Repositories

ana-oprescu/greenllms (official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Pruning