Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency
Mohammadjavad Mehditabar, Saurabhsingh Rajput, Antonio Mastropaolo, Tushar Sharma

TL;DR
This paper introduces BRACE, a framework for benchmarking large language models on code tasks, evaluating their energy efficiency and accuracy to guide sustainable and effective model selection.
Contribution
The paper presents a novel benchmarking framework, BRACE, with two rating methods for assessing energy efficiency and accuracy trade-offs in code language models.
Findings
Models perform better on code summarization than on code generation.
Model size does not significantly impact ratings.
BRACE enables evidence-based model selection balancing sustainability and performance.
Abstract
The rapid advancement of AI technologies and their accelerated adoption in software development necessitate a systematic evaluation of their environmental impact alongside functional correctness. While prior studies have examined sustainability in large language models, existing approaches lack systematic frameworks for evaluating accuracy-energy trade-offs in Code Language Models (CLMs). In this paper, we present a framework, BRACE, to benchmark CLMs on a unified scale of energy efficiency and functional correctness (referred to as accuracy). We benchmark 22 state-of-the-art models on code generation and summarization tasks, proposing two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). CIRC provides deterministic Euclidean-based rankings with static trade-offs that are robust to outliers, and OTER offers trend-aware evaluation…
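To make the idea of a Euclidean, concentric-circle rating concrete, here is a minimal Python sketch. It is not the authors' CIRC implementation: the abstract does not give the details, so the ideal point (1, 1), the five rating levels, the equal-width rings, and the function name `circ_style_rating` are all assumptions made for illustration. Both inputs are assumed to be already normalized to [0, 1], with higher meaning better.

```python
import math


def circ_style_rating(accuracy: float, energy_efficiency: float, levels: int = 5) -> int:
    """Return a rating from 1 (worst) to `levels` (best).

    Hypothetical sketch: the rating is determined by which concentric ring
    around the assumed ideal point (1, 1) the model falls into. Ring widths
    are equal, which is an assumption and not taken from the paper.
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= energy_efficiency <= 1.0):
        raise ValueError("inputs must be normalized to [0, 1]")

    # Euclidean distance from the ideal point (perfect accuracy, best efficiency).
    distance = math.hypot(1.0 - accuracy, 1.0 - energy_efficiency)

    # The farthest possible point is (0, 0), at distance sqrt(2).
    ring_width = math.sqrt(2.0) / levels

    # Index of the ring containing the point; clamp so (0, 0) stays in the last ring.
    ring = min(int(distance / ring_width), levels - 1)

    # Innermost ring gets the highest rating.
    return levels - ring


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from the paper.
    for acc, eff in [(1.0, 1.0), (0.5, 0.5), (0.0, 0.0)]:
        print(acc, eff, circ_style_rating(acc, eff))
```

Because the rating depends only on a fixed distance to a fixed point, the same model always receives the same rating regardless of which other models are in the benchmark, which is one way to read "deterministic" and "static trade-offs" in the abstract.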
Taxonomy
Topics: Green IT and Sustainability · Software Engineering Research · Software System Performance and Reliability
